API¶
This part of the documentation lists the full API reference of all public classes and functions.
Data Splitting¶
- class preprocessy.data_splitting.Split¶
Class for splitting the dataset into train and test sets
- train_test_split(params)¶
Performs train test split on the input data
- Parameters
train_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
test_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label.
target_label (str) – Name of the Target Column.
test_size (float, int) – Size of the test set after splitting. A float between 0 and 1 is interpreted as a fraction of the dataset; an integer between 0 and the number of samples is interpreted as an absolute number of samples. Complementary to train_size.
train_size (float, int) – Size of the train set after splitting. A float between 0 and 1 is interpreted as a fraction of the dataset; an integer between 0 and the number of samples is interpreted as an absolute number of samples. Complementary to test_size.
shuffle (bool, default = False) – Decides whether to shuffle data before splitting.
random_state (int) – Seeding to be provided for shuffling before splitting.
The function inserts the following into params:
If target_label is provided
X_train : pandas.core.frames.DataFrame
y_train : pandas.core.series.Series
X_test : pandas.core.frames.DataFrame
y_test : pandas.core.series.Series
Else
train : pandas.core.frames.DataFrame
test : pandas.core.frames.DataFrame
- Raises
ValueError – Raised if the target column does not have a name property.
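The split semantics described above can be sketched as follows. This is an illustrative re-implementation of the documented behaviour, not the preprocessy source; the helper name `train_test_split_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd

def train_test_split_sketch(df, target_label=None, test_size=0.2,
                            shuffle=False, random_state=None):
    """Sketch of the documented Split.train_test_split behaviour."""
    indices = np.arange(len(df))
    if shuffle:
        np.random.default_rng(random_state).shuffle(indices)
    # A float test_size is a fraction of the dataset; an int is a sample count
    n_test = int(round(len(df) * test_size)) if isinstance(test_size, float) else test_size
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    if target_label is None:
        # Without a target label, whole dataframes are inserted into params
        return {"train": train, "test": test}
    # With a target label, features and target are separated
    return {
        "X_train": train.drop(columns=[target_label]),
        "y_train": train[target_label],
        "X_test": test.drop(columns=[target_label]),
        "y_test": test[target_label],
    }

df = pd.DataFrame({"a": range(10), "label": [0, 1] * 5})
params = train_test_split_sketch(df, target_label="label", test_size=0.3)
```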
- class preprocessy.data_splitting.KFold¶
Class for splitting input data into K-folds. Split dataset into K consecutive folds (without shuffling by default).
Each fold is then used once as a validation set while the k - 1 remaining folds form the training set.
- get_n_splits()¶
Returns the number of splitting iterations in the cross-validator
- Returns
Returns the number of splitting iterations in the cross-validator.
- Return type
int
- split(params)¶
Generate indices to split data into training and test set.
- Parameters
train_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
test_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label.
n_splits (int, default = 5) – Number of folds. Must be at least 2
shuffle (bool, default = False) – Whether to shuffle the data before splitting into folds
random_state (int, default = None) – Random state used for shuffling
- Yields
train_indices (ndarray) – The training set indices for that split
test_indices (ndarray) – The testing set indices for that split
- Raises
ValueError – If n_splits > n_samples, a ValueError is raised.
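The fold generation described above can be sketched with NumPy alone; this is an illustration of the documented behaviour (K consecutive folds, each used once as the test set), not the preprocessy source, and the function name is hypothetical.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, shuffle=False, random_state=None):
    """Yield (train_indices, test_indices) pairs over K consecutive folds."""
    if n_splits > n_samples:
        raise ValueError("n_splits cannot exceed n_samples")
    indices = np.arange(n_samples)
    if shuffle:
        np.random.default_rng(random_state).shuffle(indices)
    # Distribute samples as evenly as possible; the remainder goes to the first folds
    fold_sizes = np.full(n_splits, n_samples // n_splits)
    fold_sizes[: n_samples % n_splits] += 1
    start = 0
    for size in fold_sizes:
        test_idx = indices[start : start + size]
        # The k - 1 remaining folds form the training set
        train_idx = np.concatenate([indices[:start], indices[start + size:]])
        yield train_idx, test_idx
        start += size

splits = list(kfold_indices(10, n_splits=5))
```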
Encoding¶
- class preprocessy.encoding.Encoder¶
Class to encode categorical and ordinal features. Categorical encoding options include: normal and one-hot
- encode(params)¶
Function to encode categorical or ordinal columns.
- Parameters
train_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
test_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label.
target_label (str) – Name of the Target Column. This parameter is needed to ensure that the target column isn’t identified as categorical and encoded.
cat_cols (list) – List containing the column names to be encoded categorically
ord_dict (dict) – Dictionary where each key is the name of a column to be encoded ordinally and the corresponding value is a dictionary containing the mapping.
one_hot (bool) – Indicates whether the user wants to encode using one-hot.
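The three encoding modes named above (ordinal via a mapping dictionary, "normal" label codes, and one-hot) can be illustrated with plain pandas; this is a sketch of the concepts, not the Encoder internals.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["S", "M", "L", "M"],               # ordinal column
    "color": ["red", "blue", "red", "green"],   # categorical column
})

# Ordinal encoding: an explicit mapping, as in the ord_dict parameter above
ord_dict = {"size": {"S": 0, "M": 1, "L": 2}}
for col, mapping in ord_dict.items():
    df[col] = df[col].map(mapping)

# "Normal" categorical encoding: integer label codes
df["color_normal"] = df["color"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")
df = pd.concat([df.drop(columns=["color"]), one_hot], axis=1)
```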
Feature Selection¶
- class preprocessy.feature_selection.SelectKBest¶
Class for finding K highest scoring features among the set of all features. Takes a feature and finds its correlation with the target label using a scoring function.
Scoring functions include:
f_regression
mutual_info_regression
f_classif
mutual_info_classif
chi2
All scoring functions are provided in sklearn.feature_selection.
- fit(params)¶
Function that fits the scoring function over (X_train, y_train) and generates the scores and pvalues for all features with the target label. If no scoring function is passed, defaults to f_classif or f_regression based on the predictive problem.
- Parameters
train_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
test_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
target_label (str) – Name of the Target Column.
score_func (callable) – Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. score_func can be custom or used from sklearn.feature_selection
k (int) – Number of top features to select.
- Raises
TypeError – The scoring function should be a callable.
- fit_transform(params)¶
Performs fit() and transform() in a single step
- Parameters
train_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
test_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
target_label (str) – Name of the Target Column.
score_func (callable) – Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. score_func is provided from sklearn.feature_selection
k (int) – Number of top features to select.
- Raises
TypeError – The scoring function should be a callable.
ValueError – No features are selected when k = 0
ValueError – After performing fit(), a ValueError is raised if the length of the mask and the number of train_df or test_df columns do not match.
- transform(params)¶
Function to reduce train_df and test_df to the selected features. Adds dataframes of shape (n_samples, k) to params
- Parameters
train_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
test_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
target_label (str) – Name of the Target Column.
score_func (callable) – Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. score_func is provided from sklearn.feature_selection
k (int) – Number of top features to select.
- Raises
ValueError – No features are selected when k = 0
ValueError – After performing fit(), a ValueError is raised if the length of the mask and the number of train_df or test_df columns do not match.
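The fit/transform flow above (score every feature against the target with score_func, keep the k highest scorers) can be sketched without sklearn to stay self-contained; the function names here are hypothetical and the score function is a simple absolute-correlation stand-in, not one of the sklearn scorers.

```python
import numpy as np
import pandas as pd

def select_k_best(train_df, target_label, score_func, k):
    """Sketch of the documented SelectKBest flow: score features, keep top k."""
    if not callable(score_func):
        raise TypeError("The scoring function should be a callable.")
    if k == 0:
        raise ValueError("No features are selected when k = 0")
    X = train_df.drop(columns=[target_label])
    y = train_df[target_label]
    scores = score_func(X.to_numpy(), y.to_numpy())
    top_k = np.argsort(scores)[-k:]           # indices of the k highest scores
    return X.iloc[:, np.sort(top_k)]          # reduced to shape (n_samples, k)

def abs_correlation(X, y):
    # Hypothetical custom score_func: |Pearson correlation| per column
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

df = pd.DataFrame({"noise": [3, 1, 4, 1, 5],
                   "signal": [1, 2, 3, 4, 5],
                   "y": [2, 4, 6, 8, 10]})
reduced = select_k_best(df, target_label="y", score_func=abs_correlation, k=1)
```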
Input¶
- class preprocessy.input.Reader¶
Standard Reader class that serves to read and load numeric data into a pandas dataframe.
The file extensions allowed are: .csv, .xls, .xlsx, .xlsm, .xlsb, .odf, .ods, .odt
- read_file(params)¶
Function that takes the train and test dataframe paths and loads them into pandas dataframes
- Parameters
train_df_path (str) – Path that points to the train dataset (extension can be any of those listed above). Should not be None.
test_df_path (str) – Path that points to the test dataset (extension can be any of those listed above).
Null Values¶
- class preprocessy.missing_data.NullValuesHandler¶
- execute(params)¶
Function that handles null values in the supplied dataframe and returns a new dataframe. If no user parameters are supplied, the rows containing null values are dropped by default.
- Parameters
train_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
test_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label.
cat_cols (list) – List containing the names of categorical columns
replace_cat_nulls (str) – The value which will replace null values in the categorical columns
drop_cols (list) – List of column names of columns to be dropped
fill_missing (dict) – Dictionary of format {“method”: [col_list]} to indicate the method (mean/median) to be applied on the specified col_list
fill_values (dict) – Column and value mapping, where the key is the column name and the value is the custom value to be filled in place of null values
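The parameter behaviours above (default row dropping, mean/median filling via `fill_missing`, and a replacement value for categorical nulls) can be illustrated with plain pandas; this is a sketch of the documented options, not the NullValuesHandler internals.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, None, 40.0, None],
    "city": ["NY", None, "LA", "NY"],
})

# Default behaviour: rows containing null values are dropped
dropped = df.dropna()

# fill_missing: {"method": [col_list]} -> apply mean or median per column
fill_missing = {"mean": ["age"]}
filled = df.copy()
for method, cols in fill_missing.items():
    for col in cols:
        stat = filled[col].mean() if method == "mean" else filled[col].median()
        filled[col] = filled[col].fillna(stat)

# replace_cat_nulls: a single value replacing nulls in categorical columns
filled["city"] = filled["city"].fillna("unknown")
```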
Outliers¶
- class preprocessy.outliers.HandleOutlier¶
Class for handling outliers automatically or according to the user's needs. Outliers are handled using percentiles: two percentile markers delimit the data to be kept, e.g. if one marker is the 5th percentile and the other is the 95th percentile, only the data points lying between the two corresponding values are kept.
- handle_outliers(params)¶
This function handles outliers and is flexible in how the percentiles are calculated and what is done with the outliers.
- Parameters
train_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
test_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
target_label (str) – Name of the Target Column. This parameter is needed to ensure that the target column isn’t included in the outlier removing process.
cat_cols (list) – List containing the column names to be encoded categorically. This parameter is needed to ensure that the categorical columns aren't included in the outlier removing process.
remove_outliers (bool, default=True) – Boolean value to indicate whether the user wants to remove outliers or not.
ord_cols (list) – List containing the column names to be encoded ordinally. This parameter is needed to ensure that the ordinal columns aren't included in the outlier removing process.
replace (bool, default=False) – Boolean value to indicate if the outliers need to be replaced. This will replace the outliers with -999 and will not remove them.
first_quartile (float) – Float value < 1 representing the first percentile marker.
third_quartile (float) – Float value < 1 representing the other percentile marker.
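The percentile logic above (keep rows between the two markers, or replace the outliers with -999 when `replace=True`) can be sketched with pandas; this is an illustration of the documented behaviour, not the preprocessy source, and the helper name is hypothetical.

```python
import pandas as pd

def handle_outliers_sketch(df, col, first_quartile=0.05, third_quartile=0.95,
                           remove_outliers=True, replace=False):
    """Keep values between the two percentile markers, or replace outliers."""
    low = df[col].quantile(first_quartile)
    high = df[col].quantile(third_quartile)
    outside = (df[col] < low) | (df[col] > high)
    if replace:
        out = df.copy()
        out.loc[outside, col] = -999   # replace outliers instead of removing
        return out
    if remove_outliers:
        return df[~outside]
    return df

df = pd.DataFrame({"x": [1, 2, 3, 4, 1000]})
trimmed = handle_outliers_sketch(df, "x", first_quartile=0.1, third_quartile=0.9)
```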
Scaling¶
- class preprocessy.scaling.Scaler¶
- execute(params)¶
Method for scaling the columns in a dataset
- Parameters
train_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label. Should not be None
test_df (pandas.core.frames.DataFrame) – Input dataframe, may or may not consist of the target label.
type ("MinMaxScaler" | "BinaryScaler" | "StandardScaler") – The type of Scaler to be used
columns (list) – List of columns in the dataframe
cat_cols (list) – List containing the names of categorical columns
target_label (str) – Name of the Target Column. This parameter is needed to ensure that the target column isn’t scaled
is_combined (bool) – Parameter to determine whether columns should be scaled together as a group
threshold (dict) – Dictionary of threshold values where the key is the column name and the value is the threshold for that column.
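The three scaler types named above can be illustrated with plain pandas. This is a sketch of the usual formulas, not the Scaler internals; in particular, I am assuming BinaryScaler maps values above a per-column threshold to 1 and the rest to 0, which is a guess from the `threshold` parameter.

```python
import pandas as pd

df = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [2.0, 4.0, 6.0]})

# MinMaxScaler: rescale each column to [0, 1]
minmax = (df - df.min()) / (df.max() - df.min())

# StandardScaler: zero mean, unit (population) standard deviation
standard = (df - df.mean()) / df.std(ddof=0)

# BinaryScaler (assumed semantics): 0/1 according to a per-column threshold
threshold = {"a": 4.0}
binary = (df["a"] > threshold["a"]).astype(int)
```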
Pipelines¶
- class preprocessy.pipelines.BasePipeline(train_df_path=None, test_df_path=None, steps=None, config_file=None, params=None, custom_reader=None)¶
The BasePipeline class can be used to create your own customized pipeline.
- Parameters
train_df_path (str) – Path to train dataframe. Should not be None
test_df_path (str) – Path to test dataframe. Should not be None
steps (list) – A list of functions which will be executed sequentially. All the functions should be callables
params (dict) – A dictionary containing the parameters that are needed for configuring the pipeline
config_file (str) – Path to a config file that contains the parameters for configuring the pipeline. An alternative to params. A config file for the current params dictionary can be generated using the save_config utility
custom_reader (callable) – Custom function to read the data
Changed in version 1.0.4: The params field of the pipeline is made private and is initialised as a deepcopy of the passed-in parameter dict.
- add(func=None, params=None, **kwargs)¶
Method to add another function to the pipeline after it has been constructed. The parameters of the newly added function are merged with self.__params.
- Parameters
func (callable) – The function to be added
params (dict) – Dictionary of configurable parameters to be added to the existing params dictionary. Can be empty or None.
index (int) – The index at which the function is to be inserted.
after (str) – The step name after which the function should be added
before (str) – The step name before which the function should be added
To add a function, either the index position or the before/after keyword arguments can be supplied.
If index, after and before are all provided, the method will follow the priority: index > after > before
- Raises
ArgumentsError – If no position is provided to insert the function into the pipeline
ValueError – If params contains a key that already exists in self.__params
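The documented priority for positioning a new step (index > after > before) can be sketched as plain list manipulation; this is an illustration of the documented rule, not the BasePipeline source, and the function name is hypothetical.

```python
def resolve_insert_position(steps, index=None, after=None, before=None):
    """Resolve where a new step goes, following the priority index > after > before."""
    if index is not None:
        return index                       # explicit index wins
    if after is not None:
        return steps.index(after) + 1      # insert just after the named step
    if before is not None:
        return steps.index(before)         # insert just before the named step
    raise ValueError("No position provided to insert the function into the pipeline")

steps = ["Parser", "Encoder", "Split"]
pos = resolve_insert_position(steps, after="Parser")
steps.insert(pos, "NullValuesHandler")
```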
- get_params()¶
Returns the parameter dictionary
- Returns
Parameter dictionary
- Return type
dict
New in version 1.0.4.
- print_info()¶
Prints the current configuration of the pipeline. Shows the steps, dataframe paths and config paths.
- process()¶
Method that executes the pipeline sequentially.
- remove(func_name=None)¶
Method to remove a function from the pipeline
- Parameters
func_name (str) – The name of the function which has to be removed from the pipeline
- Raises
TypeError – If func_name is not of type str
- save_config(file_path, config_drop_keys=None)¶
Method to save the params to a JSON config file.
- Parameters
file_path (str) – Path where the config file must be created
config_drop_keys (list, optional) – List of param keys that must not be stored in the config file, defaults to ["train_df", "test_df", "X_train", "X_test", "y_train", "y_test"]
- set_params(params)¶
Set and overwrite the parameter dictionary initialised in the constructor
- Parameters
params (dict) – A dictionary containing the parameters that are needed for configuring the pipeline.
- Raises
ArgumentsError – Raised if params is not a dictionary.
New in version 1.0.4.
- class preprocessy.pipelines.FeatureSelectionPipeline(train_df_path=None, test_df_path=None, steps=None, config_file=None, params=None, custom_reader=None)¶
Pre-built pipeline that can be used for feature selection
The steps of the pipeline are:
Parser
NullValuesHandler
Encoder
HandleOutlier
Scaler
SelectKBest
Split
- class preprocessy.pipelines.StandardPipeline(train_df_path, test_df_path=None, config_file=None, params=None, custom_reader=None)¶
Pre-built generic pipeline that can be used for most datasets
The steps of the pipeline are:
Parser
NullValuesHandler
Encoder
HandleOutlier
Scaler
Split