API
This part of the documentation lists the full API reference of all public classes and functions.
Data Splitting
- class preprocessy.data_splitting.Split
Class for splitting the dataset into train and test sets
- train_test_split(params)
Performs train test split on the input data
- Parameters:
train_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
test_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label.
target_label (str) – Name of the Target Column.
test_size (float, int) – Size of the test set after splitting. Takes a value between 0 and 1 for floating-point values, or between 0 and the number of samples for integer values. Complementary to train_size.
train_size (float, int) – Size of the train set after splitting. Takes a value between 0 and 1 for floating-point values, or between 0 and the number of samples for integer values. Complementary to test_size.
shuffle (bool, default = False) – Decides whether to shuffle data before splitting.
random_state (int) – Seeding to be provided for shuffling before splitting.
The function inserts the following into params:
If target_label is provided:
X_train: pandas.core.frame.DataFrame
y_train: pandas.core.series.Series
X_test: pandas.core.frame.DataFrame
y_test: pandas.core.series.Series
Else:
train: pandas.core.frame.DataFrame
test: pandas.core.frame.DataFrame
- Raises:
ValueError – If the target column does not have a name property.
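The interplay between test_size and train_size can be sketched in plain Python. This is only an illustration of the semantics described above, not preprocessy's implementation. The helper names `resolve_test_count` and `split_rows` are hypothetical, and reading floats as fractions, rounding them, and taking the test rows from the end are all assumptions.

```python
import random


def resolve_test_count(n_samples, test_size=None, train_size=None):
    """Return the number of test rows implied by test_size / train_size."""
    def to_count(size):
        # Assumption: floats are a fraction of the samples, ints a row count.
        if isinstance(size, float):
            return int(round(size * n_samples))
        return int(size)

    if test_size is not None:
        return to_count(test_size)
    if train_size is not None:
        # The two sizes are complementary, so one determines the other.
        return n_samples - to_count(train_size)
    raise ValueError("provide test_size or train_size")


def split_rows(rows, test_size=None, train_size=None, shuffle=False, random_state=None):
    """Split a sequence of rows into (train, test) lists."""
    rows = list(rows)
    if shuffle:
        # A fixed seed makes the shuffle reproducible.
        random.Random(random_state).shuffle(rows)
    n_test = resolve_test_count(len(rows), test_size, train_size)
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]


train, test = split_rows(range(10), test_size=0.2)
print(train, test)
```

Passing `train_size=7` instead leaves three rows for the test set, which is what "complementary" means here.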
- class preprocessy.data_splitting.KFold
Class for splitting input data into K folds. Splits the dataset into K consecutive folds (without shuffling by default).
Each fold is then used once as a validation set while the k - 1 remaining folds form the training set.
- get_n_splits()
Returns the number of splitting iterations in the cross-validator
- Returns:
Returns the number of splitting iterations in the cross-validator.
- Return type:
int
- split(params)
Generate indices to split data into training and test set.
- Parameters:
train_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
test_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label.
n_splits (int, default = 5) – Number of folds. Must be at least 2
shuffle (bool, default = False) – Whether to shuffle the data before splitting into folds
random_state (int, default = None) – Random state used for shuffling
- Yield:
The training set indices for that split
- Return type:
ndarray
- Yield:
The testing set indices for that split
- Return type:
ndarray
- Raises:
ValueError – If n_splits > n_samples.
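A minimal sketch of how K consecutive folds can be turned into train and test indices follows. It is written in plain Python under stated assumptions and is not preprocessy's code: the function name `kfold_indices` is hypothetical, shuffling is left out, and spreading the remainder over the first folds is an assumption borrowed from common practice.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) for K consecutive folds."""
    if n_splits > n_samples:
        raise ValueError("n_splits cannot be greater than n_samples")
    # Assumption: the first `extra` folds get one additional sample,
    # so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        # Everything outside the current fold is used for training.
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


folds = list(kfold_indices(6, n_splits=3))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

Each sample index appears in exactly one test fold, and in the training set of the other k - 1 splits.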
Encoding
- class preprocessy.encoding.Encoder
Class to encode categorical and ordinal features. Categorical encoding options include: normal and one-hot
- encode(params)
Function to encode categorical or ordinal columns.
- Parameters:
train_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
test_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label.
target_label (str) – Name of the Target Column. This parameter is needed to ensure that the target column isn’t identified as categorical and encoded.
cat_cols (list) – List containing the column names to be encoded categorically
ord_dict (dict) – Dictionary in which each key is the name of a column to be encoded ordinally and the corresponding value is a dictionary containing the mapping.
one_hot (bool) – True or False, indicating whether the user wants to encode using one-hot.
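The three encoding styles mentioned above can be illustrated on plain lists. This sketch does not call preprocessy; the function names are hypothetical, and interpreting "normal" encoding as integer label codes assigned in sorted order is an assumption.

```python
def ordinal_encode(values, mapping):
    """Encode values with a user-supplied mapping, as ord_dict provides per column."""
    return [mapping[v] for v in values]


def label_encode(values):
    """Assumed meaning of 'normal' encoding: one integer code per category."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def one_hot_encode(values):
    """One 0/1 indicator column per category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Shape of ord_dict as described: column name -> mapping dictionary.
ord_dict = {"size": {"low": 0, "medium": 1, "high": 2}}
sizes = ["low", "high", "medium", "low"]
colors = ["red", "blue", "red"]

encoded_sizes = ordinal_encode(sizes, ord_dict["size"])
encoded_colors = label_encode(colors)
one_hot_colors = one_hot_encode(colors)
print(encoded_sizes, encoded_colors, one_hot_colors)
```

Ordinal encoding keeps the order the user specifies, whereas one-hot encoding avoids implying any order between categories.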
Feature Selection
- class preprocessy.feature_selection.SelectKBest
Class for finding K highest scoring features among the set of all features. Takes a feature and finds its correlation with the target label using a scoring function.
Scoring functions include:
f_regression
mutual_info_regression
f_classif
mutual_info_classif
chi2
All scoring functions are provided in sklearn.feature_selection.
- fit(params)
Function that fits the scoring function over (X_train, y_train) and generates the scores and p-values for all features with the target label. If no scoring function is passed, it defaults to f_classif or f_regression based on the predictive problem.
- Parameters:
train_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
test_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
target_label (str) – Name of the Target Column.
score_func (callable) – Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. score_func can be custom or taken from sklearn.feature_selection.
k (int) – Number of top features to select.
- Raises:
TypeError – The scoring function should be a callable.
- fit_transform(params)
Performs fit() and transform() in a single step.
- Parameters:
train_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
test_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
target_label (str) – Name of the Target Column.
score_func (callable) – Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. score_func is provided from sklearn.feature_selection.
k (int) – Number of top features to select.
- Raises:
TypeError – The scoring function should be a callable.
ValueError – No features are selected when k = 0.
ValueError – After performing fit(), ValueError is raised if the length of the mask and the number of train_df or test_df columns do not match.
- transform(params)
Function to reduce train_df and test_df to the selected features. Adds dataframes of shape (n_samples, k) to params.
- Parameters:
train_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
test_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
target_label (str) – Name of the Target Column.
score_func (callable) – Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. score_func is provided from sklearn.feature_selection.
k (int) – Number of top features to select.
- Raises:
ValueError – No features are selected when k = 0.
ValueError – After performing fit(), ValueError is raised if the length of the mask and the number of train_df or test_df columns do not match.
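The selection step, once a scoring function has produced one score per feature, can be sketched as follows. This is a plain-Python illustration and not the library's implementation: `top_k_mask` and `apply_mask` are hypothetical names, and the scores are made-up toy values rather than the output of a real scoring function.

```python
def top_k_mask(scores, k):
    """Return a boolean mask that marks the k highest-scoring features."""
    if k == 0:
        raise ValueError("no features are selected when k = 0")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(scores))]


def apply_mask(columns, mask):
    """Keep only the columns whose mask entry is True."""
    if len(mask) != len(columns):
        raise ValueError("length of mask and number of columns do not match")
    return [col for col, selected in zip(columns, mask) if selected]


# Toy scores, one per feature (illustrative values only).
columns = ["age", "income", "height", "zip"]
scores = [0.1, 3.2, 1.5, 0.7]

mask = top_k_mask(scores, k=2)
selected = apply_mask(columns, mask)
print(mask, selected)
```

The same mask is applied to both the train and the test columns, which is why a length mismatch has to be reported as an error.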
Input
- class preprocessy.input.Reader
Standard Reader class that serves to read and load numeric data into a pandas dataframe.
The file extensions allowed are: .csv, .xls, .xlsx, .xlsm, .xlsb, .odf, .ods, .odt
- read_file(params)
Function to take the train and test dataframe paths and load them into pandas dataframes
- Parameters:
train_df_path (str) – Path that points to the train dataset (extension can be any of those listed above). Should not be None.
test_df_path (str) – Path that points to the test dataset (extension can be any of those listed above).
Null Values
- class preprocessy.missing_data.NullValuesHandler
- execute(params)
Function that handles null values in the supplied dataframe and returns a new dataframe. If no user parameters are supplied, the rows containing null values are dropped by default.
- Parameters:
train_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
test_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label.
cat_cols (list) – List containing the names of categorical columns
replace_cat_nulls (str) – The value which will replace null values in the categorical columns
drop_cols (list) – List of column names of columns to be dropped
fill_missing (dict) – Dictionary of format {"method": [col_list]} to indicate the method (mean/median) to be applied on the specified col_list
fill_values (dict) – Column and value mapping, where the key is the column name and value is the custom value to be filled in place of null values
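The default behaviour (dropping rows with nulls) and the mean/median fill can be sketched on plain Python data. This is an illustration only and does not use preprocessy: the function names are hypothetical, and representing a null as `None` is an assumption made to keep the example free of dependencies.

```python
from statistics import mean, median


def drop_null_rows(rows):
    """Default strategy: drop every row that contains a null value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def fill_nulls(column, method):
    """Replace nulls in one column with the mean or median of the present values."""
    present = [v for v in column if v is not None]
    fill = {"mean": mean, "median": median}[method](present)
    return [fill if v is None else v for v in column]


rows = [
    {"age": 30, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 41, "city": None},
]
kept = drop_null_rows(rows)

# Shape of fill_missing as described: {"method": [col_list]}.
fill_missing = {"mean": ["height"], "median": ["weight"]}
height = fill_nulls([1, None, 3, 8], "mean")
weight = fill_nulls([1, None, 3, 8], "median")
print(kept, height, weight)
```

Filling keeps every row at the cost of introducing estimated values, while dropping keeps only fully observed rows.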
Outliers
- class preprocessy.outliers.HandleOutlier
Class for handling outliers on its own or according to the user's needs. This class handles outliers using percentiles. The two percentile markers bound the data to be kept, i.e. if one marker is the 5th percentile and the other is the 95th percentile, the data within this range is kept.
The percentiles are used to compute the corresponding data values, and only the data points lying between those two values are kept.
- handle_outliers(params)
This function handles outliers and is flexible in how the percentiles are calculated and in what is done with the outliers.
- Parameters:
train_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
test_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
target_label (str) – Name of the Target Column. This parameter is needed to ensure that the target column isn’t included in the outlier removal process.
cat_cols (list) – List containing the column names to be encoded categorically. This parameter is needed to ensure that the categorical columns aren’t included in the outlier removal process.
remove_outliers (bool, default=True) – Boolean value to indicate whether the user wants to remove outliers or not.
ord_cols (list) – List containing the column names to be encoded ordinally. This parameter is needed to ensure that the ordinal columns aren’t included in the outlier removal process.
replace (bool, default=False) – Boolean value to indicate if the outliers need to be replaced. This will replace the outliers with -999 and will not remove them.
first_quartile (float) – Float value < 1 representing the first percentile marker.
third_quartile (float) – Float value < 1 representing the other percentile marker.
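The percentile-based filtering described for this class can be sketched as below. The code is a plain-Python approximation under stated assumptions, not the library's implementation: the function names are hypothetical, linear interpolation between neighbouring values is an assumed percentile definition, and the bounds are treated as inclusive.

```python
def percentile(sorted_vals, q):
    """Percentile for q in [0, 1], using linear interpolation (an assumption)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * frac


def handle_outliers(values, lower_q=0.05, upper_q=0.95, replace=False, sentinel=-999):
    """Drop values outside the two percentile markers, or replace them with -999."""
    ordered = sorted(values)
    low = percentile(ordered, lower_q)
    high = percentile(ordered, upper_q)
    if replace:
        # Outliers are kept in place but overwritten with the sentinel.
        return [v if low <= v <= high else sentinel for v in values]
    return [v for v in values if low <= v <= high]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
removed = handle_outliers(values, lower_q=0.1, upper_q=0.9)
replaced = handle_outliers(values, lower_q=0.1, upper_q=0.9, replace=True)
print(removed)
print(replaced)
```

Replacing preserves the row count, which matters when other columns of the same rows must stay aligned.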
Scaling
- class preprocessy.scaling.Scaler
- execute(params)
Method for scaling the columns in a dataset
- Parameters:
train_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label. Should not be None.
test_df (pandas.core.frame.DataFrame) – Input dataframe, which may or may not contain the target label.
type ("MinMaxScaler" | "BinaryScaler" | "StandardScaler") – The type of Scaler to be used
columns (list) – List of columns in the dataframe
cat_cols (list) – List containing the names of categorical columns
target_label (str) – Name of the Target Column. This parameter is needed to ensure that the target column isn’t scaled
is_combined (bool) – Parameter to determine whether columns should be scaled together as a group
threshold (dict) – Dictionary of threshold values where the key is the column name and the value is the threshold for that column.
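What the three scaler types compute on a single column can be sketched with the standard library. These are the usual textbook definitions, written as an assumption about what each type means. They are not taken from preprocessy's source, and the function names are hypothetical. In particular, mapping values above the threshold to 1 for the binary case is an assumption.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def binary_scale(values, threshold):
    """Assumed binary rule: 1 above the column threshold, 0 otherwise."""
    return [1 if v > threshold else 0 for v in values]


column = [10, 20, 30, 40]
# Shape of threshold as described: column name -> threshold value.
threshold = {"price": 25}

scaled_min_max = min_max_scale(column)
scaled_standard = standard_scale(column)
scaled_binary = binary_scale(column, threshold["price"])
print(scaled_min_max, scaled_standard, scaled_binary)
```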
Pipelines
- class preprocessy.pipelines.BasePipeline(train_df_path=None, test_df_path=None, steps=None, config_file=None, params=None, custom_reader=None)
The BasePipeline class can be used to create your own customized pipeline.
- Parameters:
train_df_path (str) – Path to the train dataframe. Should not be None.
test_df_path (str) – Path to the test dataframe. Should not be None.
steps (list) – A list of functions which will be executed sequentially. All the functions should be callables
params (dict) – A dictionary containing the parameters that are needed for configuring the pipeline
config_file (str) – Path to a config file that contains the parameters for configuring the pipeline. An alternative to params. A config file for the current params dictionary can be generated using the save_config utility.
custom_reader (callable) – Custom function to read the data
Changed in version 1.0.4: The params field of the pipeline is made private and is initialised as a deepcopy of the passed-in parameter dict.
- add(func=None, params=None, **kwargs)
Method to add another function to the pipeline after it has been constructed. The parameters of the newly added function are merged with self.__params.
- Parameters:
func (callable) – The function to be added
params (dict) – Dictionary of configurable parameters to be added to the existing params dictionary. Can be empty or None.
index (int) – The index at which the function is to be inserted.
after (str) – The step name after which the function should be added
before (str) – The step name before which the function should be added
To add a function, either the index position or the before/after keyword arguments can be supplied.
If index, after and before are all provided, the method will follow the priority: index > after > before
- Raises:
ArgumentsError – If no position is provided to insert the function into the pipeline
ValueError – If params contains a key that already exists in self.__params
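The positioning priority and the duplicate-key check described for add() can be sketched as follows. This is a plain-Python illustration, not the library's code: `resolve_insert_position` and `merge_params` are hypothetical helpers, steps are represented by their names only, and a plain ValueError stands in for the library's ArgumentsError.

```python
def resolve_insert_position(step_names, index=None, after=None, before=None):
    """Pick the insertion index following the priority index > after > before."""
    if index is not None:
        return index
    if after is not None:
        return step_names.index(after) + 1
    if before is not None:
        return step_names.index(before)
    # The documentation names ArgumentsError here; ValueError is a stand-in.
    raise ValueError("no position provided to insert the function")


def merge_params(existing, new):
    """Merge new parameters, refusing keys that already exist."""
    new = new or {}
    clashes = set(existing) & set(new)
    if clashes:
        raise ValueError(f"params already contains: {sorted(clashes)}")
    return {**existing, **new}


steps = ["read", "encode", "split"]
pos_after_wins = resolve_insert_position(steps, after="read", before="split")
pos_index_wins = resolve_insert_position(steps, index=0, after="read", before="split")
pos_before = resolve_insert_position(steps, before="split")

steps.insert(pos_after_wins, "clean")
merged = merge_params({"test_size": 0.2}, {"k": 3})
print(steps, merged)
```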
- get_params()
Returns the parameter dictionary
- Returns:
Parameter dictionary
- Return type:
dict
New in version 1.0.4.
- print_info()
Prints the current configuration of the pipeline. Shows the steps, dataframe paths and config paths.
- process()
Method that executes the pipeline sequentially.
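Sequential execution over a shared parameter dictionary can be sketched as below. This is a minimal illustration of the idea, assuming each step is a callable that reads from and writes into the same dictionary. The function `run_pipeline` and the toy steps are hypothetical and do not reproduce the library's process() method.

```python
import copy


def run_pipeline(steps, params):
    """Run each step in order; every step receives the same params dict."""
    # Work on a deep copy so the caller's dictionary is left untouched.
    state = copy.deepcopy(params)
    for step in steps:
        step(state)
    return state


def load(params):
    # Stand-in for a reader: puts toy rows into params.
    params["train_df"] = [3, 1, 2]


def sort_rows(params):
    params["train_df"] = sorted(params["train_df"])


def split(params):
    n_test = params["test_size"]
    rows = params["train_df"]
    params["train"], params["test"] = rows[:-n_test], rows[-n_test:]


user_params = {"test_size": 1}
result = run_pipeline([load, sort_rows, split], user_params)
print(result["train"], result["test"])
```

Because later steps read what earlier steps wrote, the order of the steps list determines the outcome.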
- remove(func_name=None)
Method to remove a function from the pipeline
- Parameters:
func_name (str) – The name of the function which has to be removed from the pipeline
- Raises:
TypeError – If func_name is not of type str
- save_config(file_path, config_drop_keys=None)
Method to save the params to a JSON config file.
- Parameters:
file_path (str) – Path where the config file must be created
config_drop_keys (list, optional) – List of param keys that must not be stored in the config file, defaults to ["train_df", "test_df", "X_train", "X_test", "y_train", "y_test"]
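Saving a parameter dictionary while leaving out the dataframe-valued keys can be sketched with the standard json module. This is a hypothetical stand-alone function that mirrors the documented signature and default drop list. It is not the library's method, and the indentation setting and file layout are assumptions.

```python
import json
import os
import tempfile

DEFAULT_DROP_KEYS = ["train_df", "test_df", "X_train", "X_test", "y_train", "y_test"]


def save_config(params, file_path, config_drop_keys=None):
    """Write params to a JSON file, skipping keys that should not be stored."""
    drop = DEFAULT_DROP_KEYS if config_drop_keys is None else config_drop_keys
    config = {key: value for key, value in params.items() if key not in drop}
    with open(file_path, "w") as fh:
        json.dump(config, fh, indent=2)
    return config


params = {
    "train_df": "<dataframe placeholder>",  # would not be JSON-serialisable in practice
    "target_label": "price",
    "test_size": 0.2,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config.json")
    save_config(params, path)
    with open(path) as fh:
        loaded = json.load(fh)

print(loaded)
```

Dropping the dataframe keys keeps the config file small and restricted to values that JSON can represent.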
- set_params(params)
Set and overwrite the parameter dictionary initialised in the constructor
- Parameters:
params (dict) – A dictionary containing the parameters that are needed for configuring the pipeline.
- Raises:
ArgumentsError – If params is not a dictionary.
New in version 1.0.4.
- class preprocessy.pipelines.FeatureSelectionPipeline(train_df_path=None, test_df_path=None, steps=None, config_file=None, params=None, custom_reader=None)
Pre-built pipeline that can be used for feature selection
The steps of the pipeline are:
Parser
NullValuesHandler
Encoder
HandleOutlier
Scaler
SelectKBest
Split
- class preprocessy.pipelines.StandardPipeline(train_df_path, test_df_path=None, config_file=None, params=None, custom_reader=None)
Pre-built generic pipeline that can be used for most datasets
The steps of the pipeline are:
Parser
NullValuesHandler
Encoder
HandleOutlier
Scaler
Split