API

This part of the documentation covers the full API reference of all public classes and functions.

Data Splitting

class preprocessy.data_splitting.Split

Class for splitting the dataset into train and test sets

train_test_split(params)

Performs a train-test split on the input data.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label.

  • target_label (str) – Name of the Target Column.

  • test_size (float, int) – Size of the test set after splitting. A float is interpreted as a proportion and must lie between 0 and 1; an integer is interpreted as an absolute number of samples and must lie between 0 and the number of samples. Complementary to train_size.

  • train_size (float, int) – Size of the train set after splitting. A float is interpreted as a proportion and must lie between 0 and 1; an integer is interpreted as an absolute number of samples and must lie between 0 and the number of samples. Complementary to test_size.

  • shuffle (bool, default = False) – Whether to shuffle the data before splitting.

  • random_state (int) – Seed for shuffling before splitting.

The function inserts the following keys into params:

If target_label is provided:

  • X_train : pandas.core.frame.DataFrame

  • y_train : pandas.core.series.Series

  • X_test : pandas.core.frame.DataFrame

  • y_test : pandas.core.series.Series

Otherwise:

  • train : pandas.core.frame.DataFrame

  • test : pandas.core.frame.DataFrame

Raises

ValueError – If the target column does not have a name property.
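
Example – a minimal usage sketch (the CSV path and the "Price" target column are illustrative, not part of the API):

    import pandas as pd
    from preprocessy.data_splitting import Split

    train_df = pd.read_csv("train.csv")  # hypothetical dataset

    params = {
        "train_df": train_df,
        "target_label": "Price",  # illustrative target column
        "test_size": 0.2,
        "shuffle": True,
        "random_state": 42,
    }
    Split().train_test_split(params)

    # Since target_label was provided, the splits are inserted into params
    X_train, y_train = params["X_train"], params["y_train"]
    X_test, y_test = params["X_test"], params["y_test"]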

class preprocessy.data_splitting.KFold

Class for splitting the input data into K consecutive folds (without shuffling by default).

Each fold is then used once as a validation set while the k - 1 remaining folds form the training set.

get_n_splits()

Returns the number of splitting iterations in the cross-validator

Returns

Returns the number of splitting iterations in the cross-validator.

Return type

int

split(params)

Generate indices to split data into training and test set.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label.

  • n_splits (int, default = 5) – Number of folds. Must be at least 2.

  • shuffle (bool, default = False) – Whether to shuffle the data before splitting into folds.

  • random_state (int, default = None) – Random state used for shuffling.

Yields

The training set indices for that split.

Return type

ndarray

Yields

The testing set indices for that split.

Return type

ndarray

Raises

ValueError – If n_splits > n_samples.
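
Example – a sketch of iterating over the folds (assuming split() yields (train_indices, test_indices) pairs, as the two Yields entries above suggest; the CSV path is illustrative):

    import pandas as pd
    from preprocessy.data_splitting import KFold

    train_df = pd.read_csv("train.csv")  # hypothetical dataset
    params = {"train_df": train_df, "n_splits": 5, "shuffle": True, "random_state": 42}

    kfold = KFold()
    for train_idx, test_idx in kfold.split(params):
        fold_train = train_df.iloc[train_idx]  # k - 1 folds for training
        fold_val = train_df.iloc[test_idx]     # 1 fold for validation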

Encoding

class preprocessy.encoding.Encoder

Class to encode categorical and ordinal features. Categorical encoding options include normal and one-hot.

encode(params)

Function to encode categorical or ordinal columns.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label.

  • target_label (str) – Name of the Target Column. This parameter is needed to ensure that the target column isn’t identified as categorical and encoded.

  • cat_cols (list) – List containing the column names to be encoded categorically.

  • ord_dict (dict) – Dictionary whose keys are the names of the columns to be encoded ordinally and whose values are the dictionaries containing the corresponding mappings.

  • one_hot (bool) – Whether to encode the categorical columns using one-hot encoding.
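
Example – a minimal sketch of categorical and ordinal encoding (the dataframe and column names are illustrative):

    import pandas as pd
    from preprocessy.encoding import Encoder

    train_df = pd.DataFrame({
        "Color": ["red", "blue", "red"],  # categorical column
        "Size": ["S", "M", "L"],          # ordinal column
        "Price": [10, 20, 30],            # target column
    })

    params = {
        "train_df": train_df,
        "target_label": "Price",
        "cat_cols": ["Color"],
        "ord_dict": {"Size": {"S": 0, "M": 1, "L": 2}},  # ordinal mapping
        "one_hot": False,
    }
    Encoder().encode(params)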

Feature Selection

class preprocessy.feature_selection.SelectKBest

Class for finding the K highest-scoring features among the set of all features. Each feature is scored by its correlation with the target label using a scoring function.

Scoring functions include:

  1. f_regression

  2. mutual_info_regression

  3. f_classif

  4. mutual_info_classif

  5. chi2

All scoring functions are provided in sklearn.feature_selection.

fit(params)

Function that fits the scoring function over (X_train, y_train) and generates the scores and p-values of all features with respect to the target label. If no scoring function is passed, it defaults to f_classif or f_regression depending on the predictive problem.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • target_label (str) – Name of the Target Column.

  • score_func (callable) – Function taking two arrays X and y and returning a pair of arrays (scores, pvalues) or a single array of scores. score_func can be custom or taken from sklearn.feature_selection.

  • k (int) – Number of top features to select.

Raises

TypeError – If the scoring function is not a callable.

fit_transform(params)

Performs fit() and transform() in a single step.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • target_label (str) – Name of the Target Column.

  • score_func (callable) – Function taking two arrays X and y and returning a pair of arrays (scores, pvalues) or a single array of scores. score_func is provided by sklearn.feature_selection.

  • k (int) – Number of top features to select.

Raises
  • TypeError – If the scoring function is not a callable.

  • ValueError – If k = 0, as no features would be selected.

  • ValueError – If, after fit(), the length of the selection mask does not match the number of columns in train_df or test_df.

transform(params)

Function to reduce train_df and test_df to the selected features. Adds dataframes of shape (n_samples, k) to params

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • target_label (str) – Name of the Target Column.

  • score_func (callable) – Function taking two arrays X and y and returning a pair of arrays (scores, pvalues) or a single array of scores. score_func is provided by sklearn.feature_selection.

  • k (int) – Number of top features to select.

Raises
  • ValueError – If k = 0, as no features would be selected.

  • ValueError – If, after fit(), the length of the selection mask does not match the number of columns in train_df or test_df.
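
Example – a sketch of selecting the top 5 features with fit_transform() (the dataset paths and the target column are illustrative):

    import pandas as pd
    from sklearn.feature_selection import f_classif
    from preprocessy.feature_selection import SelectKBest

    params = {
        "train_df": pd.read_csv("train.csv"),  # hypothetical datasets
        "test_df": pd.read_csv("test.csv"),
        "target_label": "Outcome",             # illustrative target column
        "score_func": f_classif,
        "k": 5,
    }
    SelectKBest().fit_transform(params)
    # train_df and test_df in params are reduced to the 5 selected features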

Input

class preprocessy.input.Reader

Standard Reader class that reads and loads numeric data into a pandas dataframe.

The file extensions allowed are: .csv, .xls, .xlsx, .xlsm, .xlsb, .odf, .ods, .odt

read_file(params)

Function that takes the train and test dataframe paths and loads them into pandas dataframes.

Parameters
  • train_df_path (str) – Path that points to the train dataset (extension can be any of those listed above). Should not be None.

  • test_df_path (str) – Path that points to the test dataset (extension can be any of those listed above).
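
Example – a minimal sketch (the paths are illustrative; it is assumed here that the loaded dataframes are inserted into params as train_df and test_df, matching the keys the other preprocessy steps consume):

    from preprocessy.input import Reader

    params = {
        "train_df_path": "datasets/train.csv",  # hypothetical paths
        "test_df_path": "datasets/test.csv",
    }
    Reader().read_file(params)

    train_df = params["train_df"]  # assumed output key
    test_df = params["test_df"]    # assumed output key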

Null Values

class preprocessy.missing_data.NullValuesHandler

execute(params)

Function that handles null values in the supplied dataframe and returns a new dataframe. If no user parameters are supplied, the rows containing null values are dropped by default.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label.

  • cat_cols (list) – List containing the names of categorical columns.

  • replace_cat_nulls (str) – The value that will replace null values in the categorical columns.

  • drop_cols (list) – List of names of the columns to be dropped.

  • fill_missing (dict) – Dictionary of the format {"method": [col_list]} indicating the method (mean/median) to be applied to the specified col_list.

  • fill_values (dict) – Column-to-value mapping, where the key is the column name and the value is the custom value to be filled in place of null values.
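
Example – a sketch combining a categorical fill with a mean fill (the dataframe and column names are illustrative):

    import pandas as pd
    from preprocessy.missing_data import NullValuesHandler

    train_df = pd.DataFrame({
        "Age": [22, None, 35],
        "City": ["NY", None, "LA"],
    })

    params = {
        "train_df": train_df,
        "cat_cols": ["City"],
        "replace_cat_nulls": "Unknown",     # constant for categorical nulls
        "fill_missing": {"mean": ["Age"]},  # fill Age nulls with the column mean
    }
    NullValuesHandler().execute(params)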

Outliers

class preprocessy.outliers.HandleOutlier

Class for handling outliers on its own or according to the user's needs. Outliers are handled using percentiles: the two percentile markers delimit the data to be kept, i.e. if one marker is the 5th percentile and the other is the 95th percentile, only the data between them is kept.

The percentiles are used to compute the corresponding data points, and only the points lying within the range of the two computed values are retained.

handle_outliers(params)

This function handles outliers and is flexible in how the percentiles are calculated and what is done with the outliers.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • target_label (str) – Name of the Target Column. This parameter is needed to ensure that the target column isn’t included in the outlier removal process.

  • cat_cols (list) – List containing the column names to be encoded categorically. This parameter is needed to ensure that the categorical columns aren’t included in the outlier removal process.

  • remove_outliers (bool, default = True) – Boolean value to indicate whether outliers should be removed.

  • ord_cols (list) – List containing the column names to be encoded ordinally. This parameter is needed to ensure that the ordinal columns aren’t included in the outlier removal process.

  • replace (bool, default = False) – Boolean value to indicate whether outliers should be replaced instead of removed. If True, outliers are replaced with -999.

  • first_quartile (float) – Float value < 1 representing the lower percentile marker.

  • third_quartile (float) – Float value < 1 representing the upper percentile marker.
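
Example – a sketch that keeps the data between the 5th and 95th percentiles (the dataset path and column names are illustrative):

    import pandas as pd
    from preprocessy.outliers import HandleOutlier

    params = {
        "train_df": pd.read_csv("train.csv"),  # hypothetical dataset
        "target_label": "Price",               # excluded from outlier removal
        "cat_cols": ["Color"],                 # excluded from outlier removal
        "remove_outliers": True,
        "first_quartile": 0.05,                # lower percentile marker
        "third_quartile": 0.95,                # upper percentile marker
    }
    HandleOutlier().handle_outliers(params)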

Scaling

class preprocessy.scaling.Scaler

execute(params)

Method for scaling the columns in a dataset

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label.

  • type ("MinMaxScaler" | "BinaryScaler" | "StandardScaler") – The type of scaler to be used.

  • columns (list) – List of the columns in the dataframe to be scaled.

  • cat_cols (list) – List containing the names of categorical columns.

  • target_label (str) – Name of the Target Column. This parameter is needed to ensure that the target column isn’t scaled.

  • is_combined (bool) – Parameter to determine whether the columns should be scaled together as a group.

  • threshold (dict) – Dictionary of threshold values, where the key is the column name and the value is the threshold for that column.
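
Example – a sketch of min-max scaling two columns (the dataset path and column names are illustrative):

    import pandas as pd
    from preprocessy.scaling import Scaler

    params = {
        "train_df": pd.read_csv("train.csv"),  # hypothetical dataset
        "type": "MinMaxScaler",
        "columns": ["Age", "Income"],          # columns to scale
        "target_label": "Price",               # kept unscaled
    }
    Scaler().execute(params)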

Pipelines

class preprocessy.pipelines.BasePipeline(train_df_path=None, test_df_path=None, steps=None, config_file=None, params=None, custom_reader=None)

The BasePipeline Class can be used to create your own customized pipeline.

Parameters
  • train_df_path (str) – Path to the train dataframe. Should not be None.

  • test_df_path (str) – Path to the test dataframe.

  • steps (list) – A list of functions that will be executed sequentially. All the functions should be callables.

  • params (dict) – A dictionary containing the parameters needed for configuring the pipeline.

  • config_file (str) – Path to a config file that contains the parameters for configuring the pipeline. An alternative to params. A config file for the current params dictionary can be generated using the save_config utility.

  • custom_reader (callable) – Custom function to read the data.

Changed in version 1.0.4: The params field of the pipeline is made private and is initialised as a deepcopy of the passed-in parameter dict.
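
Example – a sketch of a custom two-step pipeline (the path, target column and parameter values are illustrative):

    from preprocessy.pipelines import BasePipeline
    from preprocessy.encoding import Encoder
    from preprocessy.data_splitting import Split

    pipeline = BasePipeline(
        train_df_path="datasets/train.csv",  # hypothetical dataset
        steps=[Encoder().encode, Split().train_test_split],
        params={
            "target_label": "Price",
            "cat_cols": ["Color"],
            "test_size": 0.2,
        },
    )
    pipeline.process()
    X_train = pipeline.get_params()["X_train"]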

add(func=None, params=None, **kwargs)

Method to add another function to the pipeline after it has been constructed. The parameters of the newly added function are merged with self.__params.

Parameters
  • func (callable) – The function to be added

  • params (dict) – Dictionary of configurable parameters to be added to the existing params dictionary. Can be empty or None.

  • index (int) – The index at which the function is to be inserted.

  • after (str) – The step name after which the function should be added

  • before (str) – The step name before which the function should be added

To add a function, either the index position or the before/after keyword arguments can be supplied.

If index, after and before are all provided, the method follows the priority: index > after > before.

Raises
  • ArgumentsError – If no position is provided to insert the function into the pipeline.

  • ValueError – If params contains a key that already exists in self.__params
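
Example – a sketch of inserting a scaling step into the pipeline built above (the step name passed to before is illustrative):

    from preprocessy.scaling import Scaler

    pipeline.add(
        Scaler().execute,
        {"type": "MinMaxScaler", "columns": ["Age"]},  # merged into the params dict
        before="train_test_split",
    )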

get_params()

Returns the parameter dictionary

Returns

Parameter dictionary

Return type

dict

New in version 1.0.4.

print_info()

Prints the current configuration of the pipeline. Shows the steps, dataframe paths and config paths.

process()

Method that executes the pipeline sequentially.

remove(func_name=None)

Method to remove a function from the pipeline

Parameters

func_name (str) – The name of the function which has to be removed from the pipeline

Raises

TypeError – If func_name is not of type str

save_config(file_path, config_drop_keys=None)

Method to save the params to a JSON config file.

Parameters
  • file_path (str) – Path where the config file must be created

  • config_drop_keys (list, optional) – List of param keys that must not be stored in the config file, defaults to ["train_df", "test_df", "X_train", "X_test", "y_train", "y_test"]

set_params(params)

Set and overwrite the parameter dictionary initialised in the constructor

Parameters

params (dict) – A dictionary containing the parameters that are needed for configuring the pipeline.

Raises

ArgumentsError – If params is not a dictionary.

New in version 1.0.4.

class preprocessy.pipelines.FeatureSelectionPipeline(train_df_path=None, test_df_path=None, steps=None, config_file=None, params=None, custom_reader=None)

Pre-built pipeline that can be used for feature selection

The steps of the pipeline are:

  1. Parser

  2. NullValuesHandler

  3. Encoder

  4. HandleOutlier

  5. Scaler

  6. SelectKBest

  7. Split

class preprocessy.pipelines.StandardPipeline(train_df_path, test_df_path=None, config_file=None, params=None, custom_reader=None)

Pre-built generic pipeline that can be used for most datasets

The steps of the pipeline are:

  1. Parser

  2. NullValuesHandler

  3. Encoder

  4. HandleOutlier

  5. Scaler

  6. Split
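
Example – a sketch of running the pre-built pipeline (the path and parameter values are illustrative; FeatureSelectionPipeline is used the same way):

    from preprocessy.pipelines import StandardPipeline

    pipeline = StandardPipeline(
        train_df_path="datasets/train.csv",  # hypothetical dataset
        params={"target_label": "Price", "test_size": 0.2},
    )
    pipeline.process()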