API

This part of the documentation covers the full API reference of all public classes and functions.

Data Splitting

class preprocessy.data_splitting.Split

Class for splitting the dataset into train and test sets

train_test_split(params)

Performs a train-test split on the input data.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label.

  • target_label (str) – Name of the Target Column.

  • test_size (float, int) – Size of the test set after splitting. A float is interpreted as a proportion and must lie between 0 and 1; an integer is interpreted as an absolute number of samples and must lie between 0 and the number of samples. Complementary to train_size.

  • train_size (float, int) – Size of the train set after splitting. A float is interpreted as a proportion and must lie between 0 and 1; an integer is interpreted as an absolute number of samples and must lie between 0 and the number of samples. Complementary to test_size.

  • shuffle (bool, default = False) – Whether to shuffle the data before splitting.

  • random_state (int) – Seed for shuffling before splitting.

The function inserts the following keys into params:

If target_label is provided:

  • X_train : pandas.core.frame.DataFrame

  • y_train : pandas.core.series.Series

  • X_test : pandas.core.frame.DataFrame

  • y_test : pandas.core.series.Series

Otherwise:

  • train : pandas.core.frame.DataFrame

  • test : pandas.core.frame.DataFrame

Raises

ValueError – If the target column does not have a name property.
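
Example – a minimal usage sketch (the CSV path and the "Price" target column are illustrative, not part of the API):

    import pandas as pd
    from preprocessy.data_splitting import Split

    train_df = pd.read_csv("train.csv")  # hypothetical dataset

    params = {
        "train_df": train_df,
        "target_label": "Price",  # illustrative target column
        "test_size": 0.2,
        "shuffle": True,
        "random_state": 42,
    }
    Split().train_test_split(params)

    # Since target_label was provided, the splits are inserted into params
    X_train, y_train = params["X_train"], params["y_train"]
    X_test, y_test = params["X_test"], params["y_test"]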

class preprocessy.data_splitting.KFold

Class for splitting the input data into K consecutive folds (without shuffling by default).

Each fold is then used once as a validation set while the k - 1 remaining folds form the training set.

get_n_splits()

Returns the number of splitting iterations in the cross-validator

Returns

Returns the number of splitting iterations in the cross-validator.

Return type

int

split(params)

Generate indices to split data into training and test set.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label.

  • n_splits (int, default = 5) – Number of folds. Must be at least 2.

  • shuffle (bool, default = False) – Whether to shuffle the data before splitting into folds.

  • random_state (int, default = None) – Random state used for shuffling.

Yields

The training set indices for that split.

Return type

ndarray

Yields

The testing set indices for that split.

Return type

ndarray

Raises

ValueError – If n_splits > n_samples.
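
Example – a sketch of iterating over the folds (assuming split() yields (train_indices, test_indices) pairs, as the two Yields entries above suggest; the CSV path is illustrative):

    import pandas as pd
    from preprocessy.data_splitting import KFold

    train_df = pd.read_csv("train.csv")  # hypothetical dataset
    params = {"train_df": train_df, "n_splits": 5, "shuffle": True, "random_state": 42}

    kfold = KFold()
    for train_idx, test_idx in kfold.split(params):
        fold_train = train_df.iloc[train_idx]  # k - 1 folds for training
        fold_val = train_df.iloc[test_idx]     # 1 fold for validation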

Encoding

class preprocessy.encoding.Encoder

Class to encode categorical and ordinal features. Categorical encoding options include normal and one-hot.

encode(params)

Function to encode categorical or ordinal columns.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label.

  • target_label (str) – Name of the Target Column. This parameter is needed to ensure that the target column isn’t identified as categorical and encoded.

  • cat_cols (list) – List containing the column names to be encoded categorically.

  • ord_dict (dict) – Dictionary whose keys are the names of the columns to be encoded ordinally and whose values are the dictionaries containing the corresponding mappings.

  • one_hot (bool) – Whether to encode the categorical columns using one-hot encoding.
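
Example – a minimal sketch of categorical and ordinal encoding (the dataframe and column names are illustrative):

    import pandas as pd
    from preprocessy.encoding import Encoder

    train_df = pd.DataFrame({
        "Color": ["red", "blue", "red"],  # categorical column
        "Size": ["S", "M", "L"],          # ordinal column
        "Price": [10, 20, 30],            # target column
    })

    params = {
        "train_df": train_df,
        "target_label": "Price",
        "cat_cols": ["Color"],
        "ord_dict": {"Size": {"S": 0, "M": 1, "L": 2}},  # ordinal mapping
        "one_hot": False,
    }
    Encoder().encode(params)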

Feature Selection

class preprocessy.feature_selection.SelectKBest

Class for finding the K highest-scoring features among the set of all features. Each feature is scored by its correlation with the target label using a scoring function.

Scoring functions include:

  1. f_regression

  2. mutual_info_regression

  3. f_classif

  4. mutual_info_classif

  5. chi2

All scoring functions are provided in sklearn.feature_selection.

fit(params)

Function that fits the scoring function over (X_train, y_train) and generates the scores and p-values of all features with respect to the target label. If no scoring function is passed, it defaults to f_classif or f_regression depending on the predictive problem.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • target_label (str) – Name of the Target Column.

  • score_func (callable) – Function taking two arrays X and y and returning a pair of arrays (scores, pvalues) or a single array of scores. score_func can be custom or taken from sklearn.feature_selection.

  • k (int) – Number of top features to select.

Raises

TypeError – If the scoring function is not a callable.

fit_transform(params)

Performs fit() and transform() in a single step.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • target_label (str) – Name of the Target Column.

  • score_func (callable) – Function taking two arrays X and y and returning a pair of arrays (scores, pvalues) or a single array of scores. score_func is provided by sklearn.feature_selection.

  • k (int) – Number of top features to select.

Raises
  • TypeError – If the scoring function is not a callable.

  • ValueError – If k = 0, as no features would be selected.

  • ValueError – If, after fit(), the length of the selection mask does not match the number of columns in train_df or test_df.

transform(params)

Function to reduce train_df and test_df to the selected features. Adds dataframes of shape (n_samples, k) to params

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • target_label (str) – Name of the Target Column.

  • score_func (callable) – Function taking two arrays X and y and returning a pair of arrays (scores, pvalues) or a single array of scores. score_func is provided by sklearn.feature_selection.

  • k (int) – Number of top features to select.

Raises
  • ValueError – If k = 0, as no features would be selected.

  • ValueError – If, after fit(), the length of the selection mask does not match the number of columns in train_df or test_df.
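
Example – a sketch of selecting the top 5 features with fit_transform() (the dataset paths and the target column are illustrative):

    import pandas as pd
    from sklearn.feature_selection import f_classif
    from preprocessy.feature_selection import SelectKBest

    params = {
        "train_df": pd.read_csv("train.csv"),  # hypothetical datasets
        "test_df": pd.read_csv("test.csv"),
        "target_label": "Outcome",             # illustrative target column
        "score_func": f_classif,
        "k": 5,
    }
    SelectKBest().fit_transform(params)
    # train_df and test_df in params are reduced to the 5 selected features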

Input

class preprocessy.input.Reader

Standard Reader class that reads and loads numeric data into a pandas dataframe.

The file extensions allowed are: .csv, .xls, .xlsx, .xlsm, .xlsb, .odf, .ods, .odt

read_file(params)

Function that takes the train and test dataframe paths and loads them into pandas dataframes.

Parameters
  • train_df_path (str) – Path that points to the train dataset (extension can be any of those listed above). Should not be None.

  • test_df_path (str) – Path that points to the test dataset (extension can be any of those listed above).
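
Example – a minimal sketch (the paths are illustrative; it is assumed here that the loaded dataframes are inserted into params as train_df and test_df, matching the keys the other preprocessy steps consume):

    from preprocessy.input import Reader

    params = {
        "train_df_path": "datasets/train.csv",  # hypothetical paths
        "test_df_path": "datasets/test.csv",
    }
    Reader().read_file(params)

    train_df = params["train_df"]  # assumed output key
    test_df = params["test_df"]    # assumed output key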

Null Values

class preprocessy.missing_data.NullValuesHandler

execute(params)

Function that handles null values in the supplied dataframe and returns a new dataframe. If no user parameters are supplied, the rows containing null values are dropped by default.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label.

  • cat_cols (list) – List containing the names of categorical columns.

  • replace_cat_nulls (str) – The value that will replace null values in the categorical columns.

  • drop_cols (list) – List of names of the columns to be dropped.

  • fill_missing (dict) – Dictionary of the format {"method": [col_list]} indicating the method (mean/median) to be applied to the specified col_list.

  • fill_values (dict) – Column-to-value mapping, where the key is the column name and the value is the custom value to be filled in place of null values.
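
Example – a sketch combining a categorical fill with a mean fill (the dataframe and column names are illustrative):

    import pandas as pd
    from preprocessy.missing_data import NullValuesHandler

    train_df = pd.DataFrame({
        "Age": [22, None, 35],
        "City": ["NY", None, "LA"],
    })

    params = {
        "train_df": train_df,
        "cat_cols": ["City"],
        "replace_cat_nulls": "Unknown",     # constant for categorical nulls
        "fill_missing": {"mean": ["Age"]},  # fill Age nulls with the column mean
    }
    NullValuesHandler().execute(params)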

Outliers

class preprocessy.outliers.HandleOutlier

Class for handling outliers on its own or according to the user's needs. Outliers are handled using percentiles: the two percentile markers delimit the data to be kept, i.e. if one marker is the 5th percentile and the other is the 95th percentile, only the data between them is kept.

The percentiles are used to compute the corresponding data points, and only the points lying within the range of the two computed values are retained.

handle_outliers(params)

This function handles outliers and is flexible in how the percentiles are calculated and what is done with the outliers.

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • target_label (str) – Name of the Target Column. This parameter is needed to ensure that the target column isn’t included in the outlier removal process.

  • cat_cols (list) – List containing the column names to be encoded categorically. This parameter is needed to ensure that the categorical columns aren’t included in the outlier removal process.

  • remove_outliers (bool, default = True) – Boolean value to indicate whether outliers should be removed.

  • ord_cols (list) – List containing the column names to be encoded ordinally. This parameter is needed to ensure that the ordinal columns aren’t included in the outlier removal process.

  • replace (bool, default = False) – Boolean value to indicate whether outliers should be replaced instead of removed. If True, outliers are replaced with -999.

  • first_quartile (float) – Float value < 1 representing the lower percentile marker.

  • third_quartile (float) – Float value < 1 representing the upper percentile marker.
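
Example – a sketch that keeps the data between the 5th and 95th percentiles (the dataset path and column names are illustrative):

    import pandas as pd
    from preprocessy.outliers import HandleOutlier

    params = {
        "train_df": pd.read_csv("train.csv"),  # hypothetical dataset
        "target_label": "Price",               # excluded from outlier removal
        "cat_cols": ["Color"],                 # excluded from outlier removal
        "remove_outliers": True,
        "first_quartile": 0.05,                # lower percentile marker
        "third_quartile": 0.95,                # upper percentile marker
    }
    HandleOutlier().handle_outliers(params)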

Scaling

class preprocessy.scaling.Scaler

execute(params)

Method for scaling the columns in a dataset

Parameters
  • train_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label. Should not be None.

  • test_df (pandas.core.frame.DataFrame) – Input dataframe; may or may not contain the target label.

  • type ("MinMaxScaler" | "BinaryScaler" | "StandardScaler") – The type of scaler to be used.

  • columns (list) – List of the columns in the dataframe to be scaled.

  • cat_cols (list) – List containing the names of categorical columns.

  • target_label (str) – Name of the Target Column. This parameter is needed to ensure that the target column isn’t scaled.

  • is_combined (bool) – Parameter to determine whether the columns should be scaled together as a group.

  • threshold (dict) – Dictionary of threshold values, where the key is the column name and the value is the threshold for that column.
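
Example – a sketch of min-max scaling two columns (the dataset path and column names are illustrative):

    import pandas as pd
    from preprocessy.scaling import Scaler

    params = {
        "train_df": pd.read_csv("train.csv"),  # hypothetical dataset
        "type": "MinMaxScaler",
        "columns": ["Age", "Income"],          # columns to scale
        "target_label": "Price",               # kept unscaled
    }
    Scaler().execute(params)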

Pipelines

class preprocessy.pipelines.BasePipeline(train_df_path=None, test_df_path=None, steps=None, config_file=None, params=None, custom_reader=None)

The BasePipeline Class can be used to create your own customized pipeline.

Parameters
  • train_df_path (str) – Path to the train dataframe. Should not be None.

  • test_df_path (str) – Path to the test dataframe.

  • steps (list) – A list of functions that will be executed sequentially. All the functions should be callables.

  • params (dict) – A dictionary containing the parameters needed for configuring the pipeline.

  • config_file (str) – Path to a config file that contains the parameters for configuring the pipeline. An alternative to params. A config file for the current params dictionary can be generated using the save_config utility.

  • custom_reader (callable) – Custom function to read the data.

Changed in version 1.0.4: The params field of the pipeline is made private and is initialised as a deepcopy of the passed-in parameter dict.
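
Example – a sketch of a custom two-step pipeline (the path, target column and parameter values are illustrative):

    from preprocessy.pipelines import BasePipeline
    from preprocessy.encoding import Encoder
    from preprocessy.data_splitting import Split

    pipeline = BasePipeline(
        train_df_path="datasets/train.csv",  # hypothetical dataset
        steps=[Encoder().encode, Split().train_test_split],
        params={
            "target_label": "Price",
            "cat_cols": ["Color"],
            "test_size": 0.2,
        },
    )
    pipeline.process()
    X_train = pipeline.get_params()["X_train"]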

add(func=None, params=None, **kwargs)

Method to add another function to the pipeline after it has been constructed. The parameters of the newly added function are merged with self.__params.

Parameters
  • func (callable) – The function to be added

  • params (dict) – Dictionary of configurable parameters to be added to the existing params dictionary. Can be empty or None.

  • index (int) – The index at which the function is to be inserted.

  • after (str) – The step name after which the function should be added

  • before (str) – The step name before which the function should be added

To add a function, either the index position or the before/after keyword arguments can be supplied.

If index, after and before are all provided, the method follows the priority: index > after > before.

Raises
  • ArgumentsError – If no position is provided to insert the function into the pipeline.

  • ValueError – If params contains a key that already exists in self.__params
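
Example – a sketch of inserting a scaling step into the pipeline built above (the step name passed to before is illustrative):

    from preprocessy.scaling import Scaler

    pipeline.add(
        Scaler().execute,
        {"type": "MinMaxScaler", "columns": ["Age"]},  # merged into the params dict
        before="train_test_split",
    )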

get_params()

Returns the parameter dictionary

Returns

Parameter dictionary

Return type

dict

New in version 1.0.4.

print_info()

Prints the current configuration of the pipeline. Shows the steps, dataframe paths and config paths.

process()

Method that executes the pipeline sequentially.

remove(func_name=None)

Method to remove a function from the pipeline

Parameters

func_name (str) – The name of the function which has to be removed from the pipeline

Raises

TypeError – If func_name is not of type str

save_config(file_path, config_drop_keys=None)

Method to save the params to a JSON config file.

Parameters
  • file_path (str) – Path where the config file must be created

  • config_drop_keys (list, optional) – List of param keys that must not be stored in the config file, defaults to ["train_df", "test_df", "X_train", "X_test", "y_train", "y_test"]

set_params(params)

Set and overwrite the parameter dictionary initialised in the constructor

Parameters

params (dict) – A dictionary containing the parameters that are needed for configuring the pipeline.

Raises

ArgumentsError – If params is not a dictionary.

New in version 1.0.4.

class preprocessy.pipelines.FeatureSelectionPipeline(train_df_path=None, test_df_path=None, steps=None, config_file=None, params=None, custom_reader=None)

Pre-built pipeline that can be used for feature selection

The steps of the pipeline are:

  1. Parser

  2. NullValuesHandler

  3. Encoder

  4. HandleOutlier

  5. Scaler

  6. SelectKBest

  7. Split

class preprocessy.pipelines.StandardPipeline(train_df_path, test_df_path=None, config_file=None, params=None, custom_reader=None)

Pre-built generic pipeline that can be used for most datasets

The steps of the pipeline are:

  1. Parser

  2. NullValuesHandler

  3. Encoder

  4. HandleOutlier

  5. Scaler

  6. Split
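
Example – a sketch of running the pre-built pipeline (the path and parameter values are illustrative; FeatureSelectionPipeline is used the same way):

    from preprocessy.pipelines import StandardPipeline

    pipeline = StandardPipeline(
        train_df_path="datasets/train.csv",  # hypothetical dataset
        params={"target_label": "Price", "test_size": 0.2},
    )
    pipeline.process()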