Parameters for pipeline configuration

The parameters required by both the built-in and custom pipelines are supplied as a dict object or as a path to a JSON file. These params are consumed by the individual functions of the pipeline.

When writing custom functions, you can add custom params required by your function. The params dict is passed to your function when it is called as part of the pipeline, so your function has access to your custom parameter (see the sketch after the warning below).

Warning

If the name of your custom parameter conflicts with one required by the built-in functions, your value will override the built-in parameter and lead to unpredictable outputs and errors.

Please ensure that you do not use conflicting names for your custom parameters.
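
A minimal sketch of a custom pipeline function, assuming custom functions receive the shared params dict as their only argument. The function name add_family_size, the dict key train_df, and the custom parameter family_size_cols are illustrative, not part of the built-in parameters listed below.

def add_family_size(params):
    # Assumption: the loaded train dataframe is available in the shared
    # params dict after the read step.
    df = params["train_df"]
    # Custom key; its name does not clash with any built-in parameter.
    cols = params["family_size_cols"]
    df["FamilySize"] = df[cols].sum(axis=1)

# The custom key is supplied alongside the built-in parameters.
params = {
    "train_df_path": "/Users/home/datasets/titanic.csv",
    "target_label": "Survived",
    "family_size_cols": ["SibSp", "Parch"],  # consumed only by add_family_size
}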

The parameters used by the built-in functions are listed below along with their meaning, expected dtype, and default value, if any. Any parameter marked * is required.

  • train_df_path *

Path to the train dataset. The path should point to a file with one of the supported extensions. For the list of allowed extensions, see preprocessy.input.

dtype: str
example: "/Users/home/datasets/titanic.csv"
  • test_df_path

Path to the test dataset. The path should point to a file with one of the supported extensions. For the list of allowed extensions, see preprocessy.input.

dtype: str
example: "/Users/home/datasets/titanic_test.csv"
  • target_label

Name of the target column.

dtype: str
example: "Survived"
  • cat_cols

List of column names, provided by the user, indicating the columns to be encoded categorically. If None, Preprocessy analyses the data and identifies such columns on its own.

dtype: list[str]
example: ["Gender", "Cabin", "Embarked"]
  • ord_dict

Ordinal attributes require an associated weight mapping. This parameter maps each ordinal attribute's column name to its weight mapping, a dict specifying the weight associated with each unique value of that attribute.

dtype: dict[str => dict[str => int | float]]
example: {
    "Difficulty": {
        "Easy": 5
        "Medium": 10,
        "Hard": 15
    }
}
  • replace_cat_nulls

When handling missing data for categorical and ordinal attributes, the int provided here is used to replace null values. If no value is provided, the column will be dropped.

dtype: int
example: 99
  • drop_cols

List of column names to be dropped.

dtype: list[str]
example: ["PassengerId", "Name"]
  • fill_missing

Dictionary of the format {"method": [cols]} indicating the method (mean or median) to apply to the specified list of columns.

dtype: dict["mean" | "median" => list[str]]
example: {
    "mean": ["col_A", "col_B"],
    "median": ["col_C"]
}
  • fill_values

Dictionary mapping column names to the values used to fill the null records in the corresponding column.

dtype: dict[str => any]
example: {
    "Age": 19,
    "Name": "John"
}
  • one_hot

True if one-hot encoding is desired. Default = False.

dtype: bool
  • remove_outliers

True if outlier records are to be removed. If both remove_outliers and replace are False, a warning is raised and no operation is performed. Default = True.

dtype: bool
  • replace

True if outliers are to be replaced with -999. Default = False.

dtype: bool
  • first_quartile

Float value between 0 and 1, representing the first quartile marker. For more, see preprocessy.outliers.

dtype: float
example: 0.25
  • third_quartile

Float value between 0 and 1, representing the third quartile marker. For more, see preprocessy.outliers.

dtype: float
example: 0.75
  • type

The type of Scaler to be used. Default = StandardScaler.

dtype: "MinMaxScaler" | "BinaryScaler" | "StandardScaler"
example: "MinMaxScaler"
  • columns

List of columns in the dataframe for which scaling is to be done. If None, defaults to all columns with a numeric dtype.

dtype: list[str]
example: ["Fare"]
  • is_combined

Parameter to determine whether columns should be scaled together as a group.

dtype: bool
  • threshold

BinaryScaler uses a dictionary of threshold values, where the key is the column name and the value is the threshold for that column. All values less than or equal to the threshold are scaled to 0; values above the threshold are scaled to 1. The default threshold is 0.

dtype: dict[str => int | float]
example: {
    "Age": 17
}
  • score_func

Function taking two arrays, X and y, and returning either a pair of arrays (scores, pvalues) or a single array of scores. score_func can be a custom function (see the sketch after this list) or one from sklearn.feature_selection.

dtype: func(iterable, iterable) => (list[float], list[float]) | list[float]
example: f_classif from sklearn.feature_selection
  • k

Number of top features to select.

dtype: int
example: 10
  • test_size

Size of the test set after splitting. A float must lie between 0 and 1 and is interpreted as a fraction of the dataset; an int must lie between 0 and the number of samples and is interpreted as an absolute count. Complementary to train_size.

dtype: int | float
example: 0.2, 200
  • train_size

Size of the train set after splitting. A float must lie between 0 and 1 and is interpreted as a fraction of the dataset; an int must lie between 0 and the number of samples and is interpreted as an absolute count. Complementary to test_size. If both train_size and test_size are given, train_size + test_size must equal 1 for floats (e.g. 0.8 + 0.2), or the total size of the dataset for ints (e.g. 800 + 200 for 1000 samples).

dtype: int | float
example: 0.8, 800
  • n_splits

Number of folds to be made in K-fold cross-validation. Must be at least 2. For more, see preprocessy.data_splitting.KFold.

dtype: int
example: 5
  • shuffle

Whether to shuffle the data before splitting. Default = False.

dtype: bool
  • random_state

Seed for the shuffling performed before splitting. Requires shuffle to be True.

dtype: int
example: 0
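
A minimal sketch of a custom score_func, as referenced in the score_func entry above. It scores each feature by the absolute Pearson correlation with the target and returns a single array of scores, matching the second form of the contract; the name abs_correlation_score is illustrative.

import numpy as np

def abs_correlation_score(X, y):
    # Score each feature (column of X) by |Pearson correlation| with y.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.array(
        [abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])]
    )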
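
Putting it together, a sketch of a complete params dict using only the built-in parameters documented above; the dataset paths and column names follow the Titanic examples on this page and are illustrative.

params = {
    "train_df_path": "/Users/home/datasets/titanic.csv",
    "test_df_path": "/Users/home/datasets/titanic_test.csv",
    "target_label": "Survived",
    "drop_cols": ["PassengerId", "Name"],
    "cat_cols": ["Gender", "Cabin", "Embarked"],
    "fill_missing": {"mean": ["Age"], "median": ["Fare"]},
    "one_hot": True,
    "remove_outliers": True,
    "first_quartile": 0.25,
    "third_quartile": 0.75,
    "type": "MinMaxScaler",
    "columns": ["Fare"],
    "test_size": 0.2,
    "shuffle": True,
    "random_state": 0,
}

Equivalently, the same mapping can be stored in a JSON file and its path supplied to the pipeline instead.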