Parameters for pipeline configuration¶
The parameters required by both the built-in pipelines as well as the custom pipelines
are given as a dict
object or as a path to a json
file. These params are used
by individual functions of the pipeline.
When writing custom functions, you can add custom params that are required by your function. The params dict will be passed to your function when it is called as a part of the pipeline and will have access to your custom parameter.
Warning
If the name of your custom parameter conflicts with one of those required by the built-in functions, then the value will override the parameter and lead to uncertain outputs and errors.
Please ensure that you do not use conflicting names for your custom parameters.
The paramters used by the built-in functions are listed below along with their meaning,
expected dtype and default value, if any. Any parameter marked *
is required.
train_df_path *
Path to the train dataset. Path should point to a file of one of the supported extensions.
For list of allowed extensions see preprocessy.input
.
dtype: str
example: "/Users/home/datasets/titanic.csv"
test_df_path
Path to the test dataset. Path should point to a file of one of the supported extensions.
For list of allowed extensions see preprocessy.input
.
dtype: str
example: "/Users/home/datasets/titanic_test.csv"
target_label
Name of the target column.
dtype: str
example: "Survived"
cat_cols
List of column names provided by user indicating these columns are to be encoded categorically. If None
then Preprocessy
analyses and identifies the columns on its own.
dtype: list[str]
example: ["Gender", "Cabin", "Embarked"]
ord_dict
Ordinal attributes require an associated weight mapping. This parameter is mapping from attribute column name to weight mapping. Weight mapping is a dict specifying the weight associated with each unique value of the attribute in consideration.
dtype: dict[str => dict[str => int | float]]
example: {
"Difficulty": {
"Easy": 5
"Medium": 10,
"Hard": 15
}
}
replace_cat_nulls
When handling missing data for categorical and ordinal attributes, the int
provided here will be used
to replace the null value. If no value is provided, then the column will be dropped.
dtype: int
example: 99
drop_cols
List of column names to be dropped.
dtype: list[str]
example: ["PassengerId", "Name"]
fill_missing
Dictionary of format {“method”: [col]} to indicate the method (mean
/median
) to be applied on specified list of columns.
dtype: dict["mean" | "median" => list[str]]
example: {
"mean": ["col_A", "col_B"],
"median": ["col_C"]
}
fill_values
Dictionary with keys as column names and values that fill the null records in corresponding column.
dtype: dict[str => any]
example: {
"Age": 19,
"Name": "John"
}
one_hot
True
if one hot encoding is desired. Default = False
.
dtype: bool
remove_outliers
True
if outlier records are to be removed. If both remove_outliers
and replace
are False
, a warning will be raised and no operation will be performed. Default = True
dtype: bool
replace
Boolean value to indicate if the outliers need to be replaced by -999
. Default = False
.
dtype: bool
first_quartile
Float value between 0 and 1, representing the first quartile marker. For more see preprocessy.outliers
dtype: float
example: 0.25
third_quartile
Float value between 0 and 1, representing the third quartile marker. For more see preprocessy.outliers
dtype: float
example: 0.75
type
The type of Scaler to be used. Default = StandardScaler
.
dtype: "MinMaxScaler" | "BinaryScaler" | "StandardScaler"
example: "MinMaxScaler"
columns
List of columns in the dataframe for which scaling is to be done. If None
is provided, defaults to all columns of a Numeric dtype.
dtype: list[str]
example : ["Fare"]
is_combined
Parameter to determine whether columns should be scaled together as a group.
dtype: bool
threshold
BinaryScaler
uses a dictionary of threshold values where the key is the column name and the
value is the threshold for that column. All values less than or equal to the threshold are scaled to 0.
Values above the threshold are scaled to 1. The default threshold value is 0.
dtype: dict[str => int | float]
example: {
"Age": 17
}
score_func
Function taking two arrays X and y, and returning a pair of arrays
(scores, pvalues)
or a single array with scores. score_func
can be custom
or used from sklearn.feature_selection
dtype: func(iterable, iterable) => (list[float], list[float])
example: f_classif from sklearn
k
Number of top features to select.
dtype: int
example: 10
test_size
Size of test set after splitting. Can take values from 0 - 1 for floating point values, 0 - Number of samples for integer values. It is complementary to train size.
dtype: int | float
example: 0.2, 200
train_size
Size of train set after splitting. Can take values from 0 - 1 for floating point values,
0 - Number of samples for integer values. It is complementary to test size. If both train_size
and test_size
are given, then train_size + test_size
should be equal to 1 if sizes are
floating point values, else the total size of the dataset.
dtype: int | float
example: 0.8, 800
n_splits
Number of folds to be made in K-fold cross validation. Must be at least 2.
For more see preprocessy.data_splitting.KFold
dtype: int
example: 5
shuffle
Decides whether to shuffle data before splitting. Default = False
dtype: bool
random_state
Seeding to be provided for shuffling before splitting. Requires shuffle
to be True
.
dtype: int
example: 0