Parameters for pipeline configuration
=====================================

The parameters required by both the built-in and the custom pipelines
are given as a ``dict`` object or as a path to a ``json`` file. These parameters are used
by the individual functions of the pipeline.
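
As a minimal sketch, the same parameters can be written either as a ``dict`` or as a ``json`` file.
The file name ``params.json`` and the values below are illustrative, and the sketch does not show
how the dict or the path is handed to a pipeline.

.. code:: python

    import json
    import os
    import tempfile

    # Parameters as a plain dict; the values are illustrative only.
    params = {
        "train_df_path": "/Users/home/datasets/titanic.csv",
        "target_label": "Survived",
        "drop_cols": ["PassengerId", "Name"],
    }

    # The same parameters saved as a JSON file (hypothetical file name).
    params_path = os.path.join(tempfile.mkdtemp(), "params.json")
    with open(params_path, "w") as f:
        json.dump(params, f, indent=4)

    # Reading the file back yields an equivalent dict.
    with open(params_path) as f:
        loaded = json.load(f)

A function such as ``score_func`` cannot be stored in JSON, so it would have to be supplied through the dict form.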

When writing custom functions, you can add custom parameters that your function requires.
The params dict is passed to your function when it is called as part of the pipeline,
so the function has access to your custom parameters.
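
The idea can be sketched as follows, assuming the custom function receives the params dict as its
only argument. The function ``describe_target`` and the custom parameter ``report_prefix`` are
hypothetical names invented for this illustration.

.. code:: python

    def describe_target(params):
        """Hypothetical custom step that reads a custom parameter."""
        # "report_prefix" is a custom parameter invented for this sketch.
        prefix = params.get("report_prefix", "")
        return f"{prefix}{params['target_label']}"

    params = {
        "train_df_path": "/Users/home/datasets/titanic.csv",
        "target_label": "Survived",
        "report_prefix": "target=",
    }

    message = describe_target(params)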

.. warning::
    If the name of your custom parameter conflicts with one of those required by the built-in
    functions, then your value will override the built-in parameter and may lead to unpredictable outputs and errors.

    Please ensure that you do not use conflicting names for your custom parameters.
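
One way to guard against such conflicts is to compare custom names against the names documented on
this page before building the params dict. The set ``BUILT_IN_PARAMS`` and the helper
``find_conflicts`` are assembled by hand for this sketch and are not part of ``Preprocessy``.

.. code:: python

    # Names documented on this page, collected by hand for this sketch.
    BUILT_IN_PARAMS = {
        "train_df_path", "test_df_path", "target_label", "cat_cols", "ord_dict",
        "replace_cat_nulls", "drop_cols", "fill_missing", "fill_values", "one_hot",
        "remove_outliers", "replace", "first_quartile", "third_quartile", "type",
        "columns", "is_combined", "threshold", "score_func", "k", "test_size",
        "train_size", "n_splits", "shuffle", "random_state",
    }

    def find_conflicts(custom_params):
        """Return custom parameter names that collide with built-in ones, sorted."""
        return sorted(set(custom_params) & BUILT_IN_PARAMS)

    conflicts = find_conflicts({"threshold": 3, "report_prefix": "target="})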

The parameters used by the built-in functions are listed below along with their meaning,
expected dtype and default value, if any. Any parameter marked ``*`` is required.

- **train_df_path \***

Path to the train dataset. The path should point to a file with one of the supported extensions.
For the list of allowed extensions, see :py:mod:`preprocessy.input`.

.. code:: python

    dtype: str
    example: "/Users/home/datasets/titanic.csv"

- **test_df_path**

Path to the test dataset. The path should point to a file with one of the supported extensions.
For the list of allowed extensions, see :py:mod:`preprocessy.input`.

.. code:: python

    dtype: str
    example: "/Users/home/datasets/titanic_test.csv"

- **target_label**

Name of the target column.

.. code:: python

    dtype: str
    example: "Survived"

- **cat_cols**

List of names of the columns that are to be encoded categorically. If ``None``, ``Preprocessy`` analyses the data and identifies these columns on its own.

.. code:: python

    dtype: list[str]
    example: ["Gender", "Cabin", "Embarked"]

- **ord_dict**

Ordinal attributes require an associated weight mapping. This parameter is a mapping from attribute column
name to weight mapping. A weight mapping is a dict specifying the weight associated with each unique value of
the attribute under consideration.

.. code:: python

    dtype: dict[str => dict[str => int | float]]
    example: {
        "Difficulty": {
            "Easy": 5,
            "Medium": 10,
            "Hard": 15
        }
    }
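
As a rough illustration of what a weight mapping expresses, the sketch below replaces each value of a
toy ``Difficulty`` column with its weight. It uses plain Python lists and is not necessarily how
``Preprocessy`` applies the mapping internally.

.. code:: python

    ord_dict = {
        "Difficulty": {"Easy": 5, "Medium": 10, "Hard": 15},
    }

    # A toy column of raw ordinal values.
    difficulty = ["Easy", "Hard", "Medium", "Easy"]

    # Replace each value with its weight from the mapping.
    weights = ord_dict["Difficulty"]
    encoded = [weights[value] for value in difficulty]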

- **replace_cat_nulls**

When handling missing data for categorical and ordinal attributes, the ``int`` provided here will be used
to replace the null value. If no value is provided, then the column will be dropped.

.. code:: python

    dtype: int
    example: 99

- **drop_cols**

List of column names to be dropped.

.. code:: python

    dtype: list[str]
    example: ["PassengerId", "Name"]

- **fill_missing**

Dictionary of the format ``{"method": [col]}`` indicating the method (``mean``/``median``) to be applied to the specified list of columns.

.. code:: python

    dtype: dict["mean" | "median" => list[str]]
    example: {
        "mean": ["col_A", "col_B"],
        "median": ["col_C"]
    }
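
The effect of such a dictionary can be sketched with the standard library alone. The toy data, with
``None`` standing in for a missing record, is an assumption of this example; the library itself
operates on dataframes.

.. code:: python

    from statistics import mean, median

    fill_missing = {"mean": ["col_A"], "median": ["col_C"]}

    # Toy data; None marks a missing record.
    data = {
        "col_A": [1.0, None, 3.0],
        "col_C": [10.0, 2.0, None, 4.0],
    }

    methods = {"mean": mean, "median": median}

    for method, cols in fill_missing.items():
        for col in cols:
            # Compute the statistic over the values that are present.
            present = [v for v in data[col] if v is not None]
            fill = methods[method](present)
            # Substitute the statistic wherever a value is missing.
            data[col] = [fill if v is None else v for v in data[col]]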

- **fill_values**

Dictionary whose keys are column names and whose values are used to fill the null records in the corresponding column.

.. code:: python

    dtype: dict[str => any]
    example: {
        "Age": 19,
        "Name": "John"
    }

- **one_hot**

``True`` if one hot encoding is desired. Default = ``False``.

.. code:: python

    dtype: bool

- **remove_outliers**

``True`` if outlier records are to be removed. If both ``remove_outliers`` and ``replace``
are ``False``, a warning will be raised and no operation will be performed. Default = ``True``.

.. code:: python

    dtype: bool

- **replace**

Boolean value to indicate if the outliers need to be replaced by ``-999``. Default = ``False``.

.. code:: python

    dtype: bool

- **first_quartile**

Float value between 0 and 1, representing the first quartile marker. For more, see :py:mod:`preprocessy.outliers`.

.. code:: python

    dtype: float
    example: 0.25

- **third_quartile**

Float value between 0 and 1, representing the third quartile marker. For more, see :py:mod:`preprocessy.outliers`.

.. code:: python

    dtype: float
    example: 0.75
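
To show how the two quartile markers might feed into outlier handling, the sketch below assumes the
common interquartile-range rule with a multiplier of 1.5. That rule, the multiplier and the helper
names ``quantile`` and ``iqr_bounds`` are assumptions of this example; the rule actually used is
defined in :py:mod:`preprocessy.outliers`.

.. code:: python

    def quantile(values, q):
        """Quantile with linear interpolation between the closest ranks."""
        ordered = sorted(values)
        pos = q * (len(ordered) - 1)
        lower = int(pos)
        upper = min(lower + 1, len(ordered) - 1)
        frac = pos - lower
        return ordered[lower] + frac * (ordered[upper] - ordered[lower])

    def iqr_bounds(values, first_quartile=0.25, third_quartile=0.75):
        """Lower and upper bounds outside of which a value counts as an outlier."""
        q1 = quantile(values, first_quartile)
        q3 = quantile(values, third_quartile)
        iqr = q3 - q1
        return q1 - 1.5 * iqr, q3 + 1.5 * iqr

    fares = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 500.0]
    low, high = iqr_bounds(fares)

    # remove_outliers: drop the records outside the bounds.
    kept = [v for v in fares if low <= v <= high]

    # replace: keep every record but substitute -999 for the outliers.
    replaced = [v if low <= v <= high else -999 for v in fares]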

- **type**

The type of Scaler to be used. Default = ``StandardScaler``.

.. code:: python

    dtype: "MinMaxScaler" | "BinaryScaler" | "StandardScaler"
    example: "MinMaxScaler"

- **columns**

List of columns in the dataframe that are to be scaled. If ``None`` is provided, defaults to all columns with a numeric dtype.

.. code:: python

    dtype: list[str]
    example: ["Fare"]

- **is_combined**

Parameter to determine whether columns should be scaled together as a group.

.. code:: python

    dtype: bool

- **threshold**

``BinaryScaler`` uses a dictionary of threshold values where the key is the column name and the
value is the threshold for that column. All values less than or equal to the threshold are scaled to 0.
Values above the threshold are scaled to 1. The default threshold value is 0.

.. code:: python

    dtype: dict[str => int | float]
    example: {
        "Age": 17
    }
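
The thresholding rule described above can be sketched in a few lines. The toy ``ages`` list is
illustrative, and the sketch works on a plain list and not on a dataframe column.

.. code:: python

    threshold = {"Age": 17}

    ages = [4, 17, 18, 35]

    # A column without an entry falls back to the default threshold of 0.
    limit = threshold.get("Age", 0)

    # Values less than or equal to the threshold become 0, the rest become 1.
    scaled = [0 if age <= limit else 1 for age in ages]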

- **score_func**

Function taking two arrays ``X`` and ``y``, and returning a pair of arrays
``(scores, pvalues)`` or a single array with scores. ``score_func`` can be custom
or taken from ``sklearn.feature_selection``.

.. code:: python

    dtype: func(iterable, iterable) => (list[float], list[float])
    example: f_classif from sklearn

- **k**

Number of top features to select.

.. code:: python

    dtype: int
    example: 10
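
A custom ``score_func`` that returns a single array of scores, combined with ``k``, might look like
the sketch below. The function ``mean_gap_score`` is a hypothetical scoring rule for a binary target,
and plain lists are used to keep the example self-contained; a real score function would typically
operate on NumPy arrays.

.. code:: python

    def mean_gap_score(X, y):
        """Hypothetical score: gap between per-class feature means (binary y)."""
        n_features = len(X[0])
        scores = []
        for j in range(n_features):
            pos = [row[j] for row, label in zip(X, y) if label == 1]
            neg = [row[j] for row, label in zip(X, y) if label == 0]
            scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
        return scores

    X = [[1.0, 5.0], [2.0, 5.0], [9.0, 5.0], [10.0, 5.0]]
    y = [0, 0, 1, 1]
    scores = mean_gap_score(X, y)

    # Keep the indices of the k highest-scoring features.
    k = 1
    top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]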

- **test_size**

Size of the test set after splitting. Floating point values must lie between 0 and 1;
integer values must lie between 0 and the number of samples. It is complementary to ``train_size``.

.. code:: python

    dtype: int | float
    example: 0.2, 200

- **train_size**

Size of the train set after splitting. Floating point values must lie between 0 and 1;
integer values must lie between 0 and the number of samples. It is complementary to ``test_size``.
If both ``train_size`` and ``test_size`` are given, then ``train_size + test_size`` should equal 1
if the sizes are floating point values, and the total size of the dataset otherwise.

.. code:: python

    dtype: int | float
    example: 0.8, 800
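
The consistency rule for the two sizes can be expressed as a small check. The helper
``sizes_are_consistent`` is a hypothetical name, and treating a mix of a fraction and a count as
inconsistent is an assumption of this sketch.

.. code:: python

    def sizes_are_consistent(train_size, test_size, n_samples):
        """Check that the two sizes together cover the whole dataset."""
        if isinstance(train_size, float) and isinstance(test_size, float):
            # Fractions must add up to 1, allowing for floating point error.
            return abs(train_size + test_size - 1.0) < 1e-9
        if isinstance(train_size, int) and isinstance(test_size, int):
            # Absolute counts must add up to the number of samples.
            return train_size + test_size == n_samples
        # Mixing a fraction with a count is treated as inconsistent here.
        return False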

- **n_splits**

Number of folds to be made in K-fold cross validation. Must be at least 2.
For more, see :py:mod:`preprocessy.data_splitting.KFold`.

.. code:: python

    dtype: int
    example: 5

- **shuffle**

Decides whether to shuffle the data before splitting. Default = ``False``.

.. code:: python

    dtype: bool

- **random_state**

Seed used for shuffling before splitting. Requires ``shuffle`` to be ``True``.

.. code:: python

    dtype: int
    example: 0
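
To show how ``n_splits``, ``shuffle`` and ``random_state`` interact, the sketch below generates
K-fold index splits with the standard library. The generator ``k_fold_indices`` is a hypothetical
helper and is not the implementation in :py:mod:`preprocessy.data_splitting.KFold`; spreading the
remainder over the first folds is an assumption of this example.

.. code:: python

    import random

    def k_fold_indices(n_samples, n_splits, shuffle=False, random_state=None):
        """Yield (train, test) index lists for each fold."""
        if n_splits < 2:
            raise ValueError("n_splits must be at least 2")
        indices = list(range(n_samples))
        if shuffle:
            # A seeded generator makes the shuffle reproducible.
            random.Random(random_state).shuffle(indices)
        # Spread the remainder over the first folds so sizes differ by at most one.
        base, extra = divmod(n_samples, n_splits)
        start = 0
        for fold in range(n_splits):
            stop = start + base + (1 if fold < extra else 0)
            test = indices[start:stop]
            train = indices[:start] + indices[stop:]
            yield train, test
            start = stop

    folds = list(k_fold_indices(10, n_splits=5, shuffle=True, random_state=0))

Calling the generator again with the same ``random_state`` produces the same folds, which is the point of fixing the seed.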