HyperGBM

HyperGBM is an open source project created by DataCanvas.

Overview

About HyperGBM

HyperGBM is a Full-Pipeline Automated Machine Learning Tool with functions ranging from data cleaning, preprocessing and feature engineering to model selection and hyperparameter tuning. It is an advanced AutoML tool for tabular data.

While many AutoML tools focus mainly on tuning the hyperparameters of different algorithms, HyperGBM defines a high-level search space that covers almost all components of machine learning modelling, from data cleaning to algorithm optimization. This end-to-end optimization approach is closer to a sequential decision process (SDP). Therefore, combined with a meta-learner, HyperGBM adopts advanced algorithms such as reinforcement learning and Monte-Carlo tree search to solve the full-pipeline optimization problem more effectively. These strategies have proven effective in practice.

For the machine learning models, HyperGBM uses popular gradient-boosting tree models, including XGBoost, LightGBM and HistGradientBoosting. HyperGBM also draws on many advanced features of CompeteExperiment from Hypernets for data cleaning, feature engineering and model ensembling.

The optimization algorithms, representations of search space and CompeteExperiment are based on Hypernets.

HyperGBM also supports full-pipeline GPU acceleration, covering all data processing and model training steps. In our experiments, this gave a 50x performance improvement. More importantly, a model trained on GPU can be deployed to an environment without GPU hardware and software (CUDA and cuML), which greatly reduces the cost of model deployment.

Features

There are four running types of HyperGBM:

  • Single node: runs on a single machine and uses pandas and NumPy data types

  • Single node with NVIDIA GPU device: runs on a single machine with NVIDIA GPU devices and uses cuDF and CuPy data types

  • Distributed with single node: runs on a single machine and uses Dask data types, which requires creating Dask collections before using HyperGBM

  • Distributed with multi nodes: runs on multiple machines and uses Dask data types, which requires creating Dask collections to manage resources across the machines before using HyperGBM
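Which running type is usable depends on which optional packages are installed. The following is a minimal sketch, not part of HyperGBM: the helper name available_backends and the grouping of packages are assumptions made for illustration, and the check only tells whether the packages named above can be found. It does not verify that a GPU or a Dask cluster is actually reachable.

```python
from importlib.util import find_spec

# Packages named in the list above, grouped by running type.
# The grouping is an illustration, not an official HyperGBM API.
RUNNING_TYPE_PACKAGES = {
    "single node": ["pandas", "numpy"],
    "single node with NVIDIA GPU": ["cudf", "cupy"],
    "distributed (Dask)": ["dask"],
}


def available_backends():
    """Return {running type: True/False} based on importable packages."""
    return {
        name: all(find_spec(pkg) is not None for pkg in packages)
        for name, packages in RUNNING_TYPE_PACKAGES.items()
    }


backends = available_backends()
for name, ok in backends.items():
    print(f"{name}: {'available' if ok else 'missing packages'}")
```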

An overview of the features supported by each running type is displayed in the following table:

Table columns: Features | Single node | Single node with GPU | Distributed with single node | Distributed with multi nodes

Features, grouped by category:

  • Data cleaning: Empty characters handling; Recognizing column types automatically; Column types correction; Constant columns cleaning; Repeated columns cleaning; Deleting examples without targets; Illegal characters replacing; id columns cleaning

  • Dataset splitting: Splitting by ratio; Adversarial validation

  • Feature engineering: Feature generation; Feature dimension reduction

  • Data preprocessing: SimpleImputer; SafeOrdinalEncoder; TargetEncoder; SafeOneHotEncoder; TruncatedSVD; StandardScaler; MinMaxScaler; MaxAbsScaler; RobustScaler

  • Imbalanced data handling: ClassWeight; UnderSampling (NearMiss, TomekLinks, Random); OverSampling (SMOTE, ADASYN, Random)

  • Search algorithms: MCTS; Evolution; Random search; Play back

  • Early stopping: time limit; no improvements are made after n trials; expected_reward; trial discriminator

  • Modeling algorithms: XGBoost; LightGBM; CatBoost; HistGradientBoosting

  • Evaluation: Cross-Validation; Train-Validation-Holdout

  • Advanced: Automatic task type inference; Data adaption; Collinearity detection; Data drift detection; Feature selection; Feature selection (Two-stage); Pseudo label (Two-stage); Pre-searching with UnderSampling; Model ensemble

Installation Guide

We recommend installing HyperGBM with conda or pip. It is also possible to install and use HyperGBM in a Docker container if you have a Docker environment.

HyperGBM requires Python 3.6 or above.

Using Conda

Install HyperGBM with conda from the channel conda-forge:

conda install -c conda-forge hypergbm

On Windows, we recommend installing pyarrow (required by Hypernets) version 4.0 or earlier together with HyperGBM:

conda install -c conda-forge hypergbm "pyarrow<=4.0"

Using Pip

Install HyperGBM with different pip options:

  • Typical installation:

pip install hypergbm
  • To run HyperGBM in JupyterLab/Jupyter notebook, install with command:

pip install hypergbm[notebook]
  • To support datasets containing Simplified Chinese text in feature generation, either:

    • install the jieba package before running HyperGBM,

    • or install with the command:

pip install hypergbm[zhcn]
  • Install all above with one command:

pip install hypergbm[all]

Using Docker

It is possible to use HyperGBM in a Docker container. To do this, users can install HyperGBM with pip in the Dockerfile. We also publish an image on Docker Hub, which can be downloaded directly and includes the following components:

  • Python3.7

  • HyperGBM and its dependent packages

  • JupyterLab

Download the image:

docker pull datacanvas/hypergbm

Use the image:

docker run -ti -e NotebookToken="your-token" -p 8888:8888 datacanvas/hypergbm

Then visit http://<your-ip>:8888 in a browser and enter the token (the value passed as NotebookToken) to start.

Requirements for GPU acceleration

  • cuML and cuDF

HyperGBM accelerates data processing with NVIDIA RAPIDS cuML and cuDF. Please install them before running HyperGBM on GPU. For detailed instructions, check the link Get RAPIDS.

  • LightGBM with GPU support

The default installation of LightGBM does not support GPU training. Please install a LightGBM build with GPU support before installing HyperGBM. For detailed instructions, check the link LightGBM GPU Tutorial.

  • XGBoost and CatBoost with GPU support

The default installations of XGBoost and CatBoost already support GPU training. However, if you build them from source code, please enable GPU support.

Quick Start

This section introduces the main features of HyperGBM, assuming that users already have basic machine learning knowledge such as loading data and training a model. If you have not installed HyperGBM yet, please refer to the Installation Guide.

Below are two examples of using HyperGBM: with Python and with the command line tool.

Use HyperGBM with Python

HyperGBM is developed in Python. We recommend using the Python utility make_experiment to create an experiment and train the model.

The basic steps for training a model with make_experiment are as follows:

  • Prepare the dataset (pandas or Dask DataFrame)

  • Create an experiment with make_experiment

  • Call the .run() method of the experiment to perform training and get the model

  • Predict with the trained model or save it with the Python tool pickle

Prepare the dataset

Depending on your task, the data can be loaded with either pandas or Dask to obtain a DataFrame for training the model.

Taking the sklearn dataset breast_cancer as an example, the dataset can be prepared as follows:

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

X,y = datasets.load_breast_cancer(as_frame=True,return_X_y=True)
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.7,random_state=335)
train_data = pd.concat([X_train,y_train],axis=1)

where train_data is used for model training, while X_test and y_test are used for evaluating the model.

Create experiment with make_experiment

Users can create an experiment for the prepared dataset and start training the model as follows:

from hypergbm import make_experiment


experiment = make_experiment(train_data, target='target', reward_metric='precision')
estimator = experiment.run()

where estimator is the trained model.

Save the model

It is recommended to save the model with pickle:

import pickle
with open('model.pkl','wb') as f:
  pickle.dump(estimator, f)
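To use the saved model later, load it back with pickle.load. The sketch below shows the round trip. Because a trained HyperGBM estimator is not available here, a plain dictionary (purely a stand-in) takes its place and the file is written to a temporary directory. With a real model, the environment that loads the file needs the same packages installed (for example hypergbm), and pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile

# Stand-in for the object returned by experiment.run() (an assumption
# of this sketch; a real estimator is pickled in exactly the same way).
estimator = {"name": "stand-in model", "classes": [0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")

    # Save, as in the snippet above.
    with open(model_path, "wb") as f:
        pickle.dump(estimator, f)

    # Load the model back, e.g. in a separate prediction script.
    with open(model_path, "rb") as f:
        loaded = pickle.load(f)

print(loaded == estimator)
```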

Evaluate the model

The model can be evaluated with tools provided by sklearn:

from sklearn.metrics import classification_report

y_pred=estimator.predict(X_test)
print(classification_report(y_test, y_pred, digits=5))

output:

              precision    recall  f1-score   support

           0    0.96429   0.93103   0.94737        58
           1    0.96522   0.98230   0.97368       113

    accuracy                        0.96491       171
   macro avg    0.96475   0.95667   0.96053       171
weighted avg    0.96490   0.96491   0.96476       171
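The figures in this report can be reproduced from a handful of counts. The counts below are not printed by HyperGBM or sklearn; they are inferred from the support, precision and recall columns above, so treat them as an assumption that matches the report to five digits.

```python
# Counts inferred from the report above (an assumption, see the text).
support = {0: 58, 1: 113}     # true samples per class
correct = {0: 54, 1: 111}     # correctly predicted samples per class

# Samples predicted as each class: its own correct ones plus the
# misclassified samples of the other class.
predicted = {
    0: correct[0] + (support[1] - correct[1]),   # 54 + 2 = 56
    1: correct[1] + (support[0] - correct[0]),   # 111 + 4 = 115
}

for label in (0, 1):
    precision = correct[label] / predicted[label]
    recall = correct[label] / support[label]
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{label}  precision={precision:.5f}  recall={recall:.5f}  f1={f1:.5f}")

accuracy = sum(correct.values()) / sum(support.values())
print(f"accuracy={accuracy:.5f}")
```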

More info:

Please refer to the docstring of make_experiment for more information about it:

print(make_experiment.__doc__)

If you are using Notebook or IPython, the following code can provide more information about make_experiment:

make_experiment?

Use HyperGBM with Command Line

HyperGBM offers the command line tool hypergbm to perform model training, evaluation and prediction. The following command shows the command line help:

hypergbm -h

usage: hypergbm [-h] [--log-level LOG_LEVEL] [-error] [-warn] [-info] [-debug]
                [--verbose VERBOSE] [-v] [--enable-dask ENABLE_DASK] [-dask]
                [--overload OVERLOAD]
                {train,evaluate,predict} ...

hypergbm offers three commands: train, evaluate and predict. To get more information, one can use hypergbm <command> -h:

hypergbm train -h
usage: hypergbm train [-h] --train-data TRAIN_DATA [--eval-data EVAL_DATA]
                      [--test-data TEST_DATA]
                      [--train-test-split-strategy {None,adversarial_validation}]
                      [--target TARGET]
                      [--task {binary,multiclass,regression}]
                      [--max-trials MAX_TRIALS] [--reward-metric METRIC]
                      [--cv CV] [-cv] [-cv-] [--cv-num-folds NUM_FOLDS]
                      [--pos-label POS_LABEL]
                      ...

Prepare the Data

When training a model with the command line, the training data must be saved as a csv or parquet file. The resulting model is saved as a pickle file with the extension .pkl.

Taking the Bank Marketing data as an example, the data can be prepared as follows:

from hypernets.tabular.datasets import dsutils
from sklearn.model_selection import train_test_split

df = dsutils.load_bank().head(10000)
df_train, df_test = train_test_split(df, test_size=0.3, random_state=9527)
df_train.to_csv('bank_train.csv', index=None)
df_test.to_csv('bank_eval.csv', index=None)

df_test.pop('y')
df_test.to_csv('bank_to_pred.csv', index=None)

where

  • bank_train.csv is used for training

  • bank_eval.csv is used for evaluating the model

  • bank_to_pred.csv is data without targets for predicting

Train the Model

After preparing the data, one can also perform model training with command line:

hypergbm train --train-data bank_train.csv --target y --model-file model.pkl

The file model.pkl is created when training finishes:

ls -l model.pkl

-rw-rw-r-- 1 xx xx 9154959    17:09 model.pkl
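When training is one step of a larger Python workflow, the same command can be launched with subprocess. This is a minimal sketch that assumes the hypergbm executable is on PATH; it uses only the options shown above and skips the call when the tool or the training file is missing.

```python
import os
import shlex
import shutil
import subprocess

# Same arguments as the command line above.
command = [
    "hypergbm", "train",
    "--train-data", "bank_train.csv",
    "--target", "y",
    "--model-file", "model.pkl",
]
command_text = " ".join(shlex.quote(arg) for arg in command)
print(command_text)

# Run only if the CLI is installed and the training file exists.
if shutil.which("hypergbm") and os.path.exists("bank_train.csv"):
    subprocess.run(command, check=True)
    print("model written:", os.path.exists("model.pkl"))
else:
    print("hypergbm CLI or bank_train.csv not found; command not executed")
```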

Evaluate the Model

The trained model can be evaluated with the evaluation data:

hypergbm evaluate --model model.pkl --data bank_eval.csv --metric f1 recall auc

{'f1': 0.7993779160186626, 'recall': 0.7099447513812155, 'auc': 0.9705420982746849}

Predict the Test Data

The trained model can be used to predict new data as follows:

hypergbm predict --model model.pkl --data bank_to_pred.csv --output bank_output.csv

where the prediction results are saved to bank_output.csv.

To add other columns of the input data to this file, use the parameter --with-data explicitly:

hypergbm predict --model model.pkl --data bank_to_pred.csv --output bank_output.csv --with-data id
head bank_output.csv

id,y
1563,no
124,no
218,no
463,no
...

To include all columns of the input data alongside the prediction results in bank_output.csv, set --with-data to '*':

hypergbm predict --model model.pkl --data bank_to_pred.csv --output bank_output.csv --with-data '*'
head bank_output.csv

id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
1563,55,entrepreneur,married,secondary,no,204,no,no,cellular,14,jul,455,13,-1,0,unknown,no
124,51,management,single,tertiary,yes,-55,yes,no,cellular,11,may,281,2,266,6,failure,no
218,49,blue-collar,married,primary,no,305,yes,yes,telephone,10,jul,834,10,-1,0,unknown,no
463,35,blue-collar,divorced,secondary,no,3102,yes,no,cellular,20,nov,138,1,-1,0,unknown,no
2058,50,management,divorced,tertiary,no,201,yes,no,cellular,24,jul,248,1,-1,0,unknown,no
...
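The output file is a plain CSV, so it can be post-processed with standard tools. Below is a small sketch using Python's csv module. To stay self-contained, it reads the four rows of the id,y output shown above from a string instead of opening bank_output.csv.

```python
import csv
from collections import Counter
from io import StringIO

# The first rows of bank_output.csv as printed above (--with-data id).
sample_output = """id,y
1563,no
124,no
218,no
463,no
"""

# For the real file, use: open('bank_output.csv', newline='')
reader = csv.DictReader(StringIO(sample_output))
rows = list(reader)

# Count the predicted labels and collect the ids predicted as 'yes'.
label_counts = Counter(row["y"] for row in rows)
yes_ids = [row["id"] for row in rows if row["y"] == "yes"]

print(label_counts)
print(yes_ids)
```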

HyperGBM: Job management with Hyperctl

Hyperctl is a general multi-job management tool whose uses include, but are not limited to, training, testing and comparison. This section introduces how to use hyperctl to manage HyperGBM training tasks.

First, the Python script hypergbm/job.py provided by HyperGBM reads all parameters of the hyperctl job. These parameters are then passed to the function hypergbm.make_experiment to create an experiment. Finally, the experiment is executed.

Example: Use Hyperctl to train a HyperGBM classification model

  • Create a directory; all operations will be executed within this directory

mkdir /tmp/hyperctl-example
cd /tmp/hyperctl-example
# curl -o heart-disease-uci.csv https://raw.githubusercontent.com/DataCanvasIO/Hypernets/master/hypernets/tabular/datasets/heart-disease-uci.csv
python -c "from hypernets.tabular.datasets.dsutils import load_heart_disease_uci;load_heart_disease_uci().to_csv('heart-disease-uci.csv', index=False)"
  • Create the hyperctl job configuration file batch.json. It includes the parameters train_data, target, log_level and run_kwargs:

{
    "jobs": [
        {
            "params": {
                "train_data": "/tmp/hyperctl-example/heart-disease-uci.csv",
                "target": "target",
                "log_level": "info",
                "run_kwargs": {
                  "max_trials": 10
                }
            },
            "execution": {
                "command": "python -m hypergbm.job"
            }
        }
    ]
}

Please note:

  1. In the configuration file, parameters like 'train_data', 'eval_data' and 'test_data' should be set to the corresponding file paths. The file format can be 'csv' or 'parquet'.

  2. In the configuration file, sub-parameters within 'run_kwargs' are used to configure the experiment. They are passed to the function hypernets.experiment.compete.CompeteExperiment.run

  • Execute the configured job:

hyperctl run --config ./batch.json
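For several jobs that differ only in a few parameters, batch.json can be generated with a short script instead of being written by hand. This is a sketch: it reuses only the keys shown in the configuration above, and running two jobs with different max_trials values is an illustrative assumption.

```python
import json

train_file = "/tmp/hyperctl-example/heart-disease-uci.csv"


def make_job(max_trials):
    """Build one job entry with the same keys as the example configuration."""
    return {
        "params": {
            "train_data": train_file,
            "target": "target",
            "log_level": "info",
            "run_kwargs": {"max_trials": max_trials},
        },
        "execution": {"command": "python -m hypergbm.job"},
    }


# Two jobs that differ only in the number of search trials (illustrative).
config = {"jobs": [make_job(10), make_job(20)]}

config_text = json.dumps(config, indent=4)
print(config_text)

# To write the file into the working directory:
# with open("batch.json", "w") as f:
#     f.write(config_text)
```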

HyperGBM: Experiment Visualization in Notebook

This section demonstrates how to visualize a HyperGBM experiment in Jupyter Notebook. With the visualization tool, you can:

  1. check the experiment configurations

  2. check the dataset information

  3. check the processing information

To use these features, an additional package needs to be installed:

pip install experiment-notebook-widget

Example

  1. import the required packages

import warnings
warnings.filterwarnings('ignore')

from hypernets.utils import logging

from sklearn.model_selection import train_test_split

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
  2. create an experiment

df = dsutils.load_bank()

df_train, df_test = train_test_split(df, test_size=0.8, random_state=42)

experiment = make_experiment(df_train, target='y')
experiment

The experiment configuration is shown below: _images/hypergbm_experiment_config.png

  3. plot the dataset information

experiment.plot_dataset()

The output information is shown below: _images/hypergbm_experiment_dataset.png

  4. plot the processing information

experiment.run(max_trials=20)

The output information is shown below: _images/hypergbm_experiment_process.png

Check the Notebook example hypegbm_experiment_notebook_visualization.ipynb

Examples

Basic Applications

This section provides an example of training a model with the tool make_experiment. The example uses the blood dataset, which is loaded from hypernets.tabular. The columns of this dataset are shown below:

Recency,Frequency,Monetary,Time,Class
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
2,20,5000,45,1
1,24,6000,77,0
4,4,1000,4,0

...

Create and Run an Experiment

The tool make_experiment creates an executable experiment object. Its only required parameter is train_data. Calling the run method of the created experiment object then starts training and returns a model. Note that if the target column of the data is not y, it must be set explicitly through the parameter target.

Example code:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class')
estimator = experiment.run()
print(estimator)

output:

Pipeline(steps=[('data_clean',
                 DataCleanStep(...),
                ('estimator',
                 GreedyEnsemble(...)])

Training returns a Pipeline whose final estimator is an ensemble of multiple models.

For training data stored in a .csv or .parquet file, the experiment can be created from the file path directly, and make_experiment will load the data as a DataFrame automatically. For example:

from hypergbm import make_experiment

train_data = '/path/to/mydata.csv'
experiment = make_experiment(train_data, target='my_target')
estimator = experiment.run()
print(estimator)

Set the Number of Search Trials

The maximum number of search trials can be set with max_trials. The following code sets it to 100:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', max_trials=100)
estimator = experiment.run()
print(estimator)

Use Cross Validation

Users can enable cross validation in the experiment with the parameter cv. With cv=False, the experiment does not use cross validation and uses train_test_split instead. With cv=True, the experiment uses cross validation, and the number of folds can be adjusted with the parameter num_folds, whose default value is 3.

Example code with cv=True:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=True, num_folds=5)
estimator = experiment.run()
print(estimator)
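To make the effect of num_folds concrete, the sketch below splits row indices into folds so that every row is used for validation exactly once. It illustrates k-fold cross validation in general, not the exact splitting logic used inside HyperGBM (which may shuffle or stratify the rows). The row count 748 is the number of rows reported for the blood dataset in the log output of the Set Log Levels section.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, valid_indices) for each fold, without shuffling."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_rows, num_folds)
    start = 0
    for fold in range(num_folds):
        size = base + (1 if fold < extra else 0)
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, valid
        start += size


folds = list(k_fold_indices(n_rows=748, num_folds=5))
for i, (train, valid) in enumerate(folds):
    print(f"fold {i}: train={len(train)} rows, valid={len(valid)} rows")
```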

Evaluation dataset

When cv=False, the experiment additionally requires an evaluation dataset to evaluate its performance. It can be provided by setting eval_data when calling make_experiment. For example:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
from sklearn.model_selection import train_test_split

train_data = dsutils.load_blood()
train_data,eval_data=train_test_split(train_data,test_size=0.3)
experiment = make_experiment(train_data, target='Class', eval_data=eval_data, cv=False)
estimator = experiment.run()
print(estimator)

If the eval_data is not given, the experiment object will split the train_data to obtain an evaluation dataset, whose size can be adjusted by setting eval_size:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', cv=False, eval_size=0.2)
estimator = experiment.run()
print(estimator)

Set the Evaluation Criterion

The default evaluation criterion of the experiment is accuracy for classification tasks and rmse for regression tasks. Other criteria can be set through reward_metric. For example:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', reward_metric='auc')
estimator = experiment.run()
print(estimator)

Set the Early Stopping

The early stopping strategy can be configured with early_stopping_rounds, early_stopping_time_limit and early_stopping_reward.

The following code limits the search time to 3 hours:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', max_trials=300, early_stopping_time_limit=3600 * 3)
estimator = experiment.run()
print(estimator)
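The early stopping settings can be combined, and the search stops as soon as any one of them is satisfied. The class below is a simplified sketch of that logic for a metric that is maximized. It is not the Hypernets implementation, and the class and argument names (EarlyStopper, max_no_improvement_trials, time_limit, expected_reward) are hypothetical.

```python
import time


class EarlyStopper:
    """Stop when any configured limit is reached (metric is maximized)."""

    def __init__(self, max_no_improvement_trials=None, time_limit=None,
                 expected_reward=None):
        self.max_no_improvement_trials = max_no_improvement_trials
        self.time_limit = time_limit            # seconds
        self.expected_reward = expected_reward
        self.best_reward = None
        self.trials_without_improvement = 0
        self.start_time = time.time()

    def update(self, reward):
        """Record one trial's reward; return a reason to stop, or None."""
        if self.best_reward is None or reward > self.best_reward:
            self.best_reward = reward
            self.trials_without_improvement = 0
        else:
            self.trials_without_improvement += 1

        if (self.expected_reward is not None
                and self.best_reward >= self.expected_reward):
            return "expected reward reached"
        if (self.max_no_improvement_trials is not None
                and self.trials_without_improvement >= self.max_no_improvement_trials):
            return "no improvement"
        if (self.time_limit is not None
                and time.time() - self.start_time >= self.time_limit):
            return "time limit"
        return None


stopper = EarlyStopper(max_no_improvement_trials=3, time_limit=3600 * 3)
stop_reason = None
for trial_no, reward in enumerate([0.70, 0.74, 0.73, 0.74, 0.72, 0.71], start=1):
    stop_reason = stopper.update(reward)
    if stop_reason:
        print(f"stopped after trial {trial_no}: {stop_reason}")
        break
```

In this run the best reward (0.74) is reached in the second trial and is not exceeded in the following three trials, so the loop ends at the fifth trial.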

Choose a Searcher

HyperGBM performs hyperparameter search using the search algorithms provided by Hypernets, which include EvolutionSearcher, MCTSSearcher and RandomSearcher. A specific searcher can be chosen by setting the parameter searcher when using make_experiment.

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', searcher='random')
estimator = experiment.run()
print(estimator)

Furthermore, you can create your own searcher object and pass it to the experiment, for example:

from hypergbm import make_experiment
from hypergbm.search_space import search_space_general
from hypernets.searchers import MCTSSearcher
from hypernets.tabular.datasets import dsutils

my_searcher = MCTSSearcher(lambda: search_space_general(n_estimators=100),
                           max_node_space=20,
                           optimize_direction='max')

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', searcher=my_searcher)
estimator = experiment.run()
print(estimator)

Ensemble Models

make_experiment enables model ensembling by default to obtain a better model. It ensembles the best 20 models; this number can be changed by setting ensemble_size as in the following code, where ensemble_size=0 means no ensembling will be performed.

train_data = ...
experiment = make_experiment(train_data, ensemble_size=10, ...)

Set Log Levels

Progress messages during training can be printed by setting log_level (str or int). Please refer to the logging package of Python for further details. More comprehensive messages are printed when verbose is set to 1.

The following code sets the log level to 'INFO':

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', log_level='INFO', verbose=1)
estimator = experiment.run()
print(estimator)

Output:

14:24:33 I hypernets.tabular.u._common.py 30 - 2 class detected, {0, 1}, so inferred as a [binary classification] task
14:24:33 I hypergbm.experiment.py 699 - create experiment with ['data_clean', 'drift_detection', 'space_search', 'final_ensemble']
14:24:33 I hypergbm.experiment.py 1262 - make_experiment with train data:(748, 4), test data:None, eval data:None, target:Class
14:24:33 I hypergbm.experiment.py 716 - fit_transform data_clean
14:24:33 I hypergbm.experiment.py 716 - fit_transform drift_detection
14:24:33 I hypergbm.experiment.py 716 - fit_transform space_search
14:24:33 I hypernets.c.meta_learner.py 22 - Initialize Meta Learner: dataset_id:7123e0d8c8bbbac8797ed9e42352dc59
14:24:33 I hypernets.c.callbacks.py 192 - 
Trial No:1
--------------------------------------------------------------
(0) estimator_options.hp_or:                                0
(1) numeric_imputer_0.strategy:                 most_frequent
(2) numeric_scaler_optional_0.hp_opt:                    True


...

14:24:35 I hypergbm.experiment.py 716 - fit_transform final_ensemble
14:24:35 I hypergbm.experiment.py 737 - trained experiment pipeline: ['data_clean', 'estimator']
Pipeline(steps=[('data_clean',
                 DataCleanStep(...),
                ('estimator',
                 GreedyEnsemble(...)

Experiment Visualization

HyperGBM provides a web-based user interface, enabled by setting the argument webui=True, which displays all processing and parameter information in a dashboard.

Note: This feature requires installing hypergbm with the command:

pip install hypergbm[board]

Example code for enabling web-based experiment visualization is shown below:

from sklearn.model_selection import train_test_split

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

df = dsutils.load_bank()
df_train, df_test = train_test_split(df, test_size=0.8, random_state=42)

experiment = make_experiment(df_train, target='y', webui=True)
estimator = experiment.run(max_trials=10)

print(estimator)

The output is:

02-17 19:08:48 I hypernets.t.estimator_detector.py 85 - EstimatorDetector error: GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1
...
server is running at: 0.0.0.0:8888 
...

02-17 19:08:55 I hypernets.t.metrics.py 153 - calc_score ['auc', 'accuracy'], task=binary, pos_label=yes, classes=['no' 'yes'], average=None
final result:{'auc': 0.8913467492260062, 'accuracy': 0.8910699474702792}

You can then see the experiment progress dashboard by opening http://localhost:8888. A screenshot is displayed below: images/experiment-web-visualization.png

Other options are available to configure the web UI: event_file_dir defines the directory for event files, server_port sets the server port, and exit_web_server_on_finish defines whether the web server exits after the current experiment finishes. See the example:

...
webui_options = {
    'event_file_dir': "./events",  # persist experiment running events log to './events'
    'server_port': 8888, # http server port
    'exit_web_server_on_finish': False  # exit http server after experiment finished
}
experiment = make_experiment(df_train, target='y', webui=True, webui_options=webui_options)
...

Advanced applications

make_experiment in HyperGBM creates an instance of CompeteExperiment from Hypernets. This section covers many of the advanced features of CompeteExperiment.

Flowchart:

  • Overall: Data adaption -> Data cleaning -> 1st stage -> 2nd stage -> Model ensemble

  • 1st stage: Feature generation -> (Collinearity detection -> Drift detection -> Feature selection) -> Search optimization

  • 2nd stage (drawn with a dashed border): (2nd-stage feature selection -> Pseudo label) -> 2nd-stage search optimization

Data Adaption

This step supports only pandas and cuML data types. Relevant parameters:

  • data_adaption: (default True). Whether to enable data adaption.

  • data_adaption_memory_limit: (default 0.05). If float, it should be between 0.0 and 1.0 and represents the proportion of the free system memory. If int, it represents the absolute number of bytes of memory.

  • data_adaption_min_cols: (default 0.3). If float, it should be between 0.0 and 1.0 and represents the proportion of the original dataframe's columns. If int, it represents the absolute number of columns.

  • data_adaption_target: (default None). Where to run the next steps. If 'cuml' or 'cuda', the training data is converted to cuML data types and the next steps run on NVIDIA GPU devices. If None, the training data types are left unchanged.
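Both data_adaption_memory_limit and data_adaption_min_cols accept either a float (a proportion) or an int (an absolute value). The helper below sketches that convention. The name resolve_limit is hypothetical and the memory and column figures are made up for illustration; this is not the code Hypernets runs.

```python
def resolve_limit(value, total):
    """Interpret `value` as a proportion of `total` (float) or as absolute (int)."""
    if isinstance(value, bool):
        raise TypeError("value must be a float or an int, not bool")
    if isinstance(value, float):
        if not 0.0 <= value <= 1.0:
            raise ValueError("a float limit must be between 0.0 and 1.0")
        return int(total * value)
    if isinstance(value, int):
        return value
    raise TypeError("value must be a float or an int")


free_memory = 8 * 1024 ** 3     # assume 8 GiB of free system memory
n_columns = 200                 # assume the original dataframe has 200 columns

print(resolve_limit(0.25, free_memory))             # a quarter of the free memory, in bytes
print(resolve_limit(512 * 1024 ** 2, free_memory))  # absolute value: 512 MiB
print(resolve_limit(0.5, n_columns))                # half of the columns
print(resolve_limit(30, n_columns))                 # absolute value: 30 columns
```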

Data cleaning

CompeteExperiment performs data cleaning with DataCleaner from Hypernets. This step cannot be disabled, but the behaviour of DataCleaner can be adjusted with the following options:

  • nan_chars: value or list, (default None), characters to replace with np.nan

  • correct_object_dtype: bool, (default True), whether to correct the data types

  • drop_constant_columns: bool, (default True), whether to drop constant columns

  • drop_duplicated_columns: bool, (default False), whether to drop duplicated columns

  • drop_idness_columns: bool, (default True), whether to drop id columns

  • drop_label_nan_rows: bool, (default True), whether to drop rows whose target value is np.nan

  • replace_inf_values: (default np.nan), the value used to replace np.inf

  • drop_columns: list, (default None), columns to drop

  • reserve_columns: list, (default None), columns to keep when performing data cleaning

  • reduce_mem_usage: bool, (default False), whether to try to reduce the memory usage

  • int_convert_to: str, (default 'float'), type to convert int columns to; None for no conversion

If missing values are represented by '\N' in the data, users can replace '\N' with np.nan during data cleaning as follows:

from hypergbm import make_experiment

train_data = ...
experiment = make_experiment(train_data, target='...',
                            data_cleaner_args={'nan_chars': r'\N'})
...
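To illustrate what two of these options do, the following sketch applies the idea behind nan_chars and drop_constant_columns to a tiny table of plain Python dictionaries. It only mimics the described behaviour: it is not the DataCleaner implementation, the function names are hypothetical, and the sample rows are made up.

```python
# Made-up sample rows; '\N' marks a missing value, 'source' is constant.
rows = [
    {"age": "55", "job": "entrepreneur", "source": "web"},
    {"age": r"\N", "job": "management", "source": "web"},
    {"age": "49", "job": r"\N", "source": "web"},
]


def replace_nan_chars(rows, nan_chars):
    """Replace every cell equal to `nan_chars` with None (stand-in for np.nan)."""
    return [
        {col: (None if val == nan_chars else val) for col, val in row.items()}
        for row in rows
    ]


def drop_constant_columns(rows):
    """Remove columns that hold the same value in every row."""
    columns = list(rows[0].keys())
    constant = [c for c in columns if len({row[c] for row in rows}) == 1]
    return [{c: v for c, v in row.items() if c not in constant} for row in rows]


cleaned = drop_constant_columns(replace_nan_chars(rows, r"\N"))
for row in cleaned:
    print(row)
```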

Feature generation

CompeteExperiment can perform feature generation, which is turned on by setting feature_generation=True when creating an experiment with make_experiment. There are several options:

  • feature_generation_continuous_cols: list (default None), continuous features; inferred automatically if set to None.

  • feature_generation_categories_cols: list (default None), categorical features; must be set explicitly, since CompeteExperiment cannot infer them automatically.

  • feature_generation_datetime_cols: list (default None), datetime features; inferred automatically if set to None.

  • feature_generation_latlong_cols: list (default None), latitude and longitude features; inferred automatically if set to None.

  • feature_generation_text_cols: list (default None), text features; inferred automatically if set to None.

  • feature_generation_trans_primitives: list (default None), transformations for feature generation; inferred automatically if set to None.

When feature_generation_trans_primitives=None, CompeteExperiment automatically infers the transformations from the column types. Specifically, the following transformations are applied to each type:

  • continuous_cols: none by default; transformations need to be set explicitly.

  • categories_cols: cross_categorical.

  • datetime_cols: month, week, day, hour, minute, second, weekday, is_weekend.

  • latlong_cols: haversine, geohash.

  • text_cols: tfidf.
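As an illustration of the datetime transformations listed above, the sketch below derives the same eight values from a single timestamp with Python's standard library. In HyperGBM the features are produced by the feature generation step itself; the function name and the conventions used here (ISO week number, Monday as weekday 0) are assumptions of this sketch.

```python
from datetime import datetime


def datetime_features(ts):
    """Derive simple calendar features from one timestamp."""
    return {
        "month": ts.month,
        "week": ts.isocalendar()[1],      # ISO week number
        "day": ts.day,
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
        "weekday": ts.weekday(),          # Monday == 0 ... Sunday == 6
        "is_weekend": ts.weekday() >= 5,
    }


features = datetime_features(datetime(2021, 7, 14, 17, 9, 30))
print(features)
```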

An example code for enabling feature generation:

from hypergbm import make_experiment

train_data = ...
experiment = make_experiment(train_data,
                           feature_generation=True,
                           ...)
...

Please refer to featuretools (https://docs.featuretools.com/) for more information.

Collinearity detection

Datasets often contain highly correlated (collinear) features, which add little information and act more like noise. They are not very useful; on the contrary, they make the dataset more sensitive to drift in these features.

CompeteExperiment can handle these collinear features. This is enabled by setting collinearity_detection=True when creating an experiment.

Example code for using collinearity detection:

from hypergbm import make_experiment

train_data = ...
experiment = make_experiment(train_data, target='...', collinearity_detection=True)
...
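The blood dataset shown in Basic Applications contains an obvious case: in the sample rows, Monetary is always 250 times Frequency. The sketch below measures this with a plain Pearson correlation. It only illustrates what collinearity means; the detection method actually used by CompeteExperiment may differ.

```python
from math import sqrt

# Sample rows of the blood dataset shown in "Basic Applications".
frequency = [50, 13, 16, 20, 24, 4]
monetary = [12500, 3250, 4000, 5000, 6000, 1000]
recency = [2, 0, 1, 2, 1, 4]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


print(round(pearson(frequency, monetary), 6))   # perfectly collinear pair
print(round(pearson(frequency, recency), 6))    # much weaker relationship
```

One of two perfectly correlated columns such as Frequency and Monetary can be removed without losing information.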

Drift detection

Concept drift is one of the major challenges in machine learning. Models often perform worse in practice because data distributions change over time. To handle this problem, CompeteExperiment uses Adversarial Validation to detect drifted features and drops them to maintain good performance.

To enable drift detection, one needs to set drift_detection=True when creating experiment and provide test_data.

Relevant parameters:

  • drift_detection_remove_shift_variable : bool, (default=True), whether to detect the stability of every column first.

  • drift_detection_variable_shift_threshold : float, (default=0.7), columns with stability scores higher than this value will be dropped.

  • drift_detection_threshold : float, (default=0.7), columns with detection scores higher than this value will be dropped.

  • drift_detection_remove_size : float, (default=0.1), ratio of columns to be dropped.

  • drift_detection_min_features : int, (default=10), the minimal number of columns to be reserved.

  • drift_detection_num_folds : int, (default=5), the number of folds for cross validation.

A code example:

from io import StringIO
import pandas as pd
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

test_data = """
Recency,Frequency,Monetary,Time
2,10,2500,64
4,5,1250,23
4,9,2250,46
4,5,1250,23
4,8,2000,40
2,12,3000,82
11,24,6000,64
2,7,1750,46
4,11,2750,61
1,7,1750,57
2,11,2750,79
2,3,750,16
4,5,1250,26
2,6,1500,41
"""

train_data = dsutils.load_blood()
test_df = pd.read_csv(StringIO(test_data))
experiment = make_experiment(train_data, test_data=test_df,
                             drift_detection=True,
                             ...)

...
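Adversarial validation trains a classifier to tell training rows from test rows; the easier that task is, the stronger the drift. The sketch below reduces this idea to a single feature used directly as the classifier score, and measures separability with AUC. It is a simplified stand-in rather than the CompeteExperiment implementation; the function names are hypothetical, and the 0.7 threshold only mirrors the default of drift_detection_variable_shift_threshold.

```python
def shift_score(train_values, test_values):
    """AUC of telling test rows from train rows using one feature as the score.

    0.5 means the two samples are indistinguishable; 1.0 means fully separated.
    """
    wins = 0.0
    for t in test_values:
        for r in train_values:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5  # ties count as half a win
    auc = wins / (len(train_values) * len(test_values))
    return max(auc, 1.0 - auc)  # the direction of the shift does not matter


def drifted_columns(train, test, threshold=0.7):
    """Names of columns whose shift score exceeds the threshold."""
    return [c for c in train if shift_score(train[c], test[c]) > threshold]


train = {'stable': [1, 2, 3, 4], 'shifted': [1, 2, 3, 4]}
test = {'stable': [1, 2, 3, 4], 'shifted': [10, 11, 12, 13]}
print(drifted_columns(train, test))  # ['shifted']
```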

Feature selection

CompeteExperiment evaluates the importances of features by training a model. Then it chooses the most important ones among them to continue the model training.

To enable feature selection, set feature_selection=True when creating the experiment. Relevant parameters:

  • feature_selection_strategy: str, (default threshold), the selection strategy; can be chosen from threshold, number and quantile.

  • feature_selection_threshold: float, (default 0.1), the selection threshold when the strategy is threshold; features with scores higher than this threshold will be selected.

  • feature_selection_quantile: float, (default 0.2), the selection quantile when the strategy is quantile; features with scores higher than this quantile will be selected.

  • feature_selection_number: int or float, (default 0.8), the number of features to select when the strategy is number.

Example code:

from hypergbm import make_experiment

train_data=...
experiment = make_experiment(train_data,
                             feature_selection=True,
                             feature_selection_strategy='quantile',
                             feature_selection_quantile=0.3,
                             ...)
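The three selection strategies can be illustrated with a small standalone sketch. The function below is hypothetical and only mirrors the parameter descriptions above; in particular, treating a float number as a fraction of all features and using linear interpolation for the quantile are assumptions, not confirmed HyperGBM behaviour.

```python
def quantile_value(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (pos - low) * (ordered[high] - ordered[low])


def select_features(importances, strategy='threshold',
                    threshold=0.1, quantile=0.2, number=0.8):
    """Return feature names chosen by one of the three strategies."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    if strategy == 'threshold':
        return [f for f in ranked if importances[f] > threshold]
    if strategy == 'quantile':
        cut = quantile_value(list(importances.values()), quantile)
        return [f for f in ranked if importances[f] > cut]
    if strategy == 'number':
        # Assumption: a float is a fraction of all features, an int is a count.
        k = number if isinstance(number, int) else max(1, int(number * len(ranked)))
        return ranked[:k]
    raise ValueError(f'unknown strategy: {strategy}')


scores = {'a': 0.5, 'b': 0.3, 'c': 0.15, 'd': 0.05}
print(select_features(scores, 'threshold'))               # ['a', 'b', 'c']
print(select_features(scores, 'quantile', quantile=0.5))  # ['a', 'b']
print(select_features(scores, 'number', number=2))        # ['a', 'b']
```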

The second stage feature selection

CompeteExperiment supports continuing data processing with the trained model, which is called Two-stage search. CompeteExperiment supports two kinds of Two-stage processing: Two-stage feature selection and pseudo labeling, both of which are covered in the rest of this section.

In CompeteExperiment, the second stage feature selection picks the models that performed well in the first stage and evaluates them with permutation_importance to obtain a better set of features.

To enable the second stage feature selection, set feature_reselection=True when creating the experiment. Relevant parameters:

  • feature_reselection_estimator_size: int, (default=10), the number of models used for evaluating feature importances (the top n best models in the first stage).

  • feature_reselection_strategy: str, (default threshold), the selection strategy; available strategies include threshold, number and quantile.

  • feature_reselection_threshold: float, (default 1e-5), the threshold when the selection strategy is threshold; features with importance scores higher than this value will be chosen.

  • feature_reselection_quantile: float, (default 0.2), the threshold when the selection strategy is quantile; features with importance scores higher than this value will be chosen.

  • feature_reselection_number: int or float, (default 0.8), the number of features to be selected when the strategy is number.

Example code:

from hypergbm import make_experiment

train_data=...
experiment = make_experiment(train_data,
                             feature_reselection=True,
                             ...)

Please refer to [scikit-learn](https://scikit-learn.org/stable/modules/permutation_importance.html) for more information about permutation_importance.
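As a rough illustration of the technique, permutation importance measures how much a model's score drops when one column is shuffled. The sketch below is a minimal pure-Python version with a toy model; it is not the scikit-learn implementation, and all names in it are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the given label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild every row with only the current column replaced.
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy model that only looks at column 0, so column 1 should score zero.
def model(row):
    return int(row[0] > 9)


rows = [[i, (i * 7) % 5] for i in range(20)]
labels = [int(i > 9) for i in range(20)]
print(permutation_importance(model, rows, labels))
```

Features whose importance stays near zero, like column 1 here, are the ones a reselection step would drop.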

Pseudo label

Pseudo labeling is a semi-supervised machine learning method. It assigns labels predicted by the model trained in the first stage to some examples in the test data. Examples whose confidence values are higher than a threshold are then added to the training set to train the model again.

To enable pseudo labeling, set pseudo_labeling=True when creating the experiment. Relevant parameters:

  • pseudo_labeling_strategy: str, (default threshold), the selection strategy; available strategies include threshold, number and quantile.

  • pseudo_labeling_proba_threshold: float, (default 0.8), the threshold when the selection strategy is threshold; examples with confidence scores higher than this value will be chosen.

  • pseudo_labeling_proba_quantile: float, (default 0.8), the threshold when the selection strategy is quantile; examples with confidence scores higher than this value will be chosen.

  • pseudo_labeling_sample_number: float (0.0~1.0) or int, (default 0.2), the number of top examples to be selected when the strategy is number.

  • pseudo_labeling_resplit: bool, (default=False), whether to re-split the training and validation sets after adding pseudo-labeled examples. If False, all pseudo-labeled examples are added to the training set. Otherwise, the experiment splits the new dataset with pseudo labels into training and validation sets again.

Example code:

from hypergbm import make_experiment

train_data=...
test_data=...
experiment = make_experiment(train_data,
                             test_data=test_data,
                             pseudo_labeling=True,
                             ...)

Note: Pseudo labeling is only valid for classification tasks.
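The threshold strategy can be sketched in a few lines: keep only the test rows whose highest predicted class probability exceeds the threshold, and label them with that class. The function name and the probability values below are hypothetical; the sketch only mirrors the parameter description above.

```python
def select_pseudo_labels(probabilities, classes, proba_threshold=0.8):
    """Return (row_index, label) pairs whose top probability exceeds the threshold."""
    selected = []
    for i, proba in enumerate(probabilities):
        best = max(range(len(proba)), key=proba.__getitem__)
        if proba[best] > proba_threshold:
            selected.append((i, classes[best]))
    return selected


# Hypothetical predicted probabilities for four test rows and classes ['no', 'yes'].
test_proba = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90], [0.75, 0.25]]
print(select_pseudo_labels(test_proba, ['no', 'yes']))
# [(0, 'no'), (2, 'yes')]
```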

Handling Imbalanced Data

Imbalanced data is one of the most frequently encountered challenges in practice, and it usually leads to barely satisfactory models. To alleviate this problem, HyperGBM supports the following two solutions:

Adopt ClassWeight

When building a model such as LightGBM, the class distribution is calculated first, and each class is then assigned a weight according to its frequency when computing the loss. To enable the ClassWeight algorithm, set the parameter class_balancing='ClassWeight' when using make_experiment.

from hypergbm import make_experiment

train_data = ...
experiment = make_experiment(train_data,
                             class_balancing='ClassWeight',
                             ...)
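To illustrate how class weights can follow from the class distribution, the sketch below uses the common "balanced" heuristic, n_samples / (n_classes * class_count). Whether HyperGBM computes its weights with exactly this formula is an assumption; the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


y = [0] * 8 + [1] * 2
print(balanced_class_weights(y))  # {0: 0.625, 1: 2.5}
```

The minority class receives the larger weight, so its errors count more in the loss.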

UnderSampling and OverSampling

The most common approach to handling data imbalance is to modify the data distribution to get a more balanced dataset and then train the model on the modified dataset. Currently, HyperGBM supports several resampling strategies, including RandomOverSampler, SMOTE, ADASYN, RandomUnderSampler, NearMiss, TomekLinks, and EditedNearestNeighbours. To enable a sampling method, set class_balancing='<selected strategy>' when using make_experiment. Example code is as follows:

from hypergbm import make_experiment

train_data = ...
experiment = make_experiment(train_data,
                             class_balancing='SMOTE',
                             ...)

For more information regarding these sampling methods, please see imbalanced-learn.
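As a minimal sketch of what resampling does, the example below implements plain random oversampling: minority-class rows are duplicated at random until every class is as large as the biggest one. It is a standard-library illustration, not the imbalanced-learn implementation, and the function name is hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - cnt):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # Counter({0: 8, 1: 8})
```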

Search Space

When not defined explicitly, make_experiment will use search_space_general as its search space, which is defined as follows:

search_space_general = GeneralSearchSpaceGenerator(n_estimators=200)

Define Search Space

To use a specific search space, set the search_space parameter when calling make_experiment. For example, to fix max_depth at 20 for XGBoost:

from hypergbm import make_experiment
from hypergbm.search_space import GeneralSearchSpaceGenerator

my_search_space = \
    GeneralSearchSpaceGenerator(n_estimators=200, xgb_init_kwargs={'max_depth': 20})

train_data = ...

experiment = make_experiment(train_data,
                             search_space=my_search_space,
                             ...)

If you want to use searchable parameters, we recommend doing this by defining a subclass of GeneralSearchSpaceGenerator. For example, if we want the algorithm to search among 3 choices of the max_depth for xgboost:

from hypergbm import make_experiment
from hypergbm.search_space import GeneralSearchSpaceGenerator
from hypernets.core.search_space import Choice

class MySearchSpace(GeneralSearchSpaceGenerator):
    @property
    def default_xgb_init_kwargs(self):
        return {**super().default_xgb_init_kwargs,
                'max_depth': Choice([10, 20, 30]),
                }

my_search_space = MySearchSpace()
train_data = ...

experiment = make_experiment(train_data, 
                             search_space=my_search_space,
                             ...)

Support Machine Learning Models

HyperGBM already supports XGBoost, LightGBM, CatBoost, and HistGradientBoosting. They are included as components of the search space and are searched when training a model. Support for other machine learning algorithms can be added in 3 steps:

  • Encapsulate your algorithm as a subclass of HyperEstimator

  • Add the encapsulated algorithm to the search space and define its search parameters

  • Use your search space in make_experiment

Please see the following example:

from sklearn import svm

from hypergbm import make_experiment
from hypergbm.estimators import HyperEstimator
from hypergbm.search_space import GeneralSearchSpaceGenerator
from hypernets.core.search_space import Choice, Int, Real
from hypernets.tabular.datasets import dsutils


class SVMEstimator(HyperEstimator):
    def __init__(self, fit_kwargs, C=1.0, kernel='rbf', gamma='auto', degree=3, random_state=666, probability=True,
                 decision_function_shape=None, space=None, name=None, **kwargs):
        if C is not None:
            kwargs['C'] = C
        if kernel is not None:
            kwargs['kernel'] = kernel
        if gamma is not None:
            kwargs['gamma'] = gamma
        if degree is not None:
            kwargs['degree'] = degree
        if random_state is not None:
            kwargs['random_state'] = random_state
        if decision_function_shape is not None:
            kwargs['decision_function_shape'] = decision_function_shape
        kwargs['probability'] = probability
        HyperEstimator.__init__(self, fit_kwargs, space, name, **kwargs)

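    def _build_estimator(self, task, kwargs):
        if task == 'regression':
            # svm.SVR does not accept these classifier-only arguments
            classifier_only = ('probability', 'random_state', 'decision_function_shape')
            kwargs = {k: v for k, v in kwargs.items() if k not in classifier_only}
            hsvm = SVMRegressorWrapper(**kwargs)
        else:
            hsvm = SVMClassifierWrapper(**kwargs)
        hsvm.__dict__['task'] = task
        return hsvm


class SVMClassifierWrapper(svm.SVC):
    def fit(self, X, y=None, **kwargs):
        return super().fit(X, y)


class SVMRegressorWrapper(svm.SVR):
    def fit(self, X, y=None, **kwargs):
        return super().fit(X, y)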


class GeneralSearchSpaceGeneratorPlusSVM(GeneralSearchSpaceGenerator):
    def __init__(self, enable_svm=True, **kwargs):
        super(GeneralSearchSpaceGeneratorPlusSVM, self).__init__(**kwargs)
        self.enable_svm = enable_svm

    @property
    def default_svm_init_kwargs(self):
        return {
            'C': Real(0.1, 5, 0.1),
            'kernel': Choice(['rbf', 'poly', 'sigmoid']),
            'degree': Int(1, 5),
            'gamma': Real(0.0001, 5, 0.0002)
        }

    @property
    def default_svm_fit_kwargs(self):
        return {}

    @property
    def estimators(self):
        r = super().estimators
        if self.enable_svm:
            r['svm'] = (SVMEstimator, self.default_svm_init_kwargs, self.default_svm_fit_kwargs)
        return r


my_search_space = GeneralSearchSpaceGeneratorPlusSVM()

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class',
                             search_space=my_search_space)
estimator = experiment.run()
print(estimator)

GPU acceleration

To accelerate HyperGBM with NVIDIA GPU devices, you must install NVIDIA RAPIDS cuML and cuDF and enable GPU support for all estimators. See the Installation Guide for more details.

Accelerate the experiment

To accelerate the experiment with a GPU, load the datasets as cudf.DataFrame objects and pass them as the train_data, eval_data and test_data arguments of make_experiment. The utility will then set the experiment to run on the GPU device.

Example:

import cudf

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils


def train():
    train_data = cudf.from_pandas(dsutils.load_blood())

    experiment = make_experiment(train_data, target='Class')
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()

Outputs:

LocalizablePipeline(steps=[('data_clean',
                            DataCleanStep(cv=True,
                                          name='data_cle...
                            CumlGreedyEnsemble(weight=[...]))])

Note that the trained estimator is a LocalizablePipeline rather than a sklearn Pipeline. The LocalizablePipeline accepts a cudf DataFrame as input X for prediction. When you deploy the LocalizablePipeline in a production environment, you need to install the same software as in the training environment, including cuML, cuDF, etc.

If you want to deploy the trained estimator in an environment without cuML and cuDF, call estimator.as_local() to convert it into a sklearn Pipeline. An example:

import cudf

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils


def train():
    train_data = cudf.from_pandas(dsutils.load_blood())

    experiment = make_experiment(train_data, target='Class')
    estimator = experiment.run()
    print(estimator)

    print('-' * 20)
    estimator = estimator.as_local()
    print('localized estimator:\n', estimator)


if __name__ == '__main__':
    train()

Outputs:

LocalizablePipeline(steps=[('data_clean',
                            DataCleanStep(cv=True,
                                          name='data_cle...
                            CumlGreedyEnsemble(weight=[...]))])
--------------------
localized estimator:
 Pipeline(steps=[('data_clean',
                 DataCleanStep(cv=True,
                               name='data_clean')),
                ('est...
                 GreedyEnsemble(weight=[...]))])

Customize Search Space

When running an experiment on GPU, all Transformers and Estimators used in the search space need to support both pandas/numpy data types and cuDF/cupy data types. Users can define a new search space based on search_space_general and CumlGeneralSearchSpaceGenerator from hypergbm.cuml.

Example code:

import cudf

from hypergbm import make_experiment
from hypergbm.cuml import search_space_general
from hypernets.tabular.datasets import dsutils


def my_search_space():
    return search_space_general(n_estimators=100)


def train():
    train_data = cudf.from_pandas(dsutils.load_blood())

    experiment = make_experiment(train_data, target='Class', searcher='mcts', search_space=my_search_space)
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()

Distributed training

Quick Experiment

HyperGBM supports distributed training with Dask. Before training, the Dask cluster should be deployed and the Dask Client object should be initialized. Training data files with extensions such as csv and parquet can be passed to make_experiment directly by file path; make_experiment automatically loads the data as a Dask DataFrame if a Dask environment is detected.

Suppose that your training data file is '/opt/data/my_data.csv'. The following code shows how to load the data and run an experiment on a single node:

from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment


def train():
    cluster = LocalCluster(processes=True)
    client = Client(cluster)

    train_data = '/opt/data/my_data.csv'

    experiment = make_experiment(train_data, target='...')
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()

For large-scale data, we recommend splitting the data into multiple files and saving them in a single location such as '/opt/data/my_data' to speed up loading. One can then create an experiment with the split files:

from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment


def train():
    cluster = LocalCluster(processes=True)
    client = Client(cluster)

    train_data = '/opt/data/my_data/*.parquet'

    experiment = make_experiment(train_data, target='...')
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()
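Splitting a large file into parts can be done with any tool. The sketch below shows one way to do it for CSV input using only the standard library; the helper names and the part-file naming scheme are hypothetical, and each part repeats the header so that it can be read on its own.

```python
import csv
from pathlib import Path


def split_csv(source, out_dir, rows_per_file):
    """Split one CSV into numbered part files, repeating the header in each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with open(source, newline='') as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                parts.append(_write_part(out_dir, len(parts), header, chunk))
                chunk = []
        if chunk:  # write the remaining rows, if any
            parts.append(_write_part(out_dir, len(parts), header, chunk))
    return parts


def _write_part(out_dir, index, header, rows):
    """Write one part file and return its path."""
    path = out_dir / f'part-{index:05d}.csv'
    with open(path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return path
```

For example, split_csv('/opt/data/my_data.csv', '/opt/data/my_data', 1000000) would write the parts into '/opt/data/my_data'. Passing a pattern such as '/opt/data/my_data/*.csv' to make_experiment is assumed here to work in the same way as the parquet pattern above.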

Please also refer to the Create DataFrames section of the official Dask documentation for further details on how to use Dask DataFrame.

Define Search Space

When running an experiment in the Dask environment, the Transformers and Estimators used in the search space need to support Dask data types. Users can define a new search space based on the default Dask-enabled search space of HyperGBM.

Example code:

from dask import dataframe as dd
from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment
from hypergbm.dask import search_space_general
from hypernets.tabular.datasets import dsutils


def my_search_space():
    return search_space_general(n_estimators=100)


def train():
    cluster = LocalCluster(processes=False)
    client = Client(cluster)

    train_data = dd.from_pandas(dsutils.load_blood(), npartitions=1)

    experiment = make_experiment(train_data, target='Class', searcher='mcts', search_space=my_search_space)
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()


How-To

How to install shap on CentOS 7?

  1. Install system dependencies

    yum install epel-release centos-release-scl -y && yum clean all && yum makecache  # llvm9.0 is in epel, gcc9 in scl
    yum install -y llvm9.0 llvm9.0-devel python36-devel devtoolset-9-gcc devtoolset-9-gcc-c++ make cmake 
    
  2. Configure installing environment

    whereis llvm-config-9.0-64  # find your `llvm-config` path
    # llvm-config-9: /usr/bin/llvm-config-9.0-64
    
    export LLVM_CONFIG=/usr/bin/llvm-config-9.0-64  # set to your path
    scl enable devtoolset-9 bash
    
  3. Install shap

    pip3 -v install numpy==1.19.1  # prepare shap dependency
    pip3 -v install scikit-learn==0.23.1  # prepare shap dependency
    pip3 -v install shap==0.28.5
    

How to customize reward_metric in HyperGBM?

To customize a new reward_metric, do the following:

  1. Define a function (not a lambda) with the arguments y_true and y_preds

  2. Make a sklearn scorer with your function

  3. Call make_experiment with your reward_metric and scorer

Example code:

from sklearn.metrics import make_scorer, accuracy_score

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils


def foo(y_true, y_preds):
    return accuracy_score(y_true, y_preds)  # replace this line with yours

my_scorer = make_scorer(foo, greater_is_better=True, needs_proba=False)

train_data = dsutils.load_adult()
train_data.columns = [f'c{i}' for i in range(14)] + ['target']

exp = make_experiment(train_data.copy(), target='target',
                      reward_metric=foo,
                      scorer=my_scorer,
                      max_trials=3,
                      log_level='info')
estimator = exp.run()
print(estimator)
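As a further illustration of step 1, the sketch below defines a metric that is not a one-line wrapper: an F-beta score for binary labels written in plain Python. The function name, the beta value, and the pos_label default are choices made for this example; such a function could be passed to make_scorer in place of foo in the code above.

```python
def fbeta(y_true, y_preds, beta=2.0, pos_label=1):
    """F-beta score for binary labels; beta > 1 weights recall above precision."""
    pairs = list(zip(y_true, y_preds))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


print(fbeta([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```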

How to customize storage in HyperGBM?

HyperGBM stores intermediate data (model files, cache data, etc.) in the work_dir, which by default is a subdirectory of your system temporary directory.


  • Customize the work_dir location

  1. Create directory conf under the location where you start hypergbm or your python script

  2. Create file storage.py under the conf directory, with the content:

    c.StorageCfg.root = '/your/full/path/for/work_dir'
    
  3. Run hypergbm command or start your python script as normal


  • Use s3 compatible storage as HyperGBM work_dir

  1. Install s3fs

  2. Create directory conf under the location where you start hypergbm or your python script

  3. Create file storage.py under the conf directory, with the content:

    c.StorageCfg.kind = 's3'
    c.StorageCfg.root = '/bucket_name/some_path'
    c.StorageCfg.options = """
    {
            "anon": false,
            "client_kwargs": {
                    "endpoint_url": "your_service_address",
                    "aws_access_key_id": "your_access_key",
                    "aws_secret_access_key": "your_secret_access_key"
            }
    }
    """
    

Refer to s3fs for more installation and connection information.
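Because the options value is written as a string, a malformed entry is easy to miss. The sketch below checks such a string before it is placed in storage.py. It assumes the string is parsed as JSON, which matches the lowercase false in the example above but is not confirmed here; the helper name and the set of required keys are also assumptions taken from that example.

```python
import json

options_text = """
{
        "anon": false,
        "client_kwargs": {
                "endpoint_url": "your_service_address",
                "aws_access_key_id": "your_access_key",
                "aws_secret_access_key": "your_secret_access_key"
        }
}
"""


def check_storage_options(text):
    """Parse the options string and verify the keys used in the example above."""
    options = json.loads(text)  # raises json.JSONDecodeError on malformed input
    required = {'endpoint_url', 'aws_access_key_id', 'aws_secret_access_key'}
    missing = required - set(options.get('client_kwargs', {}))
    if missing:
        raise ValueError(f'missing client_kwargs keys: {sorted(missing)}')
    return options


print(check_storage_options(options_text)['anon'])  # False
```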

How to run HyperGBM on Kaggle?

We recommend installing HyperGBM with pip on Kaggle. See notebook_hypergbm_bank_marketing_kaggle as an example.

How to accelerate HyperGBM on kaggle with GPU?

Prerequisite

  • Make sure you have available GPU hours.

  • Enable GPU as your accelerator.

See notebook_hypergbm_bank_marketing_gpu_kaggle as an example.

Release Notes

Release history:

Version 0.2.5

This version adds the following new features:

  • Full pipeline GPU acceleration

    • Data adaption

    • Data cleaning

    • Feature selection

    • Data drift detection

    • Feature selection (2nd stage)

    • Pseudo labeling (2nd stage)

    • Optimization

      • Data preprocessing

      • Model fitting

    • Model ensemble

    • Metrics

  • Model training

    • Add TargetEncoder for categories

    • Set estimator eval_metric based on experiment reward_metric

  • Advanced Features

    • Data adaption in experiment

  • Experiment Visualization

    • Experiment configurations

    • Dataset information

    • Processing information

  • Multijob management

    • Serial and parallel job scheduling

    • Local and remote job execution

  • Export experiment report

Version 0.2.3

This version adds the following new features:

  • Data cleaning

    • Support automatically recognizing categorical columns among features with numerical datatypes

    • Support performing data cleaning with several specific columns reserved

  • Feature generation

    • Support datetime, text, and latitude and longitude features

    • Support distributed training

  • Modelling algorithms

    • XGBoost: Change distributed training from dask_xgboost to xgboost.dask to stay compatible with official XGBoost

    • LightGBM: Support distributed training on multiple machines

  • Model training

    • Support reproducing the searching process

    • Support searching with low fidelity

    • Predicting learning curves based on statistical information

    • Support hyperparameter optimizing without making modification

    • Time limit of EarlyStopping is now adjusted to the whole experiment life-cycle

    • Support defining pos_label

    • eval-set supports Dask dataset for distributed training

    • Optimizing the cache strategy for model training

  • Search algorithms

    • Add GridSearch algorithm

    • Add Playback algorithm

  • Advanced Features

    • Add feature selection with various strategies for the first stage

    • Feature selection for the second stage now supports more strategies

    • Pseudo-label supports various data selection strategies and multi-class classification

    • Optimizing performance of concept drift handling

    • Add cache mechanism during processing of advanced features

  • Visualization

    • Experiment information visualization

    • Training process visualization

  • Command Line tool

    • Most features of experiments for model training are now supported by command line tools

    • Support model evaluating

    • Support model predicting

Version 0.2.2

This version adds the following new features:

Feature engineering

  • Feature generating

  • Feature dimension reduction

Data cleaning

  • Missing characters handling

  • Column types correction

  • Constant columns cleaning

  • Duplicate columns cleaning

  • Deleting examples with missing targets

  • Replacing invalid values

  • id columns cleaning

Dataset splitting

  • Adversarial validation

Modelling algorithms

  • XGBoost

  • Catboost

  • LightGBM

  • HistGradientBoosting

Model training

  • Automatic task inferencing

  • Command line tools

Evaluation methods

  • Cross-Validation

  • Train-Validation-Holdout

Search Algorithms

  • Monte-Carlo Tree search

  • Evolution algorithms

  • Random search

Imbalanced data handling

  • Class Weight

  • Under-sampling

    • Near miss

    • Tomek links

    • Random

  • Over-sampling

    • SMOTE

    • ADASYN

    • Random

Early-stopping strategy

  • stopping after n times searching without improving

  • stopping after using a maximal time

  • stopping after achieving expected performance

Advanced Features

  • Two-stage search

    • Pseudo-label

    • Feature selection

  • Concept drift handling

  • Model ensemble

Indices and tables