DataCanvas

HyperGBM is an open source project created by DataCanvas.

Outline

About HyperGBM

HyperGBM is a Full-Pipeline Automated Machine Learning Tool with functions ranging from data cleaning, preprocessing and feature engineering to model selection and hyperparameter tuning. It is an advanced AutoML tool for tabular data.

While a lot of AutoML tools mainly focus on hyperparameter tuning of different algorithms, HyperGBM designs a high-level search space that covers almost all components of machine learning modelling, such as data cleaning and algorithm optimization. This end-to-end optimization approach is closer to an SDP (Sequential Decision Process). Therefore, combined with a meta-learner, HyperGBM adopts advanced algorithms such as reinforcement learning and Monte-Carlo tree search to solve the full-pipeline optimization problem more effectively. These strategies have proven effective in practice.

For the machine learning models, HyperGBM uses popular gradient boosting tree models including XGBoost, LightGBM, CatBoost and HistGradientBoosting. Besides, HyperGBM also incorporates many advanced features of CompeteExperiment from Hypernets for data cleaning, feature engineering and model ensemble.

The optimization algorithms, representations of search space and CompeteExperiment are based on Hypernets.

Features

There are three running modes for HyperGBM:

  • Single node: runs on a single machine using pandas and NumPy data types

  • Distributed with single node: runs on a single machine using Dask data types, which requires setting up a Dask cluster on that machine before using HyperGBM (see the sketch below)

  • Distributed with multiple nodes: runs on multiple machines using Dask data types, which requires setting up a Dask cluster across the machines to manage their resources before using HyperGBM
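For the two Dask modes, a Dask cluster and Client must exist before HyperGBM runs. A minimal single-machine sketch (see the Distributed training section for complete examples):

from dask.distributed import LocalCluster, Client

# start a local Dask cluster; for multi-node runs, connect the Client
# to an already-deployed distributed cluster instead
cluster = LocalCluster(processes=True)
client = Client(cluster)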

Feature support varies with the running mode (single machine, distributed with single node, distributed with multiple nodes). The full feature set is listed below:

Data cleaning

  • Empty character handling

  • Automatic column type recognition

  • Column type correction

  • Constant column cleaning

  • Duplicated column cleaning

  • Deleting examples without targets

  • Illegal character replacement

  • ID column cleaning

Dataset splitting

  • Splitting by ratio

  • Adversarial validation

Feature engineering

  • Feature generation

  • Feature dimension reduction

Data preprocessing

  • SimpleImputer

  • SafeOrdinalEncoder

  • SafeOneHotEncoder

  • TruncatedSVD

  • StandardScaler

  • MinMaxScaler

  • MaxAbsScaler

  • RobustScaler

Imbalanced data handling

  • ClassWeight

  • UnderSampling (NearMiss, TomekLinks, Random)

  • OverSampling (SMOTE, ADASYN, Random)

Search algorithms

  • MCTS

  • Evolution

  • Random search

  • Playback

Early stopping

  • Time limit

  • No improvement after n trials

  • Expected reward reached

  • Trial discriminator

Modeling algorithms

  • XGBoost

  • LightGBM

  • CatBoost

  • HistGradientBoosting

Evaluation

  • Cross-Validation

  • Train-Validation-Holdout

Advanced features

  • Automatic task type inference

  • Collinearity detection

  • Data drift detection

  • Feature selection

  • Feature selection (two-stage)

  • Pseudo labeling (two-stage)

  • Pre-searching with UnderSampling

  • Model ensemble

Installing HyperGBM

We recommend installing HyperGBM with pip. Installing and using HyperGBM in a Docker container is also possible if you have a Docker environment.

pip

Python version 3.6 or above is required before installing HyperGBM. Install it with pip as follows:

pip install hypergbm

Note that this installs HyperGBM and its required dependencies only. To install HyperGBM together with optional dependencies such as shap, the following way is recommended:

pip install hypergbm[all]

Docker

It is possible to use HyperGBM in a Docker container. To do this, users can install HyperGBM with pip in the Dockerfile. We also publish an image on Docker Hub which can be pulled directly and includes the following components:

  • Python 3.7

  • HyperGBM and its dependencies

  • JupyterLab

Pull the image:

docker pull datacanvas/hypergbm

Run the image:

docker run -ti -e NotebookToken="your-token" -p 8888:8888 datacanvas/hypergbm

Then one can visit http://<your-ip>:8888 in the browser and type in the token set above to start.

Quick Start

This section introduces the main features of HyperGBM, assuming that you are familiar with basic machine learning tasks such as loading data and training a model. If you have not completed installation, please refer to [Installation](installation.md) to install HyperGBM. You can use HyperGBM with command line tools as well as Python.

Use HyperGBM with Python

HyperGBM is developed with Python. We recommend using the Python tool make_experiment to create an experiment and train the model.

The basic steps for training a model with make_experiment are as follows:

  • Prepare the dataset (pandas or Dask DataFrame)

  • Create an experiment with make_experiment

  • Call the experiment's .run() method to perform training and get the model

  • Predict with the trained model, or save it with the Python tool pickle
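Condensed, these steps look like the following sketch (the dataset and parameters match the subsections below):

import pickle

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

from hypergbm import make_experiment

# step 1: prepare the dataset
X, y = datasets.load_breast_cancer(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=335)
train_data = pd.concat([X_train, y_train], axis=1)

# step 2: create the experiment
experiment = make_experiment(train_data, target='target')

# step 3: run it to get the trained model
estimator = experiment.run()

# step 4: predict with the trained model, or save it for later use
y_pred = estimator.predict(X_test)
with open('model.pkl', 'wb') as f:
    pickle.dump(estimator, f)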

Prepare the dataset

Depending on the running mode, the data can be loaded with either pandas or Dask to get a DataFrame for training the model.

Taking the sklearn dataset breast_cancer as an example, one can prepare the dataset as follows:

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=335)
train_data = pd.concat([X_train, y_train], axis=1)

where train_data is used for model training, while X_test and y_test are used for evaluating the model.

Create experiment with make_experiment

Users can create an experiment for the prepared dataset and start training the model as follows:

from hypergbm import make_experiment


experiment = make_experiment(train_data, target='target', reward_metric='precision')
estimator = experiment.run()

where estimator is the trained model.

Save the model

It is recommended to save the model with pickle:

import pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(estimator, f)
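To load the saved model back later for prediction (a minimal sketch):

import pickle

# restore the trained pipeline from disk
with open('model.pkl', 'rb') as f:
    estimator = pickle.load(f)

y_pred = estimator.predict(X_test)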

Evaluate the model

The model can be evaluated with tools provided by sklearn:

from sklearn.metrics import classification_report

y_pred = estimator.predict(X_test)
print(classification_report(y_test, y_pred, digits=5))

output:

              precision    recall  f1-score   support

           0    0.96429   0.93103   0.94737        58
           1    0.96522   0.98230   0.97368       113

    accuracy                        0.96491       171
   macro avg    0.96475   0.95667   0.96053       171
weighted avg    0.96490   0.96491   0.96476       171

More info:

Please refer to the docstring of make_experiment for more information about it:

print(make_experiment.__doc__)

If you are using Notebook or IPython, the following code can provide more information about make_experiment:

make_experiment?

Use HyperGBM with Command Line

HyperGBM offers command line tool hypergbm to perform model training, evaluation and prediction. The following code enables the user to view command line help:

hypergbm -h

usage: hypergbm [-h] [--log-level LOG_LEVEL] [-error] [-warn] [-info] [-debug]
                [--verbose VERBOSE] [-v] [--enable-dask ENABLE_DASK] [-dask]
                [--overload OVERLOAD]
                {train,evaluate,predict} ...

hypergbm offers three commands: train, evaluate and predict. To get more information, one can use hypergbm <command> -h:

hypergbm train -h
usage: hypergbm train [-h] --train-data TRAIN_DATA [--eval-data EVAL_DATA]
                      [--test-data TEST_DATA]
                      [--train-test-split-strategy {None,adversarial_validation}]
                      [--target TARGET]
                      [--task {binary,multiclass,regression}]
                      [--max-trials MAX_TRIALS] [--reward-metric METRIC]
                      [--cv CV] [-cv] [-cv-] [--cv-num-folds NUM_FOLDS]
                      [--pos-label POS_LABEL]
                      ...

Prepare the Data

When training a model with the command line tool, the training data must be stored in a CSV or parquet file. The trained model is saved in pickle format to a file ending with .pkl.

For an example of training Bank Marketing data, one can prepare the data as follows:

from hypernets.tabular.datasets import dsutils
from sklearn.model_selection import train_test_split

df = dsutils.load_bank().head(10000)
df_train, df_test = train_test_split(df, test_size=0.3, random_state=9527)
df_train.to_csv('bank_train.csv', index=None)
df_test.to_csv('bank_eval.csv', index=None)

df_test.pop('y')
df_test.to_csv('bank_to_pred.csv', index=None)

where

  • bank_train.csv is used for training

  • bank_eval.csv is used for evaluating the model

  • bank_to_pred.csv is data without targets for predicting

Train the Model

After preparing the data, one can perform model training with the command line tool:

hypergbm train --train-data bank_train.csv --target y --model-file model.pkl

One will see model.pkl after this process:

ls -l model.pkl

rw-rw-r-- 1 xx xx 9154959    17:09 model.pkl

Evaluate the Model

The trained model can be evaluated with the evaluation data:

hypergbm evaluate --model model.pkl --data bank_eval.csv --metric f1 recall auc

{'f1': 0.7993779160186626, 'recall': 0.7099447513812155, 'auc': 0.9705420982746849}

Predict the Test Data

The trained model can be used for making predictions on new data as follows:

hypergbm predict --model model.pkl --data bank_to_pred.csv --output bank_output.csv

where the predicting result will be saved to bank_output.csv.

To include columns from the input data in the output file, use the parameter --with-data explicitly:

hypergbm predict --model model.pkl --data bank_to_pred.csv --output bank_output.csv --with-data id
head bank_output.csv

id,y
1563,no
124,no
218,no
463,no
...

Furthermore, all columns of the input data can be written alongside the prediction results to bank_output.csv by setting --with-data to '*':

hypergbm predict --model model.pkl --data bank_to_pred.csv --output bank_output.csv --with-data '*'
head bank_output.csv

id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
1563,55,entrepreneur,married,secondary,no,204,no,no,cellular,14,jul,455,13,-1,0,unknown,no
124,51,management,single,tertiary,yes,-55,yes,no,cellular,11,may,281,2,266,6,failure,no
218,49,blue-collar,married,primary,no,305,yes,yes,telephone,10,jul,834,10,-1,0,unknown,no
463,35,blue-collar,divorced,secondary,no,3102,yes,no,cellular,20,nov,138,1,-1,0,unknown,no
2058,50,management,divorced,tertiary,no,201,yes,no,cellular,24,jul,248,1,-1,0,unknown,no
...

Examples

Basic Applications

In this section, we provide an example that shows how to train a model using the experiment. We use the blood dataset, loaded from hypernets.tabular. The dataset looks as follows:

Recency,Frequency,Monetary,Time,Class
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
2,20,5000,45,1
1,24,6000,77,0
4,4,1000,4,0

...

Create and Run an Experiment

The tool make_experiment creates an executable experiment object. Its only required parameter is train_data. Simply calling the run method of the created experiment object starts training and returns a model. Note that if the target column of the data is not named y, it must be set explicitly through the parameter target.

An example code:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class')
estimator = experiment.run()
print(estimator)

output:

Pipeline(steps=[('data_clean',
                 DataCleanStep(...),
                ('estimator',
                 GreedyEnsemble(...)])

Training returns a Pipeline whose final step is an ensemble of multiple models.

For training data stored in a file with extension .csv or .parquet, the experiment can be created by passing the file path directly, and make_experiment will load the data as a DataFrame automatically. For example:

from hypergbm import make_experiment

train_data = '/path/to/mydata.csv'
experiment = make_experiment(train_data, target='my_target')
estimator = experiment.run()
print(estimator)

Set the Number of Search Trials

One can set the maximum number of search trials by adjusting max_trials.

The following code sets the maximum number of search trials to 100:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', max_trials=100)
estimator = experiment.run()
print(estimator)

Use Cross Validation

Users can control cross validation in the experiment through the parameter cv. Setting cv=False disables cross validation, and train_test_split is applied instead. When cv=True, the experiment uses cross validation, and the number of folds can be adjusted through the parameter num_folds (default 3).

Example code when cv=True:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=True, num_folds=5)
estimator = experiment.run()
print(estimator)

Evaluation dataset

When cv=False, model training additionally requires an evaluation dataset to assess performance. This can be provided through eval_data when calling make_experiment. For example:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
from sklearn.model_selection import train_test_split

train_data = dsutils.load_blood()
train_data, eval_data = train_test_split(train_data, test_size=0.3)
experiment = make_experiment(train_data, target='Class', eval_data=eval_data, cv=False)
estimator = experiment.run()
print(estimator)

If eval_data is not given, the experiment object splits train_data to obtain an evaluation dataset, whose size can be adjusted by setting eval_size:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', cv=False, eval_size=0.2)
estimator = experiment.run()
print(estimator)

Set the Evaluation Criterion

When creating an experiment with make_experiment, the default evaluation metric is accuracy for classification tasks and rmse for regression tasks. Other metrics can be used by setting reward_metric. For example:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', reward_metric='auc')
estimator = experiment.run()
print(estimator)

Set the Early Stopping Strategy

One can configure the early stopping strategy through the parameters early_stopping_round, early_stopping_time_limit and early_stopping_reward.

The following code sets the maximum search time to 3 hours:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', max_trials=300, early_stopping_time_limit=3600 * 3)
estimator = experiment.run()
print(estimator)
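The other strategies work the same way. For instance, a sketch that stops searching after 10 consecutive trials without improvement or once the reward reaches 0.95 (parameter names as listed above; the concrete values are illustrative):

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', max_trials=300,
                             early_stopping_round=10,      # stop after 10 trials without improvement
                             early_stopping_reward=0.95)   # stop once the reward metric reaches 0.95
estimator = experiment.run()
print(estimator)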

Choose a Searcher

HyperGBM performs hyperparameter search with the search algorithms provided by Hypernets, which include EvolutionSearcher, MCTSSearcher and RandomSearcher. One can choose a specific searcher when using make_experiment by setting the parameter searcher:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', searcher='random')
estimator = experiment.run()
print(estimator)

Furthermore, you can create a searcher object yourself and pass it to the experiment. For example:

from hypergbm import make_experiment
from hypergbm.search_space import search_space_general
from hypernets.searchers import MCTSSearcher
from hypernets.tabular.datasets import dsutils

my_searcher = MCTSSearcher(lambda: search_space_general(n_estimators=100),
                           max_node_space=20,
                           optimize_direction='max')

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', searcher=my_searcher)
estimator = experiment.run()
print(estimator)

Ensemble Models

make_experiment automatically turns on model ensembling to obtain a better model. By default it ensembles the 20 best models; this number can be changed by setting ensemble_size as in the following code, where ensemble_size=0 means no ensembling will be performed.

train_data = ...
experiment = make_experiment(train_data, ensemble_size=10, ...)
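A runnable variant using the blood dataset from the earlier examples:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', ensemble_size=10)
estimator = experiment.run()
print(estimator)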

Change the log level

Progress messages during training can be shown by setting log_level (str or int). Please refer to Python's logging package for details about log levels. In addition, more detailed messages are shown when verbose is set to 1.

The following code sets the log level to 'INFO':

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', log_level='INFO', verbose=1)
estimator = experiment.run()
print(estimator)

Output:

14:24:33 I hypernets.tabular.u._common.py 30 - 2 class detected, {0, 1}, so inferred as a [binary classification] task
14:24:33 I hypergbm.experiment.py 699 - create experiment with ['data_clean', 'drift_detection', 'space_search', 'final_ensemble']
14:24:33 I hypergbm.experiment.py 1262 - make_experiment with train data:(748, 4), test data:None, eval data:None, target:Class
14:24:33 I hypergbm.experiment.py 716 - fit_transform data_clean
14:24:33 I hypergbm.experiment.py 716 - fit_transform drift_detection
14:24:33 I hypergbm.experiment.py 716 - fit_transform space_search
14:24:33 I hypernets.c.meta_learner.py 22 - Initialize Meta Learner: dataset_id:7123e0d8c8bbbac8797ed9e42352dc59
14:24:33 I hypernets.c.callbacks.py 192 - 
Trial No:1
--------------------------------------------------------------
(0) estimator_options.hp_or:                                0
(1) numeric_imputer_0.strategy:                 most_frequent
(2) numeric_scaler_optional_0.hp_opt:                    True


...

14:24:35 I hypergbm.experiment.py 716 - fit_transform final_ensemble
14:24:35 I hypergbm.experiment.py 737 - trained experiment pipeline: ['data_clean', 'estimator']
Pipeline(steps=[('data_clean',
                 DataCleanStep(...),
                ('estimator',
                 GreedyEnsemble(...)

Advanced applications

HyperGBM's make_experiment creates an instance of CompeteExperiment from Hypernets. CompeteExperiment provides many advanced features, which are covered in this section.

flowchart LR
    dc[Data<br/>Cleaning]
    fg[Feature generation]
    cd[Collinearity detection]
    dd[Drift detection]
    fs[Feature selection]
    s1[Search optimization]
    pi[2nd-stage<br/>Feature<br/>selection]
    pl[Pseudo label]
    s2[2nd-stage<br/>search optimization]
    em[Model<br/>ensemble]
    op2[op]
    subgraph 1st-stage
        direction LR
        subgraph op
            direction TB
            cd --> dd
            dd --> fs
        end
        fg --> op
        op --> s1
    end
    subgraph 2nd-stage
        direction LR
        subgraph op2
            direction TB
            pi --> pl
        end
        op2 --> s2
    end
    dc --> 1st-stage --> 2nd-stage --> em
    style 2nd-stage stroke:#6666,stroke-width:2px,stroke-dasharray: 5, 5;

Data cleaning

The first step of CompeteExperiment is to perform data cleaning with DataCleaner from Hypernets. This step cannot be disabled, but it can be adjusted through the following DataCleaner parameters:

  • nan_chars: value or list, (default None), characters to replace with np.nan

  • correct_object_dtype: bool, (default True), whether to correct the data types

  • drop_constant_columns: bool, (default True), whether to drop constant columns

  • drop_duplicated_columns: bool, (default False), whether to drop duplicated columns

  • drop_idness_columns: bool, (default True), whether to drop ID-like columns

  • drop_label_nan_rows: bool, (default True), whether to drop rows whose target value is np.nan

  • replace_inf_values: (default np.nan), the value with which to replace np.inf values

  • drop_columns: list, (default None), which columns to drop

  • reserve_columns: list, (default None), which columns to preserve during data cleaning

  • reduce_mem_usage: bool, (default False), whether to try to reduce the memory usage

  • int_convert_to: str, (default 'float'), the type to convert int columns to; None for no conversion

If NaN is represented by '\N' in the data, users can replace '\N' with np.nan during data cleaning as follows:

from hypergbm import make_experiment

train_data = ...
experiment = make_experiment(train_data, target='...',
                            data_cleaner_args={'nan_chars': r'\N'})
...
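Other options are passed the same way. For example, a sketch that drops one column and preserves another during cleaning (the column names here are hypothetical):

from hypergbm import make_experiment

train_data = ...
experiment = make_experiment(train_data, target='...',
                             data_cleaner_args={'drop_columns': ['comment_text'],      # hypothetical column to drop
                                                'reserve_columns': ['customer_id']})   # hypothetical column to keep
...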

Feature generation

CompeteExperiment is capable of performing feature generation, which can be turned on by setting feature_generation=True when creating the experiment with make_experiment. There are several options:

  • feature_generation_continuous_cols: list (default None), continuous feature columns, inferred automatically when None.

  • feature_generation_categories_cols: list (default None), categorical feature columns; these must be set explicitly, as CompeteExperiment cannot infer them automatically.

  • feature_generation_datetime_cols: list (default None), datetime feature columns, inferred automatically when None.

  • feature_generation_latlong_cols: list (default None), latitude/longitude feature columns, inferred automatically when None.

  • feature_generation_text_cols: list (default None), text feature columns, inferred automatically when None.

  • feature_generation_trans_primitives: list (default None), transformations used for feature generation, inferred automatically when None.

When feature_generation_trans_primitives=None, CompeteExperiment automatically infers the transform primitives from the enabled column types. Specifically, different transformations are adopted for different types:

  • continuous_cols: none; transformations must be set explicitly.

  • categories_cols: cross_categorical.

  • datetime_cols: month, week, day, hour, minute, second, weekday, is_weekend.

  • latlong_cols: haversine, geohash.

  • text_cols: tfidf.

An example code for enabling feature generation:

from hypergbm import make_experiment

train_data = ...
experiment = make_experiment(train_data,
                           feature_generation=True,
                           ...)
...
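The column lists and transform primitives can also be set explicitly rather than inferred. A sketch (the column name is hypothetical; the primitives are from the defaults listed above):

from hypergbm import make_experiment

train_data = ...
experiment = make_experiment(train_data,
                             feature_generation=True,
                             feature_generation_datetime_cols=['order_time'],          # hypothetical datetime column
                             feature_generation_trans_primitives=['hour', 'weekday'],  # primitives from the defaults above
                             ...)
...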

Please refer to [featuretools](https://docs.featuretools.com/) for more information.

Collinearity detection

Datasets often contain highly correlated features, which provide little extra information but behave more like noise; worse, the model becomes more sensitive to drift in such features.

CompeteExperiment can handle these collinear features. This is enabled simply by setting collinearity_detection=True when creating the experiment.

Example code for using collinearity detection:

from hypergbm import make_experiment

train_data = ...
experiment = make_experiment(train_data, target='...', collinearity_detection=True)
...

Drift detection

Concept drift is one of the major challenges in machine learning: models often perform worse in practice because data distributions change over time. To handle this problem, CompeteExperiment adopts adversarial validation to detect drifted features and drops them to maintain good performance.

To enable drift detection, one needs to set drift_detection=True when creating the experiment and provide test_data.

Relevant parameters:

  • drift_detection_remove_shift_variable : bool, (default=True), whether to detect the stability of every column first.

  • drift_detection_variable_shift_threshold : float, (default=0.7), columns with stability scores higher than this value will be dropped.

  • drift_detection_threshold : float, (default=0.7), features with drift scores higher than this value will be dropped.

  • drift_detection_remove_size : float, (default=0.1), ratio of columns to be dropped.

  • drift_detection_min_features : int, (default=10), the minimal number of columns to be reserved.

  • drift_detection_num_folds : int, (default=5), the number of folds for cross validation.

An example:

from io import StringIO
import pandas as pd
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

test_data = """
Recency,Frequency,Monetary,Time
2,10,2500,64
4,5,1250,23
4,9,2250,46
4,5,1250,23
4,8,2000,40
2,12,3000,82
11,24,6000,64
2,7,1750,46
4,11,2750,61
1,7,1750,57
2,11,2750,79
2,3,750,16
4,5,1250,26
2,6,1500,41
"""

train_data = dsutils.load_blood()
test_df = pd.read_csv(StringIO(test_data))
experiment = make_experiment(train_data, test_data=test_df,
                             drift_detection=True,
                             ...)

...

Feature selection

CompeteExperiment evaluates feature importances by training a preliminary model, then keeps only the most important features for the subsequent model training.

To enable feature selection, one needs to set feature_selection=True when creating experiment. Relevant parameters:

  • feature_selection_strategy: str, the selection strategy (default threshold), chosen from threshold, number and quantile.

  • feature_selection_threshold: float, (default 0.1), the threshold when the strategy is threshold; features with scores higher than this threshold will be selected.

  • feature_selection_quantile: float, (default 0.2), the quantile when the strategy is quantile; features with scores higher than this quantile will be selected.

  • feature_selection_number: int or float, (default 0.8), the number of features to select when the strategy is number.

An example code:

from hypergbm import make_experiment

train_data=...
experiment = make_experiment(train_data,
                             feature_selection=True,
                             feature_selection_strategy='quantile',
                             feature_selection_quantile=0.3,
                             ...)

The second stage feature selection

CompeteExperiment supports continuing data processing with the trained models, which is called two-stage search. CompeteExperiment supports two kinds of two-stage processing: two-stage feature selection and pseudo labeling, covered in the rest of this section.

In CompeteExperiment, second-stage feature selection takes the best-performing models from the first stage and uses permutation_importance to re-evaluate the features and select better ones.

To enable second-stage feature selection, one needs to set feature_reselection=True when creating the experiment. Relevant parameters:

  • feature_reselection_estimator_size: int, (default=10), the number of models used to evaluate feature importance (the top n models from the first stage).

  • feature_reselection_strategy: str, the selection strategy (default threshold), chosen from threshold, number and quantile.

  • feature_reselection_threshold: float, (default 1e-5), the threshold when the strategy is threshold; features with importance scores higher than this value will be chosen.

  • feature_reselection_quantile: float, (default 0.2), the quantile when the strategy is quantile; features with importance scores higher than this quantile will be chosen.

  • feature_reselection_number: int or float, (default 0.8), the number of features to select when the strategy is number.

An example code:

from hypergbm import make_experiment

train_data=...
experiment = make_experiment(train_data,
                             feature_reselection=True,
                             ...)

Please refer to [scikit-learn](https://scikit-learn.org/stable/modules/permutation_importance.html) for more information about permutation_importance.

Pseudo label

Pseudo labeling is a semi-supervised learning technique. It assigns labels predicted by the first-stage model to some samples of the test data; samples whose prediction confidence exceeds a threshold are then added to the training set and the model is trained again.

To enable pseudo labeling, one needs to set pseudo_labeling=True when creating the experiment. Relevant parameters:

  • pseudo_labeling_strategy: str, the selection strategy (default threshold), chosen from threshold, number and quantile.

  • pseudo_labeling_proba_threshold: float (default 0.8), the threshold when the strategy is threshold; samples with confidence scores higher than this value will be chosen.

  • pseudo_labeling_proba_quantile: float (default 0.8), the quantile when the strategy is quantile; samples with confidence scores higher than this quantile will be chosen.

  • pseudo_labeling_sample_number: float (0.0~1.0) or int (default 0.2), the number of top samples to select when the strategy is number.

  • pseudo_labeling_resplit: bool (default=False), whether to re-split the training and validation sets after adding pseudo-labeled samples. If False, all pseudo-labeled samples are added to the training set directly; otherwise, the experiment re-splits the combined dataset into new training and validation sets.

An example code:

from hypergbm import make_experiment

train_data=...
test_data=...
experiment = make_experiment(train_data,
                             test_data=test_data,
                             pseudo_labeling=True,
                             ...)

Note: pseudo labeling is only available for classification tasks.

Handling Imbalanced Data

Imbalanced data is one of the most frequently encountered challenges in practice and usually leads to unsatisfactory models. To alleviate this problem, HyperGBM supports two kinds of solutions:

Adopt ClassWeight

When building a model such as LightGBM, HyperGBM first calculates the class distribution of the data and assigns different weights to the classes accordingly when computing the loss. To enable the ClassWeight algorithm, simply set class_balancing='ClassWeight' when using make_experiment:

from hypergbm import make_experiment

train_data = ...
experiment = make_experiment(train_data,
                             class_balancing='ClassWeight',
                             ...)

UnderSampling and OverSampling

The most common approach to handling data imbalance is to modify the data distribution to obtain a more balanced dataset and then train the model on it. HyperGBM currently supports several resampling strategies, including RandomOverSampler, SMOTE, ADASYN, RandomUnderSampler, NearMiss, TomekLinks and EditedNearestNeighbours. To enable one of them, set class_balancing='<selected strategy>' when using make_experiment, as in the following example:

from hypergbm import make_experiment

train_data = ...
experiment = make_experiment(train_data,
                             class_balancing='SMOTE',
                             ...)

For more information regarding these sampling methods, please see imbalanced-learn.

Search Space

When not defined explicitly, make_experiment uses search_space_general as its search space, which is defined as follows:

search_space_general = GeneralSearchSpaceGenerator(n_estimators=200)

Define Search Space

To use a customized search space, set the parameter search_space when calling make_experiment. For example, to fix max_depth to 20 for XGBoost:

from hypergbm import make_experiment
from hypergbm.search_space import GeneralSearchSpaceGenerator

my_search_space = \
    GeneralSearchSpaceGenerator(n_estimators=200, xgb_init_kwargs={'max_depth': 20})

train_data = ...

experiment = make_experiment(train_data,
                             search_space=my_search_space,
                             ...)

If you want to use searchable parameters, we recommend defining a subclass of GeneralSearchSpaceGenerator. For example, to let the algorithm search among 3 choices of max_depth for XGBoost:

from hypergbm import make_experiment
from hypergbm.search_space import GeneralSearchSpaceGenerator
from hypernets.core.search_space import Choice

class MySearchSpace(GeneralSearchSpaceGenerator):
    @property
    def default_xgb_init_kwargs(self):
        return { **super().default_xgb_init_kwargs,
                'max_depth': Choice([10, 20 ,30]),
        }

my_search_space = MySearchSpace()
train_data = ...

experiment = make_experiment(train_data, 
                             search_space=my_search_space,
                             ...)

Supporting Other Machine Learning Models

HyperGBM already supports XGBoost, LightGBM, CatBoost and HistGradientBoosting, which are taken as components of the search space to be searched for training a model. Support for other machine learning algorithms can be added in 3 steps:

  • Encapsulate your algorithm as a subclass of HyperEstimator

  • Add the encapsulated algorithm to the search space and define its search parameters

  • Use your search space in make_experiment

Please see the following example:

from sklearn import svm

from hypergbm import make_experiment
from hypergbm.estimators import HyperEstimator
from hypergbm.search_space import GeneralSearchSpaceGenerator
from hypernets.core.search_space import Choice, Int, Real
from hypernets.tabular.datasets import dsutils


class SVMEstimator(HyperEstimator):
    def __init__(self, fit_kwargs, C=1.0, kernel='rbf', gamma='auto', degree=3, random_state=666, probability=True,
                 decision_function_shape=None, space=None, name=None, **kwargs):
        if C is not None:
            kwargs['C'] = C
        if kernel is not None:
            kwargs['kernel'] = kernel
        if gamma is not None:
            kwargs['gamma'] = gamma
        if degree is not None:
            kwargs['degree'] = degree
        if random_state is not None:
            kwargs['random_state'] = random_state
        if decision_function_shape is not None:
            kwargs['decision_function_shape'] = decision_function_shape
        kwargs['probability'] = probability
        HyperEstimator.__init__(self, fit_kwargs, space, name, **kwargs)

    def _build_estimator(self, task, kwargs):
        if task == 'regression':
            hsvm = SVMRegressorWrapper(**kwargs)
        else:
            hsvm = SVMClassifierWrapper(**kwargs)
        hsvm.__dict__['task'] = task
        return hsvm


class SVMClassifierWrapper(svm.SVC):
    def fit(self, X, y=None, **kwargs):
        return super().fit(X, y)


class SVMRegressorWrapper(svm.SVC):
    def fit(self, X, y=None, **kwargs):
        return super().fit(X, y)


class GeneralSearchSpaceGeneratorPlusSVM(GeneralSearchSpaceGenerator):
    def __init__(self, enable_svm=True, **kwargs):
        super(GeneralSearchSpaceGeneratorPlusSVM, self).__init__(**kwargs)
        self.enable_svm = enable_svm

    @property
    def default_svm_init_kwargs(self):
        return {
            'C': Real(0.1, 5, 0.1),
            'kernel': Choice(['rbf', 'poly', 'sigmoid']),
            'degree': Int(1, 5),
            'gamma': Real(0.0001, 5, 0.0002)
        }

    @property
    def default_svm_fit_kwargs(self):
        return {}

    @property
    def estimators(self):
        r = super().estimators
        if self.enable_svm:
            r['svm'] = (SVMEstimator, self.default_svm_init_kwargs, self.default_svm_fit_kwargs)
        return r


my_search_space = GeneralSearchSpaceGeneratorPlusSVM()

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class',
                             search_space=my_search_space)
estimator = experiment.run()
print(estimator)

Distributed training

Quick Experiment

HyperGBM supports distributed training with Dask. Before training, a Dask cluster should be deployed and a Dask Client object initialized. Training data files with extensions such as csv and parquet can be passed to make_experiment directly by file path, and make_experiment will automatically load the data as a Dask DataFrame if a Dask environment is detected.

Suppose your training data file is '/opt/data/my_data.csv'; the following code shows how to train on a single node:

from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils


def train():
    cluster = LocalCluster(processes=True)
    client = Client(cluster)

    train_data = '/opt/data/my_data.csv'

    experiment = make_experiment(train_data, target='...')
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()

For large-scale data, we recommend splitting the data into multiple files and saving them in a single location such as '/opt/data/my_data' to speed up loading. After doing this, one can create an experiment with the split files:

from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils


def train():
    cluster = LocalCluster(processes=True)
    client = Client(cluster)

    train_data = '/opt/data/my_data/*.parquet'

    experiment = make_experiment(train_data, target='...')
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()

Please also refer to the official Dask documentation on creating DataFrames for further details on how to use Dask DataFrames.

Define Search Space

When running an experiment in a Dask environment, the Transformers and Estimators used in the search space need to support Dask data types. Users can define a new search space based on HyperGBM's default Dask-enabled search space.

An example code:

from dask import dataframe as dd
from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment
from hypergbm.dask.search_space import search_space_general
from hypernets.tabular.datasets import dsutils


def my_search_space():
    return search_space_general(n_estimators=100)


def train():
    cluster = LocalCluster(processes=False)
    client = Client(cluster)

    train_data = dd.from_pandas(dsutils.load_blood(), npartitions=1)

    experiment = make_experiment(train_data, target='Class', searcher='mcts', search_space=my_search_space)
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()


How-To

How to install shap on CentOS 7?

  1. Install system dependencies

    yum install epel-release centos-release-scl -y  && yum clean all && yum makecache  # llvm9.0 is in epel, gcc9 in scl
    yum install -y llvm9.0 llvm9.0-devel python36-devel devtoolset-9-gcc devtoolset-9-gcc-c++ make cmake
    
  2. Configure the build environment

    whereis llvm-config-9.0-64  # find your `llvm-config` path
    # llvm-config-9: /usr/bin/llvm-config-9.0-64
    
    export LLVM_CONFIG=/usr/bin/llvm-config-9.0-64  # set to your path
    scl enable devtoolset-9 bash
    
  3. Install shap

    pip3 -v install numpy==1.19.1  # prepare shap dependency
    pip3 -v install scikit-learn==0.23.1  # prepare shap dependency
    pip3 -v install shap==0.28.5
    

Release Notes

Release history:

Version 0.2.3

We add the following new features to this version:

  • Data cleaning

    • Support automatically recognizing categorical columns among features with numerical datatypes

    • Support performing data cleaning with several specific columns reserved

  • Feature generation

    • Support datetime, text and latitude/longitude features

    • Support distributed training

  • Modelling algorithms

    • XGBoost: changed distributed training from dask_xgboost to xgboost.dask, consistent with the official XGBoost package

    • LightGBM: support distributed training across multiple machines

  • Model training

    • Support reproducing the searching process

    • Support searching with low fidelity

    • Predicting learning curves based on statistical information

    • Support hyperparameter optimizing without making modification

    • Time limit of EarlyStopping is now adjusted to the whole experiment life-cycle

    • Support defining pos_label

    • eval-set supports Dask dataset for distributed training

    • Optimizing the cache strategy for model training

  • Search algorithms

    • Add GridSearch algorithm

    • Add Playback algorithm

  • Advanced Features

    • Add feature selection with various strategies for the first stage

    • Feature selection for the second stage now supports more strategies

    • Pseudo-label supports various data selection strategies and multi-class classification

    • Optimized performance of concept drift handling

    • Add cache mechanism during processing of advanced features

  • Visualization

    • Experiment information visualization

    • Training process visualization

  • Command Line tool

    • Most features of experiments for model training are now supported by command line tools

    • Support model evaluating

    • Support model predicting

Version 0.2.2

We add the following new features to this version:

Feature engineering

  • Feature generating

  • Feature dimension reduction

Data cleaning

  • Missing characters handling

  • Column types correction

  • Constant columns cleaning

  • Repeated columns cleaning

  • Deleting examples with missing targets

  • Replacing invalid values

  • id columns cleaning

Dataset splitting

  • Adversarial validation

Modelling algorithms

  • XGBoost

  • Catboost

  • LightGBM

  • HistGradientBoosting

Model training

  • Automatic task inferencing

  • Command line tools

Evaluation methods

  • Cross-Validation

  • Train-Validation-Holdout

Search Algorithms

  • Monte-Carlo Tree search

  • Evolution algorithms

  • Random search

Imbalanced data handling

  • Class Weight

  • Under-sampling

    • Near miss

    • Tomek's links

    • Random

  • Over-sampling

    • SMOTE

    • ADASYN

    • Random

Early-stopping strategy

  • stopping after n search trials without improvement

  • stopping after a maximum time has elapsed

  • stopping after achieving expected performance

Advanced Features

  • Two-stage search

    • Pseudo-label

    • Feature selection

  • Concept drift handling

  • Model ensemble
