DataCanvas
HyperGBM is an open source project created by DataCanvas.
About HyperGBM
HyperGBM is a full-pipeline automated machine learning (AutoML) tool for tabular data. It covers the whole workflow, from data cleaning, preprocessing and feature engineering to model selection and hyperparameter tuning.
Many AutoML tools focus mainly on tuning the hyperparameters of different algorithms. HyperGBM instead defines a high-level search space that covers almost every component of machine learning modelling, from data cleaning to algorithm optimization. This end-to-end optimization is closer to a sequential decision process (SDP). HyperGBM therefore combines a meta-learner with search algorithms such as reinforcement learning and Monte-Carlo tree search to solve the full-pipeline optimization problem more effectively. These strategies have proven effective in practice.
For the machine learning models, HyperGBM uses popular gradient-boosting tree models such as XGBoost, LightGBM and HistGradientBoosting. HyperGBM also builds on many advanced features of CompeteExperiment from Hypernets for data cleaning, feature engineering and model ensembling.
The optimization algorithms, representations of search space and CompeteExperiment are based on Hypernets.
Features
HyperGBM can run in three modes:
Single node: runs on a single machine and uses pandas and NumPy data types
Distributed with single node: runs on a single machine and uses Dask data types, which requires creating Dask collections before using HyperGBM
Distributed with multi nodes: runs on multiple machines and uses Dask data types, which requires creating Dask collections that manage the resources of multiple machines before using HyperGBM
The supported features differ between running modes, as shown in the following table:
| Category | Feature | Single machine | Distributed with single node | Distributed with multi nodes |
|---|---|---|---|---|
| Data cleaning | Empty characters handling | √ | √ | √ |
| | Recognizing column types automatically | √ | √ | √ |
| | Column types correction | √ | √ | √ |
| | Constant columns cleaning | √ | √ | √ |
| | Repeated columns cleaning | √ | √ | √ |
| | Deleting examples without targets | √ | √ | √ |
| | Illegal characters replacing | √ | √ | √ |
| | id columns cleaning | √ | √ | √ |
| Dataset splitting | Splitting by ratio | √ | √ | √ |
| | Adversarial validation | √ | √ | √ |
| Feature engineering | Feature generation | √ | √ | √ |
| | Feature dimension reduction | √ | √ | √ |
| Data preprocessing | SimpleImputer | √ | √ | √ |
| | SafeOrdinalEncoder | √ | √ | √ |
| | SafeOneHotEncoder | √ | √ | √ |
| | TruncatedSVD | √ | √ | √ |
| | StandardScaler | √ | √ | √ |
| | MinMaxScaler | √ | √ | √ |
| | MaxAbsScaler | √ | √ | √ |
| | RobustScaler | √ | √ | √ |
| Imbalanced data handling | ClassWeight | √ | √ | √ |
| | UnderSampling (NearMiss, TomekLinks, Random) | √ | | |
| | OverSampling (SMOTE, ADASYN, Random) | √ | | |
| Search algorithms | MCTS | √ | √ | √ |
| | Evolution | √ | √ | √ |
| | Random search | √ | √ | √ |
| | Play back | √ | √ | √ |
| Early stopping | Time limit | √ | √ | √ |
| | No improvement after n trials | √ | √ | √ |
| | expected_reward | √ | √ | √ |
| | Trial discriminator | √ | √ | √ |
| Modeling algorithms | XGBoost | √ | √ | √ |
| | LightGBM | √ | √ | √ |
| | CatBoost | √ | √ | |
| | HistGradientBoosting | √ | | |
| Evaluation | Cross-Validation | √ | √ | √ |
| | Train-Validation-Holdout | √ | √ | √ |
| Advanced features | Automatic task type inference | √ | √ | √ |
| | Collinearity detection | √ | √ | √ |
| | Data drift detection | √ | √ | √ |
| | Feature selection | √ | √ | √ |
| | Feature selection (Two-stage) | √ | √ | √ |
| | Pseudo label (Two-stage) | √ | √ | √ |
| | Pre-searching with UnderSampling | √ | √ | √ |
| | Model ensemble | √ | √ | √ |
Installing HyperGBM
We recommend installing HyperGBM with pip. Installing and using HyperGBM in a Docker container is also possible if you have a Docker environment.
pip
Python 3.6 or above is required to install HyperGBM. Install HyperGBM with pip as follows:
pip install hypergbm
This installs HyperGBM and its required packages. To also install optional dependencies such as shap, the following command is recommended:
pip install hypergbm[all]
Docker
It is possible to use HyperGBM in a Docker container. To do this, users can install HyperGBM with pip in the Dockerfile. We also publish an image on Docker Hub which can be downloaded directly and includes the following components:
Python3.7
HyperGBM and its dependent packages
JupyterLab
Download the image:
docker pull datacanvas/hypergbm
Use the image:
docker run -ti -e NotebookToken="your-token" -p 8888:8888 datacanvas/hypergbm
Then visit http://<your-ip>:8888 in the browser and enter the token to start.
Quick Start
This section introduces the main features of HyperGBM. It assumes that you are familiar with machine learning basics such as loading data and training a model. If you have not installed HyperGBM yet, please refer to [Installation](installation.md). You can use HyperGBM from the command line or from Python.
Use HyperGBM with Python
HyperGBM is developed in Python. We recommend using the Python utility make_experiment to create an experiment and train the model.
The basic steps for training a model with make_experiment are as follows:
Prepare the dataset (pandas or Dask DataFrame)
Create an experiment with make_experiment
Call the .run() method of the experiment to perform training and get the model
Predict with the trained model or save it with the Python module pickle
Prepare the dataset
The dataset can be loaded as a pandas or Dask DataFrame, depending on your task.
Taking the sklearn dataset breast_cancer as an example, the dataset can be loaded as follows:
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
X,y = datasets.load_breast_cancer(as_frame=True,return_X_y=True)
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.7,random_state=335)
train_data = pd.concat([X_train,y_train],axis=1)
where train_data is used for model training, while X_test and y_test are used for evaluating the model.
Create experiment with make_experiment
Create an experiment for the prepared dataset and start training the model as follows:
from hypergbm import make_experiment
experiment = make_experiment(train_data, target='target', reward_metric='precision')
estimator = experiment.run()
where estimator is the trained model.
Save the model
It is recommended to save the model with pickle:
import pickle
with open('model.pkl','wb') as f:
    pickle.dump(estimator, f)
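A model saved this way can later be restored with pickle.load. The following is a minimal sketch, not part of HyperGBM: load_model is a hypothetical helper name, and a plain dictionary stands in for the trained estimator so that the snippet runs without HyperGBM installed. Unpickling a real model requires the packages it was trained with (including HyperGBM) to be importable, and pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path


def load_model(path):
    """Load a pickled object (for example a trained estimator) from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object so the sketch runs without a trained model;
# in practice this would be the `estimator` returned by experiment.run().
placeholder_model = {"name": "placeholder"}

# Save to a temporary location, then load it back.
model_path = Path(tempfile.mkdtemp()) / "model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(placeholder_model, f)

restored = load_model(model_path)
```

With a real model, the restored object can be used exactly like the original estimator, for example by calling its predict method.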
Evaluate the model
The model can be evaluated with tools provided by sklearn:
from sklearn.metrics import classification_report
y_pred=estimator.predict(X_test)
print(classification_report(y_test, y_pred, digits=5))
output:
precision recall f1-score support
0 0.96429 0.93103 0.94737 58
1 0.96522 0.98230 0.97368 113
accuracy 0.96491 171
macro avg 0.96475 0.95667 0.96053 171
weighted avg 0.96490 0.96491 0.96476 171
More info:
Please refer to the docstring of make_experiment for more information about it:
print(make_experiment.__doc__)
If you are using Notebook or IPython, the following code can provide more information about make_experiment:
make_experiment?
Use HyperGBM with Command Line
HyperGBM offers the command line tool hypergbm to perform model training, evaluation and prediction. The following command shows the command line help:
hypergbm -h
usage: hypergbm [-h] [--log-level LOG_LEVEL] [-error] [-warn] [-info] [-debug]
[--verbose VERBOSE] [-v] [--enable-dask ENABLE_DASK] [-dask]
[--overload OVERLOAD]
{train,evaluate,predict} ...
hypergbm offers three commands: train, evaluate and predict. To get more information, use hypergbm <command> -h:
hypergbm train -h
usage: hypergbm train [-h] --train-data TRAIN_DATA [--eval-data EVAL_DATA]
[--test-data TEST_DATA]
[--train-test-split-strategy {None,adversarial_validation}]
[--target TARGET]
[--task {binary,multiclass,regression}]
[--max-trials MAX_TRIALS] [--reward-metric METRIC]
[--cv CV] [-cv] [-cv-] [--cv-num-folds NUM_FOLDS]
[--pos-label POS_LABEL]
...
Prepare the Data
When training a model from the command line, the training data must be saved in a CSV or Parquet file. The trained model is saved as a pickle file ending with .pkl.
Taking the Bank Marketing data as an example, the data can be prepared as follows:
from hypernets.tabular.datasets import dsutils
from sklearn.model_selection import train_test_split
df = dsutils.load_bank().head(10000)
df_train, df_test = train_test_split(df, test_size=0.3, random_state=9527)
df_train.to_csv('bank_train.csv', index=None)
df_test.to_csv('bank_eval.csv', index=None)
df_test.pop('y')
df_test.to_csv('bank_to_pred.csv', index=None)
where
bank_train.csv is used for training
bank_eval.csv is used for evaluating the model
bank_to_pred.csv is data without targets for predicting
Train the Model
After preparing the data, train the model from the command line:
hypergbm train --train-data bank_train.csv --target y --model-file model.pkl
The file model.pkl is created when training finishes:
ls -l model.pkl
rw-rw-r-- 1 xx xx 9154959 17:09 model.pkl
Evaluate the Model
The trained model can be evaluated with the evaluation data:
hypergbm evaluate --model model.pkl --data bank_eval.csv --metric f1 recall auc
{'f1': 0.7993779160186626, 'recall': 0.7099447513812155, 'auc': 0.9705420982746849}
Predict the Test Data
The trained model can be used to predict on given data as follows:
hypergbm predict --model model.pkl --data bank_to_pred.csv --output bank_output.csv
The prediction results are saved to bank_output.csv.
To add other columns of the input data to the output file, use the parameter --with-data explicitly:
hypergbm predict --model model.pkl --data bank_to_pred.csv --output bank_output.csv --with-data id
head bank_output.csv
id,y
1563,no
124,no
218,no
463,no
...
Furthermore, all columns of the test data can be included alongside the prediction results in bank_output.csv by setting --with-data to '*':
hypergbm predict --model model.pkl --data bank_to_pred.csv --output bank_output.csv --with-data '*'
head bank_output.csv
id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
1563,55,entrepreneur,married,secondary,no,204,no,no,cellular,14,jul,455,13,-1,0,unknown,no
124,51,management,single,tertiary,yes,-55,yes,no,cellular,11,may,281,2,266,6,failure,no
218,49,blue-collar,married,primary,no,305,yes,yes,telephone,10,jul,834,10,-1,0,unknown,no
463,35,blue-collar,divorced,secondary,no,3102,yes,no,cellular,20,nov,138,1,-1,0,unknown,no
2058,50,management,divorced,tertiary,no,201,yes,no,cellular,24,jul,248,1,-1,0,unknown,no
...
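The prediction file is a plain CSV, so it can be post-processed with standard tools. As a small sketch, the snippet below counts the predicted labels using only the Python standard library. The function name count_predictions is hypothetical, and the in-memory string reproduces only the four rows of bank_output.csv shown above; a real run would open the file instead.

```python
import csv
from collections import Counter
from io import StringIO

# First rows of bank_output.csv as shown above (id and predicted label).
sample_output = """id,y
1563,no
124,no
218,no
463,no
"""


def count_predictions(handle, label_column="y"):
    """Count how often each predicted label occurs in a CSV file handle."""
    reader = csv.DictReader(handle)
    return Counter(row[label_column] for row in reader)


counts = count_predictions(StringIO(sample_output))
print(counts)
```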
Examples
Basic Applications
This section provides an example of how to train a model with an experiment. The example uses the blood dataset, which is loaded from hypernets.tabular. The first rows of this dataset are as follows:
Recency,Frequency,Monetary,Time,Class
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
2,20,5000,45,1
1,24,6000,77,0
4,4,1000,4,0
...
Create and Run an Experiment
The utility make_experiment creates an executable experiment object. Its only required parameter is train_data. Calling the run method of the created experiment object then starts training and returns a model. Note that if the target column of the data is not named y, it must be set explicitly through the parameter target.
Example code:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class')
estimator = experiment.run()
print(estimator)
output:
Pipeline(steps=[('data_clean',
DataCleanStep(...),
('estimator',
GreedyEnsemble(...)])
Training returns a Pipeline whose final estimator is an ensemble of multiple models.
For training data stored in a .csv or .parquet file, the experiment can be created from the file path directly, and make_experiment will load the data as a DataFrame automatically. For example:
from hypergbm import make_experiment
train_data = '/path/to/mydata.csv'
experiment = make_experiment(train_data, target='my_target')
estimator = experiment.run()
print(estimator)
Set the Number of Search Trials
The maximum number of search trials can be set with max_trials.
The following code sets the maximum number of search trials to 100:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', max_trials=100)
estimator = experiment.run()
print(estimator)
Use Cross Validation
Cross validation can be configured with the parameter cv. Setting cv to False makes the experiment skip cross validation and use train_test_split instead. When cv is True, the experiment uses cross validation, where the number of folds can be adjusted through the parameter num_folds. The default value of num_folds is 3.
Example code with cv=True:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=True, num_folds=5)
estimator = experiment.run()
print(estimator)
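To illustrate what num_folds controls, the following sketch generates plain k-fold index splits in pure Python. It is an illustration of the general technique only: the function name k_fold_indices is hypothetical, and the splitter HyperGBM actually uses (for example whether it shuffles or stratifies) is not described here.

```python
def k_fold_indices(n_samples, num_folds=3):
    """Yield (train_indices, valid_indices) pairs for contiguous k-fold splits."""
    if not 2 <= num_folds <= n_samples:
        raise ValueError("num_folds must be between 2 and n_samples")
    # Distribute the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // num_folds + (1 if i < n_samples % num_folds else 0)
        for i in range(num_folds)
    ]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size


folds = list(k_fold_indices(10, num_folds=3))
for train_idx, valid_idx in folds:
    print(len(train_idx), "training rows,", len(valid_idx), "validation rows")
```

Every row is used for validation exactly once, which is why cross validation gives a more stable performance estimate than a single train/validation split, at the cost of training one model per fold.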
Evaluation dataset
When cv=False, the model's performance must additionally be evaluated on an evaluation dataset. This can be provided by setting eval_data when calling make_experiment. For example:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
from sklearn.model_selection import train_test_split
train_data = dsutils.load_blood()
train_data,eval_data=train_test_split(train_data,test_size=0.3)
experiment = make_experiment(train_data, target='Class', eval_data=eval_data, cv=False)
estimator = experiment.run()
print(estimator)
If eval_data is not given, the experiment object splits train_data to obtain an evaluation dataset, whose size can be adjusted by setting eval_size:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=False, eval_size=0.2)
estimator = experiment.run()
print(estimator)
Set the Evaluation Criterion
When creating an experiment with make_experiment, the default evaluation criterion is accuracy for classification tasks and rmse for regression tasks. Other criteria can be used by setting reward_metric. For example:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', reward_metric='auc')
estimator = experiment.run()
print(estimator)
Set the Early Stopping
The early stopping strategy can be configured with early_stopping_round, early_stopping_time_limit and early_stopping_reward.
The following code sets the maximum search time to 3 hours:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', max_trials=300, early_stopping_time_limit=3600 * 3)
estimator = experiment.run()
print(estimator)
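The three early stopping criteria can be pictured as checks performed after every trial. The sketch below is a simplified illustration in pure Python, not the Hypernets implementation: run_search and its parameters rounds, time_limit and reward are hypothetical names that mirror the settings above, and the reward is assumed to be maximized.

```python
import time
from itertools import islice


def run_search(trial_rewards, max_trials, rounds=None, time_limit=None, reward=None):
    """Consume trial rewards until a stopping criterion is met.

    Returns (best_reward, reason). Each criterion is skipped when set to None.
    """
    start = time.monotonic()
    best_reward = float("-inf")
    trials_without_improvement = 0
    for trial_reward in islice(trial_rewards, max_trials):
        if trial_reward > best_reward:
            best_reward = trial_reward
            trials_without_improvement = 0
        else:
            trials_without_improvement += 1
        # Stop once the expected reward has been reached.
        if reward is not None and best_reward >= reward:
            return best_reward, "expected_reward"
        # Stop after `rounds` consecutive trials without improvement.
        if rounds is not None and trials_without_improvement >= rounds:
            return best_reward, "no_improvement"
        # Stop when the time budget (in seconds) is used up.
        if time_limit is not None and time.monotonic() - start >= time_limit:
            return best_reward, "time_limit"
    # Reached max_trials (or ran out of trial rewards in this toy setup).
    return best_reward, "max_trials"


rewards = [0.5, 0.6, 0.6, 0.6, 0.9]
print(run_search(rewards, max_trials=300, rounds=2))
print(run_search(rewards, max_trials=300, reward=0.6))
```

Note how stopping after two trials without improvement ends the search before the trial with reward 0.9 is reached; a small round limit saves time but can miss late improvements.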
Choose a Searcher
HyperGBM performs hyperparameter search with the search algorithms provided by Hypernets, which include EvolutionSearcher, MCTSSearcher and RandomSearcher. A specific searcher can be chosen when using make_experiment by setting the parameter searcher.
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', searcher='random')
estimator = experiment.run()
print(estimator)
Furthermore, you can create a searcher object yourself and pass it to the experiment. For example:
from hypergbm import make_experiment
from hypergbm.search_space import search_space_general
from hypernets.searchers import MCTSSearcher
from hypernets.tabular.datasets import dsutils
my_searcher = MCTSSearcher(lambda: search_space_general(n_estimators=100),
max_node_space=20,
optimize_direction='max')
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', searcher=my_searcher)
estimator = experiment.run()
print(estimator)
Ensemble Models
make_experiment turns on model ensembling by default to get a better model. It ensembles the best 20 models; this number can be changed by setting ensemble_size as in the following code, where ensemble_size=0 disables ensembling.
train_data = ...
experiment = make_experiment(train_data, ensemble_size=10, ...)
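The idea behind a greedy ensemble can be sketched as follows: models are added one at a time, each time choosing the model whose inclusion gives the best averaged prediction on validation data. This is a generic illustration of greedy ensemble selection in pure Python, under the assumptions that models may be picked more than once and that mean squared error is the metric; it is not the GreedyEnsemble code from Hypernets, and all names and numbers are made up for the example.

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def greedy_ensemble(model_predictions, y_true, ensemble_size):
    """Greedily pick models (with replacement) whose averaged predictions
    minimize the mean squared error on y_true. Returns the chosen indices."""
    chosen = []
    running_sum = [0.0] * len(y_true)
    for _ in range(ensemble_size):
        best_idx, best_error = None, float("inf")
        for idx, preds in enumerate(model_predictions):
            count = len(chosen) + 1
            averaged = [(s + p) / count for s, p in zip(running_sum, preds)]
            error = mean_squared_error(y_true, averaged)
            if error < best_error:
                best_idx, best_error = idx, error
        chosen.append(best_idx)
        running_sum = [
            s + p for s, p in zip(running_sum, model_predictions[best_idx])
        ]
    return chosen


# Toy validation labels and predicted probabilities of three models.
y_valid = [1, 0, 1, 0]
model_predictions = [
    [0.9, 0.2, 0.6, 0.4],  # model 0
    [0.6, 0.4, 0.9, 0.1],  # model 1
    [0.5, 0.5, 0.5, 0.5],  # model 2
]
selected = greedy_ensemble(model_predictions, y_valid, ensemble_size=2)
print(selected)
```

In this toy example the best single model is picked first, and the second pick is the model that complements it best rather than the second-best model on its own.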
Change the log level
The progress messages during training can be shown by setting log_level (str or int) to change the log level. Please refer to the logging package of Python for further details. More detailed messages are shown when verbose is set to 1.
The following code sets the log level to 'INFO':
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', log_level='INFO', verbose=1)
estimator = experiment.run()
print(estimator)
Output:
14:24:33 I hypernets.tabular.u._common.py 30 - 2 class detected, {0, 1}, so inferred as a [binary classification] task
14:24:33 I hypergbm.experiment.py 699 - create experiment with ['data_clean', 'drift_detection', 'space_search', 'final_ensemble']
14:24:33 I hypergbm.experiment.py 1262 - make_experiment with train data:(748, 4), test data:None, eval data:None, target:Class
14:24:33 I hypergbm.experiment.py 716 - fit_transform data_clean
14:24:33 I hypergbm.experiment.py 716 - fit_transform drift_detection
14:24:33 I hypergbm.experiment.py 716 - fit_transform space_search
14:24:33 I hypernets.c.meta_learner.py 22 - Initialize Meta Learner: dataset_id:7123e0d8c8bbbac8797ed9e42352dc59
14:24:33 I hypernets.c.callbacks.py 192 -
Trial No:1
--------------------------------------------------------------
(0) estimator_options.hp_or: 0
(1) numeric_imputer_0.strategy: most_frequent
(2) numeric_scaler_optional_0.hp_opt: True
...
14:24:35 I hypergbm.experiment.py 716 - fit_transform final_ensemble
14:24:35 I hypergbm.experiment.py 737 - trained experiment pipeline: ['data_clean', 'estimator']
Pipeline(steps=[('data_clean',
DataCleanStep(...),
('estimator',
GreedyEnsemble(...)
Advanced applications
make_experiment in HyperGBM creates an instance of CompeteExperiment from Hypernets. This section covers many of the advanced features of CompeteExperiment.
Data cleaning
The first step of CompeteExperiment is data cleaning with DataCleaner from Hypernets. This step cannot be disabled, but it can be adjusted with the following DataCleaner parameters:
nan_chars: value or list, (default None), characters to be replaced with np.nan
correct_object_dtype: bool, (default True), whether to correct the data types
drop_constant_columns: bool, (default True), whether to drop constant columns
drop_duplicated_columns: bool, (default False), whether to drop duplicated columns
drop_idness_columns: bool, (default True), whether to drop id columns
drop_label_nan_rows: bool, (default True), whether to drop rows whose target value is np.nan
replace_inf_values: (default np.nan), the value used to replace np.inf
drop_columns: list, (default None), columns to drop
reserve_columns: list, (default None), columns to keep when performing data cleaning
reduce_mem_usage: bool, (default False), whether to try to reduce the memory usage
int_convert_to: str, (default 'float'), the type to convert int columns to, None for no conversion
If nan is represented by '\N' in the data, '\N' can be replaced with np.nan during data cleaning as follows:
from hypergbm import make_experiment
train_data = ...
experiment = make_experiment(train_data, target='...',
data_cleaner_args={'nan_chars': r'\N'})
...
Feature generation
CompeteExperiment can perform feature generation, which is turned on by setting feature_generation=True when creating the experiment with make_experiment. There are several options:
feature_generation_continuous_cols: list (default None), continuous features, inferred automatically if set to None.
feature_generation_categories_cols: list (default None), categorical features, must be set explicitly because CompeteExperiment cannot infer them automatically.
feature_generation_datetime_cols: list (default None), datetime features, inferred automatically if set to None.
feature_generation_latlong_cols: list (default None), latitude and longitude features, inferred automatically if set to None.
feature_generation_text_cols: list (default None), text features, inferred automatically if set to None.
feature_generation_trans_primitives: list (default None), transformations for feature generation, inferred automatically if set to None.
When feature_generation_trans_primitives=None, CompeteExperiment infers the transformations from the column types. Specifically, different transformations are adopted for different types:
continuous_cols: None, need to be set explicitly.
categories_cols: cross_categorical.
datetime_cols: month, week, day, hour, minute, second, weekday, is_weekend.
latlong_cols: haversine, geohash.
text_cols: tfidf.
Example code for enabling feature generation:
from hypergbm import make_experiment
train_data = ...
experiment = make_experiment(train_data,
feature_generation=True,
...)
...
Please refer to [featuretools](https://docs.featuretools.com/) for more information.
Collinearity detection
Features that are highly correlated with each other often carry little additional information and act more like noise. They are not very useful. Worse, the dataset is affected more heavily by drift in these features.
CompeteExperiment can handle such collinear features. This is enabled by setting collinearity_detection=True when creating the experiment.
Example code for using collinearity detection:
from hypergbm import make_experiment
train_data = ...
experiment = make_experiment(train_data, target='...', collinearity_detection=True)
...
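To see what collinearity looks like in practice, the sketch below computes pairwise Pearson correlations on the first rows of the blood dataset shown earlier and flags strongly correlated column pairs. It is an illustration only: the helper names and the 0.99 threshold are assumptions, and the algorithm CompeteExperiment actually uses for collinearity detection is not described here.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # constant column: correlation is undefined, treat as 0
    return cov / (std_x * std_y)


def collinear_pairs(columns, threshold=0.99):
    """Return column pairs whose absolute correlation exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# First six rows of the blood dataset, as listed in the example above.
blood_sample = {
    "Recency": [2, 0, 1, 2, 1, 4],
    "Frequency": [50, 13, 16, 20, 24, 4],
    "Monetary": [12500, 3250, 4000, 5000, 6000, 1000],
    "Time": [98, 28, 35, 45, 77, 4],
}
pairs = collinear_pairs(blood_sample)
print(pairs)
```

In these rows Monetary is always 250 times Frequency, so the two columns carry the same information and one of them could be dropped without losing anything.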
Drift detection
Concept drift is one of the major challenges in machine learning. Models often perform worse in practice because the data distribution changes over time. To handle this problem, CompeteExperiment adopts adversarial validation to detect drifted features and drop them to maintain good performance.
To enable drift detection, set drift_detection=True when creating the experiment and provide test_data.
Relevant parameters:
drift_detection_remove_shift_variable: bool, (default=True), whether to check the stability of every column first.
drift_detection_variable_shift_threshold: float, (default=0.7), columns with stability scores higher than this value will be dropped.
drift_detection_threshold: float, (default=0.7), columns with detection scores higher than this value will be dropped.
drift_detection_remove_size: float, (default=0.1), ratio of columns to be dropped.
drift_detection_min_features: int, (default=10), the minimal number of columns to be kept.
drift_detection_num_folds: int, (default=5), the number of folds for cross validation.
A code example:
from io import StringIO
import pandas as pd
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
test_data = """
Recency,Frequency,Monetary,Time
2,10,2500,64
4,5,1250,23
4,9,2250,46
4,5,1250,23
4,8,2000,40
2,12,3000,82
11,24,6000,64
2,7,1750,46
4,11,2750,61
1,7,1750,57
2,11,2750,79
2,3,750,16
4,5,1250,26
2,6,1500,41
"""
train_data = dsutils.load_blood()
test_df = pd.read_csv(StringIO(test_data))
experiment = make_experiment(train_data, test_data=test_df,
drift_detection=True,
...)
...
Feature selection
CompeteExperiment evaluates the importance of features by training a model, then keeps the most important ones for the subsequent model training.
To enable feature selection, set feature_selection=True when creating the experiment. Relevant parameters:
feature_selection_strategy: str, selection strategy (default threshold), can be chosen from threshold, number and quantile.
feature_selection_threshold: float, (default 0.1), selection threshold when the strategy is threshold, features with scores higher than this threshold will be selected.
feature_selection_quantile: float, (default 0.2), selection threshold when the strategy is quantile, features with scores higher than this threshold will be selected.
feature_selection_number: int or float, (default 0.8), the number of features to select when the strategy is number.
Example code:
from hypergbm import make_experiment
train_data=...
experiment = make_experiment(train_data,
feature_selection=True,
feature_selection_strategy='quantile',
feature_selection_quantile=0.3,
...)
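The three selection strategies can be sketched in pure Python as follows. This is an interpretation of the parameter descriptions above, not the Hypernets code: it assumes that threshold keeps features scoring above a fixed value, that quantile keeps features scoring above the given quantile of all scores (computed with linear interpolation), and that number keeps the top n features, where a float is read as a fraction of all features. The function names and importance values are made up for the example.

```python
import math


def quantile(sorted_values, q):
    """q-quantile of ascending values, using linear interpolation."""
    pos = q * (len(sorted_values) - 1)
    lower, upper = math.floor(pos), math.ceil(pos)
    fraction = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def select_features(importances, strategy="threshold",
                    threshold=0.1, quantile_value=0.2, number=0.8):
    """Return the selected feature names, most important first."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    if strategy == "threshold":
        return [name for name in ranked if importances[name] > threshold]
    if strategy == "quantile":
        cutoff = quantile(sorted(importances.values()), quantile_value)
        return [name for name in ranked if importances[name] > cutoff]
    if strategy == "number":
        # An int is an absolute count, a float a fraction of all features.
        if isinstance(number, int):
            count = number
        else:
            count = max(1, round(number * len(ranked)))
        return ranked[:count]
    raise ValueError(f"unknown strategy: {strategy}")


importances = {"a": 0.40, "b": 0.30, "c": 0.15, "d": 0.10, "e": 0.05}
print(select_features(importances, strategy="threshold"))
print(select_features(importances, strategy="quantile"))
print(select_features(importances, strategy="number"))
```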
UnderSampling pre-search
Normally, hyperparameter optimization uses all the training data. However, this costs a huge amount of time for a large dataset. To alleviate this problem, a pre-search can be run on only a part of the data to try more model parameters in the same amount of time. The better parameters are then used for training with the whole dataset to obtain the optimal parameters.
To enable the pre-search, set down_sample_search=True when creating the experiment. Relevant parameters:
down_sample_search_size: int, float(0.0~1.0) or dict (default 0.1), number of examples used for pre-search.
down_sample_search_time_limit: int, (default early_stopping_time_limit*0.33), time limit for pre-search.
down_sample_search_max_trials: int, (default max_trials*3), maximum number of trials for pre-search.
Example code:
from hypergbm import make_experiment
train_data=...
experiment = make_experiment(train_data,
down_sample_search=True,
down_sample_search_size=0.2,
...)
The second stage feature selection
CompeteExperiment supports continuing data processing with the trained models, which is called two-stage search. CompeteExperiment supports two kinds of two-stage processing: two-stage feature selection and pseudo labeling, which are covered in the rest of this section.
In CompeteExperiment, the second-stage feature selection chooses the models that performed well in the first stage and evaluates them with permutation_importance to select better features.
To enable the second-stage feature selection, set feature_reselection=True when creating the experiment. Relevant parameters:
feature_reselection_estimator_size: int, (default=10), the number of models used for evaluating feature importance (the top n models of the first stage).
feature_reselection_strategy: str, selection strategy (default threshold), available strategies are threshold, number and quantile.
feature_reselection_threshold: float, (default 1e-5), threshold when the selection strategy is threshold, features with importance scores higher than this value will be chosen.
feature_reselection_quantile: float, (default 0.2), threshold when the selection strategy is quantile, features with importance scores higher than this value will be chosen.
feature_reselection_number: int or float, (default 0.8), the number of features to be selected when the strategy is number.
Example code:
from hypergbm import make_experiment
train_data=...
experiment = make_experiment(train_data,
feature_reselection=True,
...)
Please refer to [scikit-learn](https://scikit-learn.org/stable/modules/permutation_importance.html) for more information about permutation_importance.
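The idea of permutation importance can be shown with a small pure-Python sketch: shuffle one feature column at a time and measure how much the model's score drops. This illustrates the general technique only and is not the scikit-learn or Hypernets implementation; the toy model, data and function names are made up for the example.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Mean drop in `metric` when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = metric(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Copy the rows, replacing only the shuffled column.
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            drops.append(baseline - metric(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Toy model that only looks at the first feature."""
    return 1 if row[0] > 0.5 else 0


X = [[0.1, 5.0], [0.9, 3.0], [0.2, 4.0], [0.8, 1.0],
     [0.3, 2.0], [0.7, 6.0], [0.4, 8.0], [0.6, 7.0]]
y = [0, 1, 0, 1, 0, 1, 0, 1]
importances = permutation_importance(toy_predict, X, y, accuracy)
print(importances)
```

Shuffling the second column never changes the toy model's predictions, so its importance is exactly zero, while shuffling the first column lowers the accuracy.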
Pseudo label
Pseudo labeling is a semi-supervised machine learning method. It assigns labels predicted by the model trained in the first stage to examples in the test data. Examples whose confidence is higher than a threshold are then added to the training set to train the model again.
To enable pseudo labeling, set pseudo_labeling=True when creating the experiment. Relevant parameters:
pseudo_labeling_strategy: str, selection strategy (default threshold), available strategies are threshold, number and quantile.
pseudo_labeling_proba_threshold: float (default 0.8), threshold when the selection strategy is threshold, examples with confidence scores higher than this value will be chosen.
pseudo_labeling_proba_quantile: float (default 0.8), threshold when the selection strategy is quantile, examples with confidence scores higher than this value will be chosen.
pseudo_labeling_sample_number: float(0.0~1.0) or int (default 0.2), the number of top examples to be selected when the strategy is number.
pseudo_labeling_resplit: bool (default=False), whether to re-split the training and validation sets after adding the pseudo-labeled examples. If False, all pseudo-labeled examples are added to the training set. Otherwise, the experiment splits the new dataset, including the pseudo-labeled examples, into training and validation sets.
Example code:
from hypergbm import make_experiment
train_data=...
test_data=...
experiment = make_experiment(train_data,
test_data=test_data,
pseudo_labeling=True,
...)
Note: pseudo labeling is only valid for classification tasks.
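The threshold strategy of pseudo labeling can be sketched as follows: for every test example, take the most probable class and keep the example only if that probability exceeds the threshold. This is a simplified pure-Python illustration with made-up probabilities; select_pseudo_labels is a hypothetical name and not part of the HyperGBM API.

```python
def select_pseudo_labels(probabilities, classes, threshold=0.8):
    """Return (row_index, label) for rows whose top class probability
    is strictly higher than the threshold."""
    selected = []
    for idx, row in enumerate(probabilities):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] > threshold:
            selected.append((idx, classes[best]))
    return selected


# Made-up class probabilities predicted for four test examples.
classes = ["no", "yes"]
probabilities = [
    [0.95, 0.05],  # confident "no"
    [0.55, 0.45],  # uncertain, skipped
    [0.10, 0.90],  # confident "yes"
    [0.80, 0.20],  # exactly at the threshold, skipped
]
pseudo_labeled = select_pseudo_labels(probabilities, classes)
print(pseudo_labeled)
```

Only the selected rows, together with their predicted labels, would be appended to the training data before the model is trained again.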
Handling Imbalanced Data
Imbalanced data is one of the most frequently encountered challenges in practice and usually leads to unsatisfactory models. To alleviate this problem, HyperGBM supports the following two solutions:
Adopt ClassWeight
When building a model such as LightGBM, one first computes the class distribution and then assigns different weights to different classes according to that distribution when computing the loss. To enable the ClassWeight algorithm, simply set the parameter class_balancing='ClassWeight' when using make_experiment.
from hypergbm import make_experiment
train_data = ...
experiment = make_experiment(train_data,
class_balancing='ClassWeight',
...)
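As an illustration of class weighting, the sketch below computes weights that are inversely proportional to class frequency, using the common "balanced" heuristic n_samples / (n_classes * class_count). Whether HyperGBM uses exactly this formula is an assumption; the function name and labels are made up for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Imbalanced toy target: 8 negative and 2 positive examples.
labels = ["no"] * 8 + ["yes"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, an error on the rare class contributes four times as much to the loss as an error on the frequent class, which offsets the 8:2 imbalance.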
UnderSampling and OverSampling
The most common approach to the data imbalance problem is to modify the data distribution to obtain a more balanced dataset and then train the model on the modified dataset. Currently, HyperGBM supports several resampling strategies including RandomOverSampler, SMOTE, ADASYN, RandomUnderSampler, NearMiss, TomekLinks, and EditedNearestNeighbours. To enable one of them, set class_balancing='<selected strategy>' when using make_experiment. Example code:
from hypergbm import make_experiment
train_data = ...
experiment = make_experiment(train_data,
class_balancing='SMOTE',
...)
For more information regarding these sampling methods, please see imbalanced-learn.
Search Space
When the search space is not defined explicitly, make_experiment uses search_space_general, which is defined as follows:
search_space_general = GeneralSearchSpaceGenerator(n_estimators=200)
Define Search Space
To use a specific search space, set the parameter search_space when calling make_experiment. For example, to set max_depth to 20 for XGBoost:
from hypergbm import make_experiment
from hypergbm.search_space import GeneralSearchSpaceGenerator
my_search_space = \
GeneralSearchSpaceGenerator(n_estimators=200, xgb_init_kwargs={'max_depth': 20})
train_data = ...
experiment = make_experiment(train_data,
search_space=my_search_space,
...)
If you want to use searchable parameters, we recommend defining a subclass of GeneralSearchSpaceGenerator. For example, to let the algorithm search among 3 choices of max_depth for xgboost:
from hypergbm import make_experiment
from hypergbm.search_space import GeneralSearchSpaceGenerator
from hypernets.core.search_space import Choice


class MySearchSpace(GeneralSearchSpaceGenerator):
    @property
    def default_xgb_init_kwargs(self):
        # Keep the default xgboost settings and override only max_depth
        return {**super().default_xgb_init_kwargs,
                'max_depth': Choice([10, 20, 30]),
                }


my_search_space = MySearchSpace()

train_data = ...
experiment = make_experiment(train_data,
                             search_space=my_search_space,
                             ...)
Support Machine Learning Models
HyperGBM already supports XGBoost, LightGBM, CatBoost, and HistGradientBoosting. They are used as components of the search space when searching for a model. Support for other machine learning algorithms can be added in 3 steps:
Encapsulate your algorithm as a subclass of HyperEstimator
Add the encapsulated algorithm to the search space and define its search parameters
Use your search space in make_experiment
Please see the following example:
from sklearn import svm

from hypergbm import make_experiment
from hypergbm.estimators import HyperEstimator
from hypergbm.search_space import GeneralSearchSpaceGenerator
from hypernets.core.search_space import Choice, Int, Real
from hypernets.tabular.datasets import dsutils


class SVMEstimator(HyperEstimator):
    def __init__(self, fit_kwargs, C=1.0, kernel='rbf', gamma='auto', degree=3, random_state=666, probability=True,
                 decision_function_shape=None, space=None, name=None, **kwargs):
        # Forward only the parameters that are not None
        if C is not None:
            kwargs['C'] = C
        if kernel is not None:
            kwargs['kernel'] = kernel
        if gamma is not None:
            kwargs['gamma'] = gamma
        if degree is not None:
            kwargs['degree'] = degree
        if random_state is not None:
            kwargs['random_state'] = random_state
        if decision_function_shape is not None:
            kwargs['decision_function_shape'] = decision_function_shape
        kwargs['probability'] = probability
        HyperEstimator.__init__(self, fit_kwargs, space, name, **kwargs)

    def _build_estimator(self, task, kwargs):
        if task == 'regression':
            # svm.SVR does not accept these classifier-only options
            reg_kwargs = {k: v for k, v in kwargs.items()
                          if k not in ('probability', 'random_state', 'decision_function_shape')}
            hsvm = SVMRegressorWrapper(**reg_kwargs)
        else:
            hsvm = SVMClassifierWrapper(**kwargs)
        hsvm.__dict__['task'] = task
        return hsvm


class SVMClassifierWrapper(svm.SVC):
    def fit(self, X, y=None, **kwargs):
        # Extra keyword arguments are accepted and ignored
        return super().fit(X, y)


class SVMRegressorWrapper(svm.SVR):
    def fit(self, X, y=None, **kwargs):
        # Extra keyword arguments are accepted and ignored
        return super().fit(X, y)
class GeneralSearchSpaceGeneratorPlusSVM(GeneralSearchSpaceGenerator):
    def __init__(self, enable_svm=True, **kwargs):
        super(GeneralSearchSpaceGeneratorPlusSVM, self).__init__(**kwargs)
        self.enable_svm = enable_svm

    @property
    def default_svm_init_kwargs(self):
        return {
            'C': Real(0.1, 5, 0.1),
            'kernel': Choice(['rbf', 'poly', 'sigmoid']),
            'degree': Int(1, 5),
            'gamma': Real(0.0001, 5, 0.0002)
        }

    @property
    def default_svm_fit_kwargs(self):
        return {}

    @property
    def estimators(self):
        r = super().estimators
        if self.enable_svm:
            r['svm'] = (SVMEstimator, self.default_svm_init_kwargs, self.default_svm_fit_kwargs)
        return r


my_search_space = GeneralSearchSpaceGeneratorPlusSVM()

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class',
                             search_space=my_search_space)
estimator = experiment.run()
print(estimator)
Distributed training
Quick Experiment
HyperGBM supports distributed training with Dask. Before training, a Dask cluster should be deployed and a Dask Client object should be initialized. Training data files with extensions such as csv and parquet can be passed to make_experiment directly by file path, and make_experiment will automatically load the data as a Dask DataFrame if a Dask environment is detected.
Suppose that your training data file is '/opt/data/my_data.csv'. The following code shows how to load the data and run an experiment on a single node:
from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils


def train():
    cluster = LocalCluster(processes=True)
    client = Client(cluster)

    train_data = '/opt/data/my_data.csv'

    experiment = make_experiment(train_data, target='...')
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()
For large-scale data, we recommend splitting the data into multiple files and saving them in a single location such as '/opt/data/my_data' to speed up the loading process. After doing this, one can create an experiment with the split files:
from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils


def train():
    cluster = LocalCluster(processes=True)
    client = Client(cluster)

    train_data = '/opt/data/my_data/*.parquet'

    experiment = make_experiment(train_data, target='...')
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()
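One possible way to produce such split files is sketched below with the Python standard library only. It cuts a large CSV into fixed-size part files that each repeat the header row. The function name split_csv and the part_NNNN.csv naming scheme are hypothetical, and whether make_experiment accepts a CSV glob such as '/opt/data/my_data/*.csv' in the same way as the parquet glob above is an assumption that should be checked against your installed version.

```python
import csv
import os


def split_csv(src_path, out_dir, rows_per_file):
    """Split a CSV file into part files of at most rows_per_file data rows.

    Hypothetical helper for illustration; every part file repeats the header.
    Returns the list of part file paths in creation order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with open(src_path, newline='') as src:
        reader = csv.reader(src)
        header = next(reader)
        out, writer, n = None, None, 0
        for row in reader:
            if n % rows_per_file == 0:
                # Start a new part file once the current one is full
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f'part_{len(paths):04d}.csv')
                out = open(path, 'w', newline='')
                writer = csv.writer(out)
                writer.writerow(header)
                paths.append(path)
            writer.writerow(row)
            n += 1
        if out is not None:
            out.close()
    return paths
```

For example, split_csv('/opt/data/my_data.csv', '/opt/data/my_data', rows_per_file=1_000_000) would write part files of up to one million rows each into '/opt/data/my_data'.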
Please also refer to the Create DataFrames section of the official Dask documentation for further details on how to use Dask DataFrame.
Define Search Space
When running an experiment in a Dask environment, the Transformers and Estimators used in the search space need to support Dask data types. Users can define a new search space based on HyperGBM's default search space for Dask.
Example code:
from dask import dataframe as dd
from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment
from hypergbm.dask.search_space import search_space_general
from hypernets.tabular.datasets import dsutils


def my_search_space():
    return search_space_general(n_estimators=100)


def train():
    cluster = LocalCluster(processes=False)
    client = Client(cluster)

    train_data = dd.from_pandas(dsutils.load_blood(), npartitions=1)

    experiment = make_experiment(train_data, target='Class', searcher='mcts', search_space=my_search_space)
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()
How-To
How to install shap on centos7?
Install system dependencies
yum install epel-release centos-release-scl -y && yum clean all && yum makecache  # llvm9.0 is in epel, gcc9 in scl
yum install -y llvm9.0 llvm9.0-devel python36-devel devtoolset-9-gcc devtoolset-9-gcc-c++ make cmake
Configure the installation environment
whereis llvm-config-9.0-64  # find your `llvm-config` path
# llvm-config-9: /usr/bin/llvm-config-9.0-64
export LLVM_CONFIG=/usr/bin/llvm-config-9.0-64  # set to your path
scl enable devtoolset-9 bash
Install shap
pip3 -v install numpy==1.19.1  # prepare shap dependency
pip3 -v install scikit-learn==0.23.1  # prepare shap dependency
pip3 -v install shap==0.28.5
Release Notes
Release history:
Version 0.2.3
We add the following new features to this version:
Data cleaning
Support automatically recognizing categorical columns among features with numerical datatypes
Support performing data cleaning with several specific columns reserved
Feature generation
Support datetime, text, and latitude and longitude features
Support distributed training
Modelling algorithms
XGBoost: Change distributed training from dask_xgboost to xgboost.dask to be consistent with the official XGBoost website
LightGBM: Support distributed training on more machines
Model training
Support reproducing the searching process
Support searching with low fidelity
Predicting learning curves based on statistical information
Support hyperparameter optimizing without making modification
Time limit of EarlyStopping is now adjusted to the whole experiment life-cycle
Support defining pos_label
eval-set supports Dask dataset for distributed training
Optimizing the cache strategy for model training
Search algorithms
Add GridSearch algorithm
Add Playback algorithm
Advanced Features
Add feature selection with various strategies for the first stage
Feature selection for the second stage now supports more strategies
Pseudo-label supports various data selection strategies and multi-class classification
Optimizing performance of concept drift handling
Add cache mechanism during processing of advanced features
Visualization
Experiment information visualization
Training process visualization
Command Line tool
Most features of experiments for model training are now supported by command line tools
Support model evaluating
Support model predicting
Version 0.2.2
We add the following new features to this version:
Feature engineering
Feature generating
Feature dimension reduction
Data cleaning
Missing characters handling
Column types correction
Constant columns cleaning
Duplicate columns cleaning
Deleting examples with missing targets
Replacing invalid values
id columns cleaning
Dataset splitting
Adversarial validation
Modelling algorithms
XGBoost
CatBoost
LightGBM
HistGradientBoosting
Model training
Automatic task inferencing
Command line tools
Evaluation methods
Cross-Validation
Train-Validation-Holdout
Search Algorithms
Monte-Carlo Tree search
Evolution algorithms
Random search
Imbalanced data handling
Class Weight
Under-sampling
Near miss
Tomek links
Random
Over-sampling
SMOTE
ADASYN
Random
Early-stopping strategy
stopping after n times searching without improving
stopping after using a maximal time
stopping after achieving expected performance
Advanced Features
Two-stage search
Pseudo-label
Feature selection
Concept drift handling
Model ensemble