DataCanvas¶
HyperGBM is an open source project created by DataCanvas.
Overview¶
What is HyperGBM¶
HyperGBM is a library that supports full-pipeline AutoML, which completely covers the end-to-end stages of data cleaning, preprocessing, feature generation and selection, model selection and hyperparameter optimization. It is a real AutoML tool for tabular data.
Unlike most AutoML approaches, which focus on the hyperparameter optimization problem of machine learning algorithms, HyperGBM can put the entire process from data cleaning to algorithm selection in one search space for optimization. End-to-end pipeline optimization is more like a sequential decision process, so HyperGBM uses reinforcement learning, Monte Carlo Tree Search and evolutionary algorithms, combined with a meta-learner, to solve such problems efficiently. As the name implies, the ML algorithms used in HyperGBM are all GBM models, more precisely gradient boosting tree models, which currently include XGBoost, LightGBM and Catboost. The underlying search space representation and search algorithms in HyperGBM are powered by the Hypernets project, a general AutoML framework.
Main components¶
In this section, we briefly cover the main components of HyperGBM, as shown below:

HyperGBM (HyperModel)
HyperGBM is a specific implementation of HyperModel (for HyperModel, please refer to the Hypernets project). It is the core interface of the HyperGBM project. Call its search method to explore the specified Search Space with the specified Searcher and return the best model.
Search Space
Search spaces are constructed by arranging ModelSpace (transformers and estimators), ConnectionSpace (pipelines) and ParameterSpace (hyperparameters). Transformers are chained together by pipelines, and pipelines can be nested. The last node of a search space must be an estimator. Each transformer and estimator can define a set of hyperparameters.

The code example of Numeric Pipeline is as follows:
import numpy as np
from hypergbm.pipeline import Pipeline
from hypergbm.sklearn.transformers import SimpleImputer, StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, LogStandardScaler
from hypernets.core.ops import ModuleChoice, Optional, Choice
from tabular_toolbox.column_selector import column_number_exclude_timedelta


def numeric_pipeline_complex(impute_strategy=None, seq_no=0):
    # Search over all imputation strategies unless the caller restricts them.
    if impute_strategy is None:
        impute_strategy = Choice(['mean', 'median', 'constant', 'most_frequent'])
    elif isinstance(impute_strategy, list):
        impute_strategy = Choice(impute_strategy)
    imputer = SimpleImputer(missing_values=np.nan, strategy=impute_strategy, name=f'numeric_imputer_{seq_no}',
                            force_output_as_float=True)
    # Exactly one of these scalers is chosen per sample of the search space.
    scaler_options = ModuleChoice(
        [
            LogStandardScaler(name=f'numeric_log_standard_scaler_{seq_no}'),
            StandardScaler(name=f'numeric_standard_scaler_{seq_no}'),
            MinMaxScaler(name=f'numeric_minmax_scaler_{seq_no}'),
            MaxAbsScaler(name=f'numeric_maxabs_scaler_{seq_no}'),
            RobustScaler(name=f'numeric_robust_scaler_{seq_no}')
        ], name=f'numeric_or_scaler_{seq_no}'
    )
    # The scaling step as a whole is optional.
    scaler_optional = Optional(scaler_options, keep_link=True, name=f'numeric_scaler_optional_{seq_no}')
    pipeline = Pipeline([imputer, scaler_optional],
                        name=f'numeric_pipeline_complex_{seq_no}',
                        columns=column_number_exclude_timedelta)
    return pipeline
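To make the idea of sampling from a search space more concrete, here is a minimal sketch in plain Python. It does not use the Hypernets API: `sample_numeric_pipeline`, `IMPUTE_STRATEGIES` and `SCALERS` are hypothetical stand-ins for Choice, ModuleChoice and Optional, and the 50% chance of skipping the scaler is an assumption made only for illustration.

```python
import random

# Hypothetical stand-ins for the options declared in numeric_pipeline_complex.
IMPUTE_STRATEGIES = ['mean', 'median', 'constant', 'most_frequent']
SCALERS = ['log_standard', 'standard', 'minmax', 'maxabs', 'robust']


def sample_numeric_pipeline(rng):
    """Draw one concrete pipeline configuration from the toy search space."""
    steps = [('imputer', rng.choice(IMPUTE_STRATEGIES))]   # plays the role of Choice
    if rng.random() < 0.5:                                  # plays the role of Optional
        steps.append(('scaler', rng.choice(SCALERS)))       # plays the role of ModuleChoice
    return steps


rng = random.Random(0)
samples = [sample_numeric_pipeline(rng) for _ in range(5)]
for steps in samples:
    print(steps)
```

Each printed list is one point of the search space; a searcher decides which of these points to evaluate next.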
Searcher
Searcher is an algorithm used to explore a search space. It encompasses the classical exploration-exploitation trade-off: on the one hand, it is desirable to find well-performing models quickly, while on the other hand, premature convergence to a region of suboptimal solutions should be avoided. Three algorithms are provided in HyperGBM: MCTSSearcher (Monte-Carlo tree search), EvolutionSearcher and RandomSearcher.
HyperGBMEstimator
HyperGBMEstimator is an object built from a sample in the search space, including the full preprocessing pipeline and a GBM model. It can be used to fit on training data, evaluate on evaluation data, and predict on new data.
CompeteExperiment
CompeteExperiment is a powerful tool provided in HyperGBM. It not only performs pipeline search, but also contains advanced features that further improve model performance, such as data drift handling, pseudo-labeling and ensembling.
Feature matrix¶
There are 3 training modes:
Standalone
Distributed on single machine
Distributed on multiple machines
Here is the feature matrix of the training modes:
# | Feature | Standalone | Distributed on single machine | Distributed on multiple machines |
---|---|---|---|---|
Feature engineering | Feature generation; Feature dimension reduction | √ √ | √ | √ |
Data cleaning | Correct data type; Special empty value handling; Id-ness features cleanup; Duplicate features cleanup; Empty label rows cleanup; Illegal values replacement; Constant features cleanup; Collinearity features cleanup | √ √ √ √ √ √ √ √ | √ √ √ √ √ √ √ √ | √ √ √ √ √ √ √ √ |
Data set split | Adversarial validation | √ | √ | √ |
Modeling algorithms | XGBoost; Catboost; LightGBM; HistGradientBoosting | √ √ √ √ | √ √ √ | √ |
Training | Task inference; Command-line tools | √ √ | √ | √ |
Evaluation strategies | Cross-validation; Train-Validation-Holdout | √ √ | √ √ | √ √ |
Search strategies | Monte Carlo Tree Search; Evolution; Random search | √ √ √ | √ √ √ | √ √ √ |
Class balancing | Class Weight; Under-sampling (Near miss, Tomek links, Random); Over-sampling (SMOTE, ADASYN, Random) | √ √ √ | √ | |
Early stopping strategies | max_no_improvement_trials; time_limit; expected_reward | √ √ √ | √ √ √ | √ √ √ |
Advanced features | Two-stage search (Pseudo label, Feature selection); Concept drift handling; Ensemble | √ √ √ | √ √ √ | √ √ √ |
Installation¶
You can use pip or docker to install HyperGBM.
Using pip¶
HyperGBM requires Python 3.6 or above. Use pip to install it:
pip install --upgrade pip setuptools # (optional)
pip install hypergbm
Install shap (optional)
HyperGBM provides model interpretation based on shap. If you need it, install shap by following this guide.
Using Docker¶
You can also use HyperGBM through our built-in Jupyter Docker image with the command:
docker run -ti -e NotebookToken="your-token" -p 8888:8888 datacanvas/hypergbm:0.2.0
Then open http://<your-ip>:8888 in a browser. The token is the value you set for NotebookToken ("your-token" above). If you do not specify one, the token is empty.
Quick Start¶
The purpose of this guide is to illustrate some of the main features that hypergbm provides. It assumes a basic working knowledge of machine learning practices (dataset, model fitting, predicting, cross-validation, etc.). Please refer to the installation instructions for installing hypergbm. You can use hypergbm through the Python API and command-line tools.
This section will show you how to train a binary classification model using hypergbm.
You can use the tabular_toolbox utility to read the Bank Marketing dataset:
>>> from tabular_toolbox.datasets import dsutils
>>> df = dsutils.load_bank()
>>> df[:3]
id age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
Training with make_experiment¶
Firstly, we load and split the data into training set and test set to train and evaluate the model:
>>> from sklearn.model_selection import train_test_split
>>> from tabular_toolbox.datasets import dsutils
>>> df = dsutils.load_bank()
>>> train_data,test_data = train_test_split(df, test_size=0.3, random_state=9527)
Then, create an experiment instance and run it:
>>> from hypergbm import make_experiment
>>> experiment=make_experiment(train_data,target='y')
>>> pipeline=experiment.run(max_trials=10)
>>> pipeline
Pipeline(steps=[('data_clean',
DataCleanStep(cv=True, data_cleaner_args={}, name='data_clean', random_state=9527)),
('estimator',
GreedyEnsemble(weight=[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]))])
The pipeline is the trained model.
We can use sklearn metrics to evaluate the trained model:
>>> from sklearn import metrics
>>> X_test=test_data.copy()
>>> y_test=X_test.pop('y')
>>> y_proba = pipeline.predict_proba(X_test)
>>> metrics.roc_auc_score(y_test, y_proba[:, 1])
0.9659882829799505
Training with CompeteExperiment¶
Load and split the data into training set and test set to train and evaluate the model:
>>> from sklearn.model_selection import train_test_split
>>> from tabular_toolbox.datasets import dsutils
>>> df = dsutils.load_bank()
>>> y = df.pop('y') # target col is "y"
>>> X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=9527)
HyperGBM provides a variety of search strategies. Here, the random search strategy is used to train in the built-in search space:
>>> from hypernets.searchers import RandomSearcher
>>> from hypernets.core import OptimizeDirection
>>> from hypergbm.search_space import search_space_general
>>> rs = RandomSearcher(space_fn=search_space_general,
... optimize_direction=OptimizeDirection.Maximize)
>>> rs
<hypernets.searchers.random_searcher.RandomSearcher object at 0x10e5b9850>
The space_fn parameter specifies the search space. The AUC metric is used here, so optimize_direction=OptimizeDirection.Maximize is set to indicate that larger metric values are better.
Then use the Experiment API to train the model:
>>> from hypergbm import HyperGBM, CompeteExperiment
>>> hk = HyperGBM(rs, reward_metric='auc', cache_dir=f'hypergbm_cache', callbacks=[])
>>> experiment = CompeteExperiment(hk, X_train, y_train, X_test=X_test)
19:19:31 I hypergbm.experiment.py 714 - create experiment with ['data_clean', 'drift_detected', 'base_search_and_train']
>>> pipeline = experiment.run(use_cache=True, max_trials=2)
...
Trial No. Reward Elapsed Space Vector
0 1 0.994731 8.490173 [2, 0, 1, 2, 0, 0, 0]
1 2 0.983054 4.980630 [1, 2, 1, 2, 215, 3, 0, 0, 4, 3]
>>> pipeline
Pipeline(steps=[('data_clean',
DataCleanStep(cv=True, data_cleaner_args={}, name='data_clean', random_state=9527)),
('estimator', GreedyEnsemble(weight=[1. 0.]))])
After the training experiment, let’s evaluate the model:
>>> from sklearn import metrics
>>> y_proba = pipeline.predict_proba(X_test)
>>> metrics.roc_auc_score(y_test, y_proba[:, 1])
0.9956872713648863
Training with command-line tools¶
HyperGBM also provides command-line tools to train models and predict data. To view the help documentation:
hypergbm -h
usage: hypergbm [-h] --train_file TRAIN_FILE [--eval_file EVAL_FILE]
[--eval_size EVAL_SIZE] [--test_file TEST_FILE] --target
TARGET [--pos_label POS_LABEL] [--max_trials MAX_TRIALS]
[--model_output MODEL_OUTPUT]
[--prediction_output PREDICTION_OUTPUT] [--searcher SEARCHER]
...
Similarly, taking Bank Marketing training as an example, we first split the dataset into a training set and a test set and generate the CSV files for the command-line tools:
>>> from tabular_toolbox.datasets import dsutils
>>> from sklearn.model_selection import train_test_split
>>> df = dsutils.load_bank()
>>> df_train, df_test = train_test_split(df, test_size=0.3, random_state=9527)
>>> df_train.to_csv('bank_train.csv', index=None)
>>> df_test.to_csv('bank_test.csv', index=None)
Pass the generated CSV files as parameters of the training command, then execute it:
hypergbm --train_file=bank_train.csv --test_file=bank_test.csv --target=y --pos_label=yes --model_output=model.pkl --prediction_output=bank_predict.csv
...
Trial No. Reward Elapsed Space Vector
0 10 1.000000 64.206514 [0, 0, 1, 3, 2, 1, 2, 1, 2, 2, 3]
1 7 0.999990 2.433192 [1, 1, 1, 2, 215, 0, 2, 3, 0, 4]
2 4 0.999950 37.057761 [0, 3, 1, 0, 2, 1, 3, 1, 3, 4, 3]
3 9 0.967292 9.977973 [1, 0, 1, 1, 485, 2, 2, 5, 3, 0]
4 1 0.965844 4.304114 [1, 2, 1, 1, 60, 2, 2, 5, 0, 1]
After training, the model is persisted to the file model.pkl and the prediction results are saved to bank_predict.csv.
HyperGBM¶
HyperGBM is a specific implementation of HyperModel (for HyperModel, please refer to the Hypernets project).
It is the core interface of the HyperGBM project. Call its search method to explore the specified Search Space with the specified Searcher and return the best model.
Required Parameters
searcher: hypernets.searcher.Searcher, A Searcher instance.
hypernets.searchers.RandomSearcher
hypernets.searchers.MCTSSearcher
hypernets.searchers.EvolutionSearcher
Optional Parameters
dispatcher: hypernets.core.Dispatcher, Dispatcher is used to provide different execution modes for search trials, such as in-process mode (InProcessDispatcher), distributed parallel mode (DaskDispatcher), etc. InProcessDispatcher is used by default.
callbacks: list of callback functions or None, optional (default=None), List of callback functions that are applied at each trial. See hypernets.callbacks for more information.
reward_metric: str or None, optional (default=accuracy), Set the metric according to the task type to guide the search direction of the searcher.
task: str or None, optional (default=None), Task type (binary, multiclass or regression). If None, the task type is inferred automatically.
data_cleaner_params: dict, (default=None), Dictionary of parameters to initialize the DataCleaner instance. If None, DataCleaner will be initialized with default values.
cache_dir: str or None, (default=None), Path of the data cache. If None, ‘working directory/tmp/cache’ is used as the cache dir.
clear_cache: bool, (default=True), Whether to clear the cache dir before searching.
search¶
Required Parameters
X: Pandas or Dask DataFrame, feature data for training
y: Pandas or Dask Series, target values for training
X_eval: (Pandas or Dask DataFrame) or None, feature data for evaluation
y_eval: (Pandas or Dask Series) or None, target values for evaluation
Optional Parameters
cv: bool, (default=False), If True, use cross-validation instead of evaluation set reward to guide the search process
num_folds: int, (default=3), Number of cross-validated folds, only valid when cv is true
max_trials: int, (default=10), The upper limit of the number of search trials, the search process stops when the number is exceeded
**fit_kwargs: dict, parameters for fit method of model
Use case¶
# import HyperGBM, Search Space and Searcher
from hypergbm import HyperGBM
from hypergbm.search_space import search_space_general
from hypernets.searchers.random_searcher import RandomSearcher
import pandas as pd
from sklearn.model_selection import train_test_split
# instantiate related objects
searcher = RandomSearcher(search_space_general, optimize_direction='max')
hypergbm = HyperGBM(searcher, task='binary', reward_metric='accuracy')
# load data into Pandas DataFrame
df = pd.read_csv('[train_data_file]')
y = df.pop('target')
# split data into train set and eval set
# The evaluation set is used to evaluate the reward of the model fitted with the training set
X_train, X_eval, y_train, y_eval = train_test_split(df, y, test_size=0.3)
# search
hypergbm.search(X_train, y_train, X_eval, y_eval, max_trials=30)
# load best model
best_trial = hypergbm.get_best_trial()
estimator = hypergbm.load_estimator(best_trial.model_file)
# predict on real data
pred = estimator.predict(X_real)
Searchers¶
MCTSSearcher¶
Monte-Carlo Tree Search (MCTS) extends the celebrated Multi-armed Bandit algorithm to tree-structured search spaces. The MCTS algorithm iterates over four phases: selection, expansion, playout and backpropagation.
Selection: In each node of the tree, the child node is selected according to a Multi-armed Bandit strategy, e.g. the UCT (Upper Confidence bound applied to Trees) algorithm.
Expansion: The algorithm adds one or more nodes to the tree. This node corresponds to the first encountered position that was not added in the tree.
Playout: When reaching the limits of the visited tree, a roll-out strategy is used to select the options until reaching a terminal node and computing the associated reward.
Backpropagation: The reward value is propagated back, i.e. it is used to update the value associated to all nodes along the visited path up to the root node.
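The selection and backpropagation phases above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard UCT formula, not the Hypernets implementation: the dictionary-based nodes, the function names and the exploration constant `c = sqrt(2)` are assumptions made for the example.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCT value of a child: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    mean_reward = total_reward / visits
    return mean_reward + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits):
    """Selection phase: pick the child with the highest UCT score."""
    return max(children,
               key=lambda ch: uct_score(ch['reward'], ch['visits'], parent_visits))


def backpropagate(path, reward):
    """Backpropagation phase: update every node on the visited path."""
    for node in path:
        node['visits'] += 1
        node['reward'] += reward


# Hypothetical children of one tree node (accumulated reward, visit count).
children = [
    {'name': 'a', 'reward': 2.4, 'visits': 3},
    {'name': 'b', 'reward': 0.9, 'visits': 1},
    {'name': 'c', 'reward': 0.0, 'visits': 0},
]
first = select_child(children, parent_visits=4)
backpropagate([first], reward=0.5)
second = select_child(children, parent_visits=5)
print(first['name'], second['name'])
```

The unvisited child is selected first. After its reward is propagated back, the rarely visited child `b` wins over the well-explored child `a`, which shows how the bonus term keeps the search from committing too early.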
Code example
from hypernets.searchers import MCTSSearcher
searcher = MCTSSearcher(search_space_fn, use_meta_learner=False, max_node_space=10, candidates_size=10, optimize_direction='max')
Required Parameters
space_fn: callable, A search space function which when called returns a HyperSpace instance.
Optional Parameters
policy: hypernets.searchers.mcts_core.BasePolicy, (default=None), The policy for Selection and Backpropagation phases, UCT by default.
max_node_space: int, (default=10), Maximum space for node expansion
use_meta_learner: bool, (default=True), Meta-learner aims to evaluate the performance of unseen samples based on previously evaluated samples. It provides a practical solution to accurately estimate a search branch with many simulations without involving the actual training.
candidates_size: int, (default=10), The number of samples for the meta-learner to evaluate candidate paths when roll out
optimize_direction: ‘min’ or ‘max’, (default=’min’), Whether the search process is approaching the maximum or minimum reward value.
space_sample_validation_fn: callable or None, (default=None), Used to verify the validity of samples from the search space, and can be used to add specific constraint rules to the search space to reduce the size of the space.
EvolutionSearcher¶
Evolutionary algorithm (EA) is a subset of evolutionary computation, a generic population-based metaheuristic optimization algorithm. An EA uses mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the quality of the solutions (see also loss function). Evolution of the population then takes place after the repeated application of the above operators.
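The loop described above can be sketched as a toy optimizer in plain Python. This is not the Hypernets EvolutionSearcher: the `evolve` function and its `fitness`, `mutate` and `init` callbacks are hypothetical, and the `regularized` flag is assumed here to mean aging evolution (removing the oldest individual instead of the worst one), which the text does not spell out.

```python
import random


def evolve(fitness, mutate, init, population_size=20, sample_size=5,
           cycles=200, regularized=False, seed=0):
    """Toy evolutionary search that maximizes `fitness`."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(cycles):
        # Selection: sample candidate parents and keep the fittest one.
        parent = max(rng.sample(population, sample_size), key=fitness)
        # Mutation: derive a child from the selected parent.
        child = mutate(parent, rng)
        population.append(child)
        if regularized:
            population.pop(0)  # aging: drop the oldest individual
        else:
            population.remove(min(population, key=fitness))  # drop the worst
        best = max(best, child, key=fitness)
    return best


def fitness(x):
    # Toy objective with its maximum at x = 3.
    return -(x - 3.0) ** 2


best = evolve(fitness,
              mutate=lambda x, rng: x + rng.gauss(0, 0.5),
              init=lambda rng: rng.uniform(-10, 10))
print(round(best, 2))
```

In a real search, an individual would be a sample of the search space and the fitness would be the reward of the trained model.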
Code example
from hypernets.searchers import EvolutionSearcher
searcher = EvolutionSearcher(search_space_fn, population_size=20, sample_size=5, optimize_direction='min')
Required Parameters
space_fn: callable, A search space function which when called returns a HyperSpace instance
population_size: int, Size of population
sample_size: int, The number of parent candidates selected in each cycle of evolution
Optional Parameters
regularized: bool, (default=False), Whether to enable regularized evolution
use_meta_learner: bool, (default=True), Meta-learner aims to evaluate the performance of unseen samples based on previously evaluated samples. It provides a practical solution to accurately estimate a search branch with many simulations without involving the actual training.
candidates_size: int, (default=10), The number of samples for the meta-learner to evaluate candidate paths when roll out
optimize_direction: ‘min’ or ‘max’, (default=’min’), Whether the search process is approaching the maximum or minimum reward value.
space_sample_validation_fn: callable or None, (default=None), Used to verify the validity of samples from the search space, and can be used to add specific constraint rules to the search space to reduce the size of the space.
RandomSearcher¶
As its name suggests, Random Search uses random combinations of hyperparameters.
Code example
from hypernets.searchers import RandomSearcher
searcher = RandomSearcher(search_space_fn, optimize_direction='min')
Required Parameters
space_fn: callable, A search space function which when called returns a
HyperSpace
instance
Optional Parameters
optimize_direction: ‘min’ or ‘max’, (default=’min’), Whether the search process is approaching the maximum or minimum reward value.
space_sample_validation_fn: callable or None, (default=None), Used to verify the validity of samples from the search space, and can be used to add specific constraint rules to the search space to reduce the size of the space.
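The following sketch shows, in plain Python, how random search and a sample validation function can interact. It is an illustration under stated assumptions, not the Hypernets RandomSearcher: `random_search`, `sample_fn`, `is_valid`, `toy_loss` and the `max_draws` safety cap are all hypothetical names introduced for the example.

```python
import random


def random_search(sample_fn, reward_fn, max_trials, validation_fn=None,
                  optimize_direction='min', max_draws=10_000, seed=0):
    """Evaluate randomly drawn samples and return the best (sample, reward)."""
    rng = random.Random(seed)
    sign = -1 if optimize_direction == 'min' else 1
    best_sample, best_reward = None, None
    trials = 0
    for _ in range(max_draws):
        if trials >= max_trials:
            break
        sample = sample_fn(rng)
        # Rejected samples are skipped and do not count as trials.
        if validation_fn is not None and not validation_fn(sample):
            continue
        trials += 1
        reward = reward_fn(sample)
        if best_reward is None or sign * reward > sign * best_reward:
            best_sample, best_reward = sample, reward
    return best_sample, best_reward


def sample_fn(rng):
    return {'max_depth': rng.randint(1, 10),
            'learning_rate': rng.choice([0.01, 0.1, 0.3])}


def is_valid(sample):
    # Example constraint: no deep trees combined with a high learning rate.
    return not (sample['max_depth'] > 6 and sample['learning_rate'] > 0.1)


def toy_loss(sample):
    # Stand-in for training a model and measuring its loss.
    return abs(sample['max_depth'] - 8) + abs(sample['learning_rate'] - 0.3)


best_sample, best_reward = random_search(sample_fn, toy_loss, max_trials=50,
                                         validation_fn=is_valid,
                                         optimize_direction='min')
print(best_sample, best_reward)
```

Because the constraint removes part of the space before any training happens, the trial budget is spent only on admissible configurations.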
Search Space¶
Built-in Search Space¶
Code example
from hypergbm.search_space import search_space_general
from hypernets.searchers import RandomSearcher
searcher = RandomSearcher(search_space_general, optimize_direction='min')
# or
searcher = RandomSearcher(lambda: search_space_general(n_estimators=300, early_stopping_rounds=10, verbose=0), optimize_direction='min')
CompeteExperiment¶
There are still many challenges in the machine learning modeling process for tabular data, such as imbalanced data, data drift, poor generalization ability, etc. These challenges cannot be completely solved by pipeline search, so HyperGBM introduces a more powerful tool: CompeteExperiment.
CompeteExperiment is composed of a series of steps, and pipeline search is just one of them. It also includes advanced steps such as data cleaning, data drift handling, two-stage search and ensembling, as shown in the figure below:
Code example
from hypergbm import make_experiment
from hypergbm.search_space import search_space_general
import pandas as pd
import logging
# load data into Pandas DataFrame
df = pd.read_csv('[train_data_file]')
target = 'target'
# create an experiment
experiment = make_experiment(df, target=target,
                             search_space=lambda: search_space_general(class_balancing='SMOTE', n_estimators=300, early_stopping_rounds=10, verbose=0),
                             collinearity_detection=False,
                             drift_detection=True,
                             feature_reselection=False,
                             feature_reselection_estimator_size=10,
                             feature_reselection_threshold=1e-5,
                             ensemble_size=20,
                             pseudo_labeling=False,
                             pseudo_labeling_proba_threshold=0.8,
                             pseudo_labeling_resplit=False,
                             retrain_on_wholedata=False,
                             log_level=logging.ERROR)
# run experiment
estimator = experiment.run()
# predict on real data (X_real is a placeholder for new feature data)
pred = estimator.predict(X_real)
Required Parameters
hyper_model: hypergbm.HyperGBM, A HyperGBM instance
X_train: Pandas or Dask DataFrame, Feature data for training
y_train: Pandas or Dask Series, Target values for training
Optional Parameters
X_eval: (Pandas or Dask DataFrame) or None, (default=None), Feature data for evaluation
y_eval: (Pandas or Dask Series) or None, (default=None), Target values for evaluation
X_test: (Pandas or Dask DataFrame) or None, (default=None), Unseen data without target values for semi-supervised learning
eval_size: float or int, (default=None), Only valid when X_eval or y_eval is None. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the eval split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size.
train_test_split_strategy: ‘adversarial_validation’ or None, (default=None), Only valid when X_eval or y_eval is None. If None, use eval_size to split the dataset, otherwise use the adversarial validation approach.
cv: bool, (default=False), If True, use cross-validation instead of evaluation set reward to guide the search process
num_folds: int, (default=3), Number of cross-validated folds, only valid when cv is true
task: str or None, optional (default=None), Task type (binary, multiclass or regression). If None, the task type is inferred automatically.
callbacks: list of callback functions or None, (default=None), List of callback functions that are applied at each experiment step. See hypernets.experiment.ExperimentCallback for more information.
random_state: int or RandomState instance, (default=9527), Controls the shuffling applied to the data before applying the split.
scorer: str, callable or None, (default=None), Scorer used for feature importance evaluation and ensemble. It can be a single string (see get_scorer) or a callable (see make_scorer). If None, an exception will occur.
data_cleaner_args: dict, (default=None), Dictionary of parameters to initialize the DataCleaner instance. If None, DataCleaner will be initialized with default values.
collinearity_detection: bool, (default=False), Whether to clear multicollinearity features
drift_detection: bool,(default=True), Whether to enable data drift detection and processing. Only valid when X_test is provided. Concept drift in the input data is one of the main challenges. Over time, it will worsen the performance of model on new data. We introduce an adversarial validation approach to concept drift problems in HyperGBM. This approach will detect concept drift and identify the drifted features and process them automatically.
feature_reselection: bool, (default=True), Whether to enable two stage feature selection and searching
feature_reselection_estimator_size: int, (default=10), The number of estimators used to evaluate feature importance. Only valid when feature_reselection is True.
feature_reselection_threshold: float, (default=1e-5), The threshold for feature selection. Features with importance below the threshold will be dropped. Only valid when feature_reselection is True.
ensemble_size: int, (default=20), The number of estimators to ensemble. During the AutoML process, a lot of models will be generated with different preprocessing pipelines, different models, and different hyperparameters. Usually, ensembling some of the well-performing models gives better generalization ability than selecting only the single best model.
pseudo_labeling: bool, (default=False), Whether to enable pseudo labeling. Pseudo labeling is a semi-supervised learning technique: instead of manually labeling the unlabelled data, we give approximate labels on the basis of the labelled data. Pseudo-labeling can sometimes improve the generalization capabilities of the model.
pseudo_labeling_proba_threshold: float, (default=0.8), Confidence threshold of pseudo-label samples. Only valid when pseudo_labeling is True.
pseudo_labeling_resplit: bool, (default=False), Whether to re-split the training set and evaluation set after adding pseudo-labeled data. If False, the pseudo-labeled data is only appended to the training set. Only valid when pseudo_labeling is True.
retrain_on_wholedata: bool, (default=False), Whether to retrain the model with whole data after the search is completed.
log_level: int or None, (default=None), Level of logging, possible values:[logging.CRITICAL, logging.FATAL, logging.ERROR, logging.WARNING, logging.WARN, logging.INFO, logging.DEBUG, logging.NOTSET]
Examples¶
Experiment Examples¶
Basic Usages¶
In this chapter we’ll show how to train models with a HyperGBM experiment. We’ll use the blood dataset in the following examples; Class is the target feature.
Recency,Frequency,Monetary,Time,Class
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
2,20,5000,45,1
1,24,6000,77,0
4,4,1000,4,0
...
Use experiment with default settings¶
Users can create an experiment instance with the Python tool make_experiment and run it quickly. train_data is the only required parameter; all others are optional. The target parameter is also required if your target feature name isn’t y.
Code:
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class')
estimator = experiment.run()
print(estimator)
Outputs:
Pipeline(steps=[('data_clean',
DataCleanStep(...),
('estimator',
GreedyEnsemble(...)])
Process finished with exit code 0
As the console output shows, the trained model is a pipeline object whose estimator is an ensemble of several other models.
If your training data is a .csv or .parquet file, you can call make_experiment with the file path directly, like this:
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
train_data = '/path/to/mydata.csv'
experiment = make_experiment(train_data, target='my_target')
estimator = experiment.run()
print(estimator)
Cross Validation¶
make_experiment enables cross validation by default. You can disable it by setting cv=False, and change the number of folds with num_folds, like this:
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=True, num_folds=5)
estimator = experiment.run()
print(estimator)
Setup evaluate data (eval_data)¶
If cross validation is disabled, the experiment splits evaluation data from train_data by default. You can provide your own evaluation data with eval_data, like this:
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
from sklearn.model_selection import train_test_split
train_data = dsutils.load_blood()
train_data,eval_data=train_test_split(train_data,test_size=0.3)
experiment = make_experiment(train_data, target='Class', eval_data=eval_data, cv=False)
estimator = experiment.run()
print(estimator)
If eval_data is None and cv is False, the experiment splits evaluation data from train_data. You can change the evaluation data size with eval_size, like this:
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=False, eval_size=0.2)
estimator = experiment.run()
print(estimator)
Setup search reward metric¶
The default search reward metric is accuracy. You can change it with reward_metric, like this:
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', reward_metric='auc')
estimator = experiment.run()
print(estimator)
Change search trial number and setup early stopping¶
You can limit the number of search trials with max_trials, and set up early stopping with early_stopping_round, early_stopping_time_limit and early_stopping_reward, like this:
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', max_trials=30, early_stopping_time_limit=3600 * 3)
estimator = experiment.run()
print(estimator)
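The three early stopping strategies listed in the feature matrix (max_no_improvement_trials, time_limit and expected_reward) can be illustrated with a small stand-alone monitor. This is a sketch of the general idea only; the `EarlyStopping` class below is hypothetical and is not the callback that HyperGBM uses internally.

```python
import time


class EarlyStopping:
    """Toy early-stopping monitor for a sequence of trial rewards."""

    def __init__(self, max_no_improvement_trials=0, time_limit=None,
                 expected_reward=None, mode='max'):
        self.max_no_improvement_trials = max_no_improvement_trials  # 0 disables
        self.time_limit = time_limit            # seconds, None disables
        self.expected_reward = expected_reward  # None disables
        self.sign = 1 if mode == 'max' else -1
        self.best = None
        self.stale_trials = 0
        self.start = time.monotonic()

    def should_stop(self, reward):
        """Record one trial reward and report whether to stop searching."""
        score = self.sign * reward
        if self.best is None or score > self.best:
            self.best, self.stale_trials = score, 0
        else:
            self.stale_trials += 1
        if self.expected_reward is not None and score >= self.sign * self.expected_reward:
            return True  # the target reward has been reached
        if self.time_limit is not None and time.monotonic() - self.start >= self.time_limit:
            return True  # the time budget is used up
        # Stop after too many consecutive trials without improvement.
        return 0 < self.max_no_improvement_trials <= self.stale_trials


rewards = [0.70, 0.74, 0.73, 0.74, 0.72, 0.71, 0.80]
stopper = EarlyStopping(max_no_improvement_trials=3)
stopped_at = None
for trial_no, reward in enumerate(rewards, start=1):
    if stopper.should_stop(reward):
        stopped_at = trial_no
        break
print(stopped_at)
```

With a patience of three trials, the search stops at trial 5 and never reaches the better reward at trial 7, which is the usual cost of stopping early.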
Drift detection¶
To enable feature drift detection, set drift_detection=True and pass the testing data as test_data, like this:
from io import StringIO
import pandas as pd
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
test_data = """
Recency,Frequency,Monetary,Time
2,10,2500,64
4,5,1250,23
4,9,2250,46
4,5,1250,23
4,8,2000,40
2,12,3000,82
11,24,6000,64
2,7,1750,46
4,11,2750,61
1,7,1750,57
2,11,2750,79
2,3,750,16
4,5,1250,26
2,6,1500,41
"""
train_data = dsutils.load_blood()
test_df = pd.read_csv(StringIO(test_data))
experiment = make_experiment(train_data, test_data=test_df, target='Class', drift_detection=True)
estimator = experiment.run()
print(estimator)
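The intuition behind drift detection can be sketched without any library. The example below is a simplified per-feature check, not the classifier-based adversarial validation that HyperGBM describes: it labels training rows 0 and test rows 1, then measures how well each single feature separates the two sets using AUC. The function names and the 0.7 threshold are assumptions chosen for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive scores above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def drifted_features(train_rows, test_rows, names, threshold=0.7):
    """Return the features whose values tell train rows and test rows apart."""
    labels = [0] * len(train_rows) + [1] * len(test_rows)
    drifted = []
    for col, name in enumerate(names):
        values = [row[col] for row in train_rows + test_rows]
        score = auc(values, labels)
        score = max(score, 1.0 - score)  # the direction of the shift is irrelevant
        if score >= threshold:
            drifted.append(name)
    return drifted


# Toy data: feature 'a' has the same distribution in both sets, 'b' has shifted.
train_rows = [(1, 10), (2, 11), (3, 9), (4, 10)]
test_rows = [(2, 20), (3, 21), (1, 19), (4, 22)]
print(drifted_features(train_rows, test_rows, names=['a', 'b']))
```

An AUC near 0.5 means the feature looks the same in both sets, while an AUC near 1.0 means the test data can be recognized from that feature alone.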
Multicollinearity detection¶
To enable multicollinearity detection, set collinearity_detection=True, like this:
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', collinearity_detection=True)
estimator = experiment.run()
print(estimator)
Pseudo labeling¶
To enable pseudo labeling with two-stage searching, set pseudo_labeling=True, like this:
train_data=...
experiment = make_experiment(train_data, pseudo_labeling=True, ...)
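As a rough illustration of what a confidence threshold does in pseudo labeling, the sketch below keeps only the unlabelled rows that a binary model predicts confidently. The `select_pseudo_labels` helper is hypothetical and the rule is an assumption about the general technique, not a copy of HyperGBM's implementation; the default of 0.8 mirrors pseudo_labeling_proba_threshold.

```python
def select_pseudo_labels(probabilities, proba_threshold=0.8):
    """Return (row_index, label) pairs for confidently predicted rows.

    `probabilities` holds the predicted probability of the positive class
    for each unlabelled row.
    """
    selected = []
    for i, p_pos in enumerate(probabilities):
        if p_pos >= proba_threshold:
            selected.append((i, 1))      # confident positive
        elif 1.0 - p_pos >= proba_threshold:
            selected.append((i, 0))      # confident negative
    return selected


test_proba = [0.95, 0.55, 0.10, 0.81, 0.30]
pseudo_labeled = select_pseudo_labels(test_proba)
print(pseudo_labeled)
```

The selected rows would then be appended to the training data for the second search stage, while uncertain rows such as the ones with probabilities 0.55 and 0.30 are left out.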
Permutation importance feature selection¶
To enable feature selection by permutation importance with two-stage searching, set feature_reselection=True, like this:
train_data=...
experiment = make_experiment(train_data, feature_reselection=True, ...)
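Permutation importance itself is easy to demonstrate on toy data. The sketch below shuffles one feature at a time and records the drop in accuracy; features whose importance falls below a threshold are dropped, as described for feature_reselection_threshold. The helper functions and the toy model are hypothetical, and this is a generic version of the technique rather than HyperGBM's code.

```python
import random


def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[col] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with the shuffled column in place.
            shuffled = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label depends only on feature 0; feature 1 is pure noise.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]


def model(row):
    # Stand-in for a fitted estimator.
    return int(row[0] > 0.5)


importances = permutation_importance(model, rows, labels)
threshold = 1e-5
kept = [i for i, imp in enumerate(importances) if imp >= threshold]
print(kept)
```

Shuffling the noise feature does not change any prediction, so its importance is exactly zero and it is removed before the second search stage.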
Ensemble¶
To change the number of estimators in the ensemble, set ensemble_size to the expected number, or set ensemble_size=0 to disable ensembling.
train_data = ...
experiment = make_experiment(train_data, ensemble_size=10, ...)
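The experiment outputs in this document show a GreedyEnsemble with one weight per model. A common way to obtain such weights is greedy forward selection with replacement, sketched below in plain Python. This is an assumption about the general technique, not a description of HyperGBM's GreedyEnsemble internals; `greedy_ensemble` and `neg_mse` are hypothetical names.

```python
def neg_mse(y_true, y_pred):
    """Negative mean squared error, so that larger is better."""
    return -sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def greedy_ensemble(predictions, y_true, ensemble_size, score=neg_mse):
    """Greedy forward selection with replacement; returns per-model weights."""
    counts = [0] * len(predictions)
    total = [0.0] * len(y_true)  # running sum of the selected predictions
    for step in range(1, ensemble_size + 1):
        best_idx, best_score = None, None
        for idx, preds in enumerate(predictions):
            # Score the ensemble as if this model were added next.
            blended = [(t + p) / step for t, p in zip(total, preds)]
            s = score(y_true, blended)
            if best_score is None or s > best_score:
                best_idx, best_score = idx, s
        counts[best_idx] += 1
        total = [t + p for t, p in zip(total, predictions[best_idx])]
    return [c / ensemble_size for c in counts]


# Toy evaluation-set predictions of three models for four rows.
y_true = [1, 0, 1, 0]
predictions = [
    [0.9, 0.2, 0.4, 0.1],   # model 0: weak on the third row
    [0.4, 0.1, 0.9, 0.2],   # model 1: weak on the first row
    [0.5, 0.5, 0.5, 0.5],   # model 2: uninformative
]
weights = greedy_ensemble(predictions, y_true, ensemble_size=2)
print(weights)
```

Models 0 and 1 make complementary mistakes, so averaging them beats either one alone and each receives half of the weight, while the uninformative model gets none.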
Logging settings¶
To change the logging level, set log_level to a log level defined in the Python logging utility.
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', log_level='INFO', verbose=1)
estimator = experiment.run()
print(estimator)
Outputs:
14:24:33 I tabular_toolbox.u._common.py 30 - 2 class detected, {0, 1}, so inferred as a [binary classification] task
14:24:33 I hypergbm.experiment.py 699 - create experiment with ['data_clean', 'drift_detection', 'space_search', 'final_ensemble']
14:24:33 I hypergbm.experiment.py 1262 - make_experiment with train data:(748, 4), test data:None, eval data:None, target:Class
14:24:33 I hypergbm.experiment.py 716 - fit_transform data_clean
14:24:33 I hypergbm.experiment.py 716 - fit_transform drift_detection
14:24:33 I hypergbm.experiment.py 716 - fit_transform space_search
14:24:33 I hypernets.c.meta_learner.py 22 - Initialize Meta Learner: dataset_id:7123e0d8c8bbbac8797ed9e42352dc59
14:24:33 I hypernets.c.callbacks.py 192 -
Trial No:1
--------------------------------------------------------------
(0) estimator_options.hp_or: 0
(1) numeric_imputer_0.strategy: most_frequent
(2) numeric_scaler_optional_0.hp_opt: True
...
14:24:35 I hypergbm.experiment.py 716 - fit_transform final_ensemble
14:24:35 I hypergbm.experiment.py 737 - trained experiment pipeline: ['data_clean', 'estimator']
Pipeline(steps=[('data_clean',
DataCleanStep(...),
('estimator',
GreedyEnsemble(...)
Process finished with exit code 0
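For reference, the level names come from Python's standard logging module. The short sketch below uses only the standard library and is independent of HyperGBM. It lists the standard levels and shows that a logger set to INFO drops DEBUG records; the logger name `log_level_demo` is a placeholder.

```python
import logging

# Standard level names in Python's logging module, from most to least verbose
for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"):
    print(name, getattr(logging, name))

# A logger set to INFO ignores DEBUG records and keeps INFO and above
logger = logging.getLogger("log_level_demo")  # placeholder logger name
logger.setLevel(logging.INFO)
print(logger.isEnabledFor(logging.DEBUG))    # False
print(logger.isEnabledFor(logging.WARNING))  # True
```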
Advanced Usages¶
Customize Searcher and Search Space¶
Users can customize the searcher and the search space with the searcher and search_space parameters, like this:
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
from hypergbm.search_space import search_space_general
def my_search_space():
    return search_space_general(n_esitimators=100)
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', searcher='random', search_space=my_search_space)
estimator = experiment.run()
print(estimator)
Or like this:
from hypergbm import make_experiment
from hypergbm.search_space import search_space_general
from hypernets.searchers import MCTSSearcher
from tabular_toolbox.datasets import dsutils
my_searcher = MCTSSearcher(lambda: search_space_general(n_esitimators=100),
max_node_space=20,
optimize_direction='max')
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', searcher=my_searcher)
estimator = experiment.run()
print(estimator)
Use CompeteExperiment¶
Users can also create an experiment directly with the CompeteExperiment class for finer control:
from hypergbm import HyperGBM, CompeteExperiment
from hypergbm.search_space import search_space_general
from hypernets.core.callbacks import EarlyStoppingCallback, SummaryCallback
from hypernets.searchers import EvolutionSearcher
from tabular_toolbox.datasets import dsutils
train_data = dsutils.load_blood()
def my_search_space():
    return search_space_general(early_stopping_rounds=10, verbose=0, cat_pipeline_mode='complex')
searcher = EvolutionSearcher(my_search_space,
optimize_direction='max', population_size=30, sample_size=10,
regularized=True, candidates_size=10)
es = EarlyStoppingCallback(time_limit=3600 * 3, mode='max')
hm = HyperGBM(searcher, reward_metric='auc', cache_dir=f'hypergbm_cache', clear_cache=True,
callbacks=[es, SummaryCallback()])
X = train_data
y = train_data.pop('Class')
experiment = CompeteExperiment(hm, X, y, eval_size=0.2,
cv=True, pseudo_labeling=False,
max_trials=20, use_cache=True)
estimator = experiment.run()
print(estimator)
Distribution with Dask¶
Quick Start¶
To run a HyperGBM experiment on a Dask cluster, set up the default Dask client before calling make_experiment, like this:
from dask.distributed import LocalCluster, Client
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
def train():
    cluster = LocalCluster(processes=True)
    client = Client(cluster)
    train_data = '/opt/data/my_data.csv'
    experiment = make_experiment(train_data, target='...')
    estimator = experiment.run()
    print(estimator)
if __name__ == '__main__':
    train()
Users can also use dask.dataframe to load the training data as a Dask DataFrame and create the experiment from it:
from dask import dataframe as dd
from dask.distributed import LocalCluster, Client
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
def train():
    cluster = LocalCluster(processes=False)
    client = Client(cluster)
    train_data = dd.from_pandas(dsutils.load_blood(), npartitions=1)
    experiment = make_experiment(train_data, target='Class')
    estimator = experiment.run()
    print(estimator)
if __name__ == '__main__':
    train()
Refer to the Dask documentation on creating DataFrames for more details.
Customize Search Space¶
To run an experiment on a Dask cluster, all transformers and estimators must support Dask objects. Refer to hypergbm.dask.search_space.search_space_general for details on how to customize the search space.
from dask import dataframe as dd
from dask.distributed import LocalCluster, Client
from hypergbm import make_experiment
from hypergbm.dask.search_space import search_space_general
from tabular_toolbox.datasets import dsutils
def my_search_space():
    return search_space_general(n_esitimators=100)
def train():
    cluster = LocalCluster(processes=False)
    client = Client(cluster)
    train_data = dd.from_pandas(dsutils.load_blood(), npartitions=1)
    experiment = make_experiment(train_data, target='Class', searcher='mcts', search_space=my_search_space)
    estimator = experiment.run()
    print(estimator)
if __name__ == '__main__':
    train()
Using HyperGBM¶
Basic examples¶
Cross-validation¶
HyperGBM supports cross-validation for model evaluation. Specify cv=True to enable it and use the num_folds parameter to set the number of folds:
...
hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
hk.search(X_train, y_train, X_eval=None, y_eval=None, cv=True, num_folds=3) # 3 folds
...
With cross-validation, the evaluation data is taken from the folds of X_train and y_train, so set X_eval=None and y_eval=None.
Here is an example:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>>
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>>
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM
>>> from hypernets.searchers import MCTSSearcher
>>>
>>> rs = MCTSSearcher(search_space_general, max_node_space=10, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> hk.search(X_train, y_train, X_eval=None, y_eval=None, cv=True, num_folds=3) # using Cross Validation
Trial No. Reward Elapsed Space Vector
0 4 0.941667 0.331012 [1, 3, 1, 1, 370, 3, 2, 3, 4, 0]
1 7 0.933333 0.290077 [0, 0, 1, 0, 3, 1, 1, 2, 1, 2, 3]
2 1 0.925000 0.472835 [0, 0, 0, 3, 0, 1, 0, 2, 0, 4]
3 3 0.925000 0.422006 [0, 1, 0, 1, 1, 1, 1, 0, 0, 1]
4 8 0.925000 0.228165 [0, 1, 0, 3, 2, 0, 2, 0, 2, 0]
>>> estimator = hk.load_estimator(hk.get_best_trial().model_file)
>>>
>>> estimator.cv_gbm_models_
[LGBMClassifierWrapper(boosting_type='dart', learning_rate=0.5, max_depth=10,
n_estimators=200, num_leaves=370, reg_alpha=1,
reg_lambda=1), LGBMClassifierWrapper(boosting_type='dart', learning_rate=0.5, max_depth=10,
n_estimators=200, num_leaves=370, reg_alpha=1,
reg_lambda=1), LGBMClassifierWrapper(boosting_type='dart', learning_rate=0.5, max_depth=10,
n_estimators=200, num_leaves=370, reg_alpha=1,
reg_lambda=1)]
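To make the fold mechanics concrete, here is a minimal sketch of how a training set can be divided into folds so that each row is used for evaluation exactly once. It is a plain illustration of k-fold splitting, not the splitter HyperGBM uses internally. Real splitters usually also shuffle or stratify the rows, which is omitted here. The helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, num_folds):
    """Yield (train_idx, eval_idx) pairs; each row is used for evaluation exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // num_folds + (1 if i < n_samples % num_folds else 0)
                  for i in range(num_folds)]
    start = 0
    for size in fold_sizes:
        eval_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, eval_idx
        start += size


# 120 rows corresponds to 80% of the 150 iris samples used above
folds = list(k_fold_indices(n_samples=120, num_folds=3))
for train_idx, eval_idx in folds:
    print(len(train_idx), len(eval_idx))  # 80 40 for each of the 3 folds
```

One model is trained per fold, which is consistent with the three entries listed in cv_gbm_models_ above.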
Search strategies¶
HyperGBM provides the following search strategies (implementation class in parentheses):
Evolution search(hypernets.searchers.evolution_searcher.EvolutionSearcher)
Monte Carlo Tree Search(hypernets.searchers.mcts_searcher.MCTSSearcher)
Random search(hypernets.searchers.random_searcher.RandomSearcher)
Here is an example that uses the evolution search strategy:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>>
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>>
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> rs = EvolutionSearcher(search_space_general, 200, 100, optimize_direction='max') # using EvolutionSearcher
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> hk.search(X_train, y_train, X_eval=X_test, y_eval=y_test)
Trial No. Reward Elapsed Space Vector
0 1 1.0 0.187103 [1, 2, 0, 1, 160, 3, 0, 1, 2]
1 2 1.0 0.358584 [2, 3, 1, 3, 2, 0, 0]
2 3 1.0 0.127980 [1, 1, 1, 0, 125, 0, 0, 3, 3, 0]
3 4 1.0 0.084272 [1, 1, 0, 2, 115, 1, 2, 3, 0]
4 7 1.0 0.152720 [1, 0, 0, 1, 215, 3, 3, 1, 2]
>>> estimator = hk.load_estimator(hk.get_best_trial().model_file)
>>> y_pred = estimator.predict(X_test)
>>>
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test, y_pred)
1.0
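For intuition about what a searcher does, the simplest of the three strategies, random search, can be sketched in a few lines. Everything below is a toy stand-in: the `SPACE` dictionary, the `toy_reward` function and the `random_search` helper are hypothetical and are not part of hypernets. In HyperGBM the search space is built by search_space_general and the reward comes from training and scoring a real model.

```python
import random

# Hypothetical toy search space; a real one is built by search_space_general
SPACE = {
    "learning_rate": [0.01, 0.1, 0.5],
    "max_depth": [3, 5, 10],
    "n_estimators": [50, 100, 200],
}


def toy_reward(params):
    """Stand-in for training and scoring a model; higher is better."""
    return 1.0 - abs(params["learning_rate"] - 0.1) - abs(params["max_depth"] - 5) / 10


def random_search(space, reward_fn, max_trials, seed=0):
    """Sample random configurations and keep the one with the highest reward."""
    rng = random.Random(seed)
    best_params, best_reward = None, float("-inf")
    for _ in range(max_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        reward = reward_fn(params)
        if reward > best_reward:  # corresponds to optimize_direction='max'
            best_params, best_reward = params, reward
    return best_params, best_reward


best_params, best_reward = random_search(SPACE, toy_reward, max_trials=20)
print(best_params, best_reward)
```

Evolution search and MCTS differ from this only in how the next configuration is chosen: they use the rewards of earlier trials instead of sampling blindly.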
Early stopping¶
When the model's performance can no longer be improved, or when certain conditions are met, the search can be terminated early to release computing resources. The currently supported strategies are:
max_no_improvement_trials
time_limit
expected_reward
When multiple conditions are set, the search stops as soon as any one of them is reached.
The early stopping strategy is implemented by the class hypernets.core.callbacks.EarlyStoppingCallback.
Here is an example in which the search stops once the reward reaches 0.95:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from hypernets.core import EarlyStoppingCallback
>>>
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> rs = EvolutionSearcher(search_space_general, 200, 100, optimize_direction='max')
>>> es = EarlyStoppingCallback(expected_reward=0.95, mode='max') # Parameter `mode` is the direction of parameter `expected_reward` optimization, the reward metric is accuracy, so set mode to `max`
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[es])
>>> hk.search(X_train, y_train, X_eval=X_test, y_eval=y_test)
Early stopping on trial : 1, best reward: None, best_trial: None
Trial No. Reward Elapsed Space Vector
0 1 1.0 0.189758 [0, 1, 1, 3, 2, 1, 1, 2, 3, 0, 0]
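The way the three conditions interact can be sketched as a small class. This is a toy illustration, not the hypernets EarlyStoppingCallback: the class name `EarlyStopper` and its `should_stop` method are hypothetical, and treating a reward equal to expected_reward as reaching the target is an assumption.

```python
import time


class EarlyStopper:
    """Toy illustration of the three stop conditions; not the hypernets class."""

    def __init__(self, max_no_improvement_trials=None, time_limit=None,
                 expected_reward=None, mode="max"):
        self.max_no_improvement_trials = max_no_improvement_trials
        self.time_limit = time_limit  # seconds
        self.expected_reward = expected_reward
        self.mode = mode
        self.best_reward = None
        self.no_improvement = 0
        self.start = time.monotonic()

    def _better(self, a, b):
        return a > b if self.mode == "max" else a < b

    def should_stop(self, reward):
        """Record one trial's reward and report whether any condition is met."""
        if self.best_reward is None or self._better(reward, self.best_reward):
            self.best_reward, self.no_improvement = reward, 0
        else:
            self.no_improvement += 1
        # Assumption: a reward equal to the target also counts as reaching it
        if self.expected_reward is not None and (
                reward == self.expected_reward
                or self._better(reward, self.expected_reward)):
            return True
        if (self.max_no_improvement_trials is not None
                and self.no_improvement >= self.max_no_improvement_trials):
            return True
        if (self.time_limit is not None
                and time.monotonic() - self.start >= self.time_limit):
            return True
        return False


stopper = EarlyStopper(expected_reward=0.95, max_no_improvement_trials=3)
rewards = [0.90, 0.92, 0.97, 0.99]
stopped_at = next(i for i, r in enumerate(rewards, start=1) if stopper.should_stop(r))
print(stopped_at)  # 3: the third trial is the first to reach 0.95
```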
Advanced examples¶
Pseudo label¶
HyperGBM can use the test set in a semi-supervised way during training to improve model performance. Usage:
...
experiment = CompeteExperiment(hk, X_train, y_train, X_test=X_test, callbacks=[], scorer=get_scorer('accuracy'),
pseudo_labeling=True, # Enable pseudo label
pseudo_labeling_proba_threshold=0.9)
...
Here is an example:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>>
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM, CompeteExperiment
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> from sklearn.metrics import get_scorer
>>>
>>> rs = EvolutionSearcher(search_space_general, 200, 100, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> experiment = CompeteExperiment(hk, X_train, y_train, X_test=X_test, callbacks=[], scorer=get_scorer('accuracy'),
... pseudo_labeling=True, # enable pseudo
... pseudo_labeling_proba_threshold=0.9)
>>>
>>> pipeline = experiment.run(use_cache=True, max_trials=10) # first stage train a model to label test dataset, the second stage train using labeled test dataset and train dataset
Trial No. Reward Elapsed Space Vector
0 3 0.972222 0.194367 [0, 3, 1, 2, 3, 1, 3, 0, 0, 1, 0]
1 5 0.972222 0.130711 [0, 2, 1, 0, 2, 0, 3, 0, 1, 4, 3]
2 8 0.972222 0.113038 [0, 1, 0, 0, 1, 0, 2, 0, 2, 3]
3 10 0.972222 0.134826 [1, 2, 0, 0, 500, 3, 2, 3, 4]
4 1 0.944444 0.251970 [2, 2, 0, 3, 1, 2]
Trial No. Reward Elapsed Space Vector
0 1 0.972222 0.338019 [2, 0, 1, 0, 2, 4, 1]
1 2 0.972222 0.232059 [2, 3, 1, 1, 0, 4, 1]
2 3 0.972222 0.207254 [2, 3, 0, 3, 0, 2]
3 4 0.972222 0.262670 [2, 1, 1, 2, 1, 1, 0]
4 6 0.972222 0.246977 [2, 3, 0, 3, 1, 1]
>>> pipeline
Pipeline(steps=[('data_clean',
DataCleanStep(data_cleaner_args={}, name='data_clean',
random_state=9527)),
('estimator',
GreedyEnsemble(weight=[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.], estimators=[<hypergbm.estimators.CatBoostClassifierWrapper object at 0x1a38139110>, None, None, None, None, None, None, None, None, None]))])
>>> import numpy as np
>>> y_pred = pipeline.predict(X_test).astype(np.float64)
>>>
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_pred, y_test)
1.0
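The role of pseudo_labeling_proba_threshold can be illustrated with a small sketch. After the first stage, only test rows whose predicted class probability is high enough are given a pseudo label and added to the training data for the second stage. The helper `select_pseudo_labels` and the probability values are hypothetical, and the exact selection rule used by HyperGBM may differ.

```python
def select_pseudo_labels(probabilities, classes, threshold=0.9):
    """Return (row_index, label) pairs for rows whose top class probability meets the threshold."""
    selected = []
    for row_index, row in enumerate(probabilities):
        best = max(range(len(row)), key=row.__getitem__)  # index of the most likely class
        if row[best] >= threshold:
            selected.append((row_index, classes[best]))
    return selected


# Hypothetical first-stage predictions for four unlabeled test rows (three classes)
proba = [
    [0.97, 0.02, 0.01],
    [0.40, 0.35, 0.25],
    [0.05, 0.93, 0.02],
    [0.10, 0.15, 0.75],
]
pseudo = select_pseudo_labels(proba, classes=[0, 1, 2], threshold=0.9)
print(pseudo)  # [(0, 0), (2, 1)]
```

A higher threshold adds fewer but more reliable pseudo labels; a lower one adds more rows at the risk of training on wrong labels.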
Feature selection¶
HyperGBM turns features into noise one at a time and measures the resulting drop in model performance: the larger the drop when a feature becomes noise, the more important that feature is. Based on these importances, a subset of the features is selected and the model is retrained, which saves computing resources and time. Here is an example:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>>
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM, CompeteExperiment
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> from sklearn.metrics import get_scorer
>>>
>>> rs = EvolutionSearcher(search_space_general, 200, 100, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>>
>>> experiment = CompeteExperiment(hk, X_train, y_train, X_test, y_test, callbacks=[], scorer=get_scorer('accuracy'),
... feature_reselection=True, # enable feature importance selection
... feature_reselection_estimator_size=3, # use 3 estimators to evaluate feature importance
... feature_reselection_threshold=0.01) # importance less than the threshold will not be selected
>>> pipeline = experiment.run(use_cache=True, max_trials=10)
Trial No. Reward Elapsed Space Vector
0 2 1.0 0.373262 [2, 3, 0, 2, 2, 1]
1 3 1.0 0.194120 [1, 3, 1, 1, 365, 1, 3, 1, 0, 3]
2 4 1.0 0.109643 [1, 0, 1, 2, 140, 0, 2, 3, 4, 1]
3 6 1.0 0.107316 [0, 3, 0, 2, 2, 0, 1, 2, 2, 2]
4 7 1.0 0.117224 [1, 0, 1, 2, 40, 2, 1, 2, 4, 0]
feature importance std
0 sepal length (cm) 0.000000 0.000000
1 sepal width (cm) 0.011111 0.015713
2 petal length (cm) 0.495556 0.199580
3 petal width (cm) 0.171111 0.112787
Trial No. Reward Elapsed Space Vector
0 1 1.0 0.204705 [0, 1, 0, 2, 0, 1, 3, 0, 4, 3]
1 2 1.0 0.109204 [1, 1, 1, 2, 90, 1, 2, 0, 0, 1]
2 3 1.0 0.160209 [1, 2, 1, 0, 305, 3, 0, 0, 1, 1]
3 4 1.0 1.062759 [2, 1, 1, 2, 3, 1, 0]
4 6 1.0 0.218692 [0, 0, 0, 1, 0, 1, 2, 0, 0, 3]
>>>
>>> import numpy as np
>>> y_pred = pipeline.predict(X_test).astype(np.float64)
>>>
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_pred, y_test)
1.0
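The permutation importance idea described above can be sketched in plain Python. This is a generic illustration under simplifying assumptions, not HyperGBM's implementation: the "model" is a fixed rule standing in for a trained estimator, and the helper names `accuracy` and `permutation_importance` are hypothetical. The final selection step mirrors feature_reselection_threshold=0.01.

```python
import random


def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean accuracy drop per column when that column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)  # turn this column into noise
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label depends only on column 0; column 1 is ignored by the model
rows = [[i, (7 * i) % 5] for i in range(20)]
labels = [int(r[0] >= 10) for r in rows]


def predict(row):
    """Stand-in for a trained model."""
    return int(row[0] >= 10)


importances = permutation_importance(predict, rows, labels)
# Keep only features whose importance reaches the threshold
selected = [col for col, imp in enumerate(importances) if imp >= 0.01]
print(importances, selected)
```

Shuffling column 1 never changes a prediction, so its importance is exactly zero and it is dropped, much like sepal length (cm) in the output above.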
Concept drift¶
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>>
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>>
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM, CompeteExperiment
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> from sklearn.metrics import get_scorer
>>>
>>> rs = EvolutionSearcher(search_space_general, 200, 100, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>>
>>> experiment = CompeteExperiment(hk, X_train, y_train, X_test, y_test, callbacks=[], scorer=get_scorer('accuracy'),
... drift_detection=True) # enable drift detection
>>> pipeline = experiment.run(use_cache=True, max_trials=10)
Trial No. Reward Elapsed Space Vector
0 1 1.0 0.236796 [2, 2, 1, 3, 0, 4, 2]
1 3 1.0 0.207033 [0, 0, 0, 4, 1, 1, 2, 2, 1, 3]
2 4 1.0 0.106351 [1, 2, 0, 2, 240, 3, 2, 1, 2]
3 5 1.0 0.110495 [0, 0, 0, 2, 2, 0, 2, 1, 2, 2]
4 6 1.0 0.175838 [0, 3, 1, 3, 2, 1, 3, 1, 1, 4, 1]
>>> import numpy as np
>>> y_pred = pipeline.predict(X_test).astype(np.float64)
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_pred, y_test)
1.0
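As background for drift_detection=True, one common way to quantify drift for a single feature is to ask how well that feature alone separates training rows from test rows. The sketch below computes this as an AUC-style score. It is an illustration of the general idea only; the function name `drift_auc` is hypothetical, and HyperGBM's own drift detection may work differently.

```python
def drift_auc(train_values, test_values):
    """AUC of telling test rows from train rows using one feature's value alone.

    0.5 means the two samples look alike on this feature; values near 0 or 1
    mean the feature's distribution differs between them.
    """
    wins = 0.0
    for t in test_values:
        for s in train_values:
            if t > s:
                wins += 1.0
            elif t == s:
                wins += 0.5  # ties count as half
    return wins / (len(train_values) * len(test_values))


stable_train, stable_test = [1, 2, 3, 4, 5, 6], [2, 3, 4, 5]
shifted_train, shifted_test = [1, 2, 3, 4, 5, 6], [7, 8, 9, 10]

print(drift_auc(stable_train, stable_test))    # 0.5
print(drift_auc(shifted_train, shifted_test))  # 1.0
```

Features with a score far from 0.5 are candidates for removal, since a model trained on them may not generalize to the test distribution.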
Ensemble¶
HyperGBM can combine the better models found during the search into a single model with better generalization ability. Example:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>>
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>>
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM, CompeteExperiment
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> from sklearn.metrics import get_scorer
>>>
>>> rs = EvolutionSearcher(search_space_general, 200, 100, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> experiment = CompeteExperiment(hk, X_train, y_train, X_test, y_test, callbacks=[], scorer=get_scorer('accuracy'),
... ensemble_size=5) # set ensemble
>>> pipeline = experiment.run(use_cache=True, max_trials=10)
Trial No. Reward Elapsed Space Vector
0 1 1.0 0.856545 [2, 1, 1, 1, 3, 4, 1]
1 2 1.0 0.271147 [2, 0, 0, 1, 0, 2]
2 3 1.0 0.160234 [1, 0, 1, 0, 45, 2, 1, 3, 4, 0]
3 4 1.0 0.279989 [2, 0, 1, 0, 0, 1, 4]
4 5 1.0 0.262032 [2, 3, 1, 1, 0, 3, 2]
>>> import numpy as np
>>> y_pred = pipeline.predict(X_test).astype(np.float64)
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_pred, y_test)
1.0
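The pipeline printed in the pseudo label example shows a GreedyEnsemble with one weight per estimator. A greedy ensemble of this kind can be sketched as follows. This is an assumed, simplified version (greedy selection with replacement on a binary task, scored by accuracy), not HyperGBM's implementation, and the helper names and probability values are hypothetical.

```python
def ensemble_accuracy(member_probas, labels):
    """Accuracy of the averaged positive-class probability, thresholded at 0.5."""
    n = len(labels)
    avg = [sum(p[i] for p in member_probas) / len(member_probas) for i in range(n)]
    return sum((a >= 0.5) == bool(y) for a, y in zip(avg, labels)) / n


def greedy_ensemble(model_probas, labels, ensemble_size):
    """Pick models one at a time (with replacement), each time taking the one that helps most."""
    chosen = []
    for _ in range(ensemble_size):
        scores = [ensemble_accuracy([model_probas[j] for j in chosen] + [p], labels)
                  for p in model_probas]
        chosen.append(max(range(len(model_probas)), key=scores.__getitem__))
    # A model's weight is the share of picks it received
    return [chosen.count(j) / ensemble_size for j in range(len(model_probas))]


labels = [1, 1, 0, 0, 1, 0]
model_probas = [
    [0.9, 0.4, 0.2, 0.8, 0.7, 0.1],  # model 0: wrong on rows 1 and 3
    [0.6, 0.7, 0.6, 0.2, 0.4, 0.3],  # model 1: wrong on rows 2 and 4
    [0.8, 0.9, 0.1, 0.3, 0.8, 0.2],  # model 2: correct on every row
]
weights = greedy_ensemble(model_probas, labels, ensemble_size=4)
print(weights)
```

Models that are never picked end up with weight 0, which matches the shape of the weight vector shown in that printed pipeline.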
Distributed¶
Custom search space¶
Feature generation¶
More features can be generated based on continuous features, such as the difference between two columns:
>>> import pandas as pd
>>> df = pd.DataFrame(data={"x1": [1, 2, 4], "x2": [9, 8, 7]})
>>> df
x1 x2
0 1 9
1 2 8
2 4 7
>>> from hypergbm.feature_generators import FeatureGenerationTransformer
>>> ft = FeatureGenerationTransformer(trans_primitives=['subtract_numeric'])
>>> ft.fit(df)
<hypergbm.feature_generators.FeatureGenerationTransformer object at 0x101839d10>
>>> ft.transform(df)
x1 x2 x1 - x2
e_hypernets_ft_index
0 1 9 -8
1 2 8 -6
2 4 7 -3
The supported operations, including subtract_numeric, are:
add_numeric
subtract_numeric
divide_numeric
multiply_numeric
negate
modulo_numeric
modulo_by_feature
cum_mean
cum_sum
cum_min
cum_max
percentile
absolute
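What a pairwise numeric primitive does can be shown with a small sketch in plain Python. It only mimics the effect of three of the primitives on a dictionary of columns and is not how FeatureGenerationTransformer is implemented. The helper name `generate_numeric_features` is hypothetical.

```python
from itertools import combinations


def generate_numeric_features(columns, primitives):
    """Add one new column per primitive and per pair of input columns."""
    ops = {
        "add_numeric": ("+", lambda a, b: a + b),
        "subtract_numeric": ("-", lambda a, b: a - b),
        "multiply_numeric": ("*", lambda a, b: a * b),
    }
    result = dict(columns)  # keep the original columns
    for left, right in combinations(columns, 2):
        for primitive in primitives:
            symbol, fn = ops[primitive]
            result[f"{left} {symbol} {right}"] = [
                fn(a, b) for a, b in zip(columns[left], columns[right])
            ]
    return result


df = {"x1": [1, 2, 4], "x2": [9, 8, 7]}
out = generate_numeric_features(df, ["subtract_numeric", "add_numeric"])
print(out["x1 - x2"])  # [-8, -6, -3], the same values as in the output above
print(out["x1 + x2"])  # [10, 10, 11]
```

Because every pair of columns yields a new feature for each primitive, the number of generated features grows quickly with the number of input columns.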
It can also extract fields such as year, month, and day from a datetime feature:
>>> import pandas as pd
>>> from datetime import datetime
>>> df = pd.DataFrame(data={"x1": pd.to_datetime([datetime.now()] * 10)})
>>> df[:3]
x1
0 2021-01-25 10:27:54.776580
1 2021-01-25 10:27:54.776580
2 2021-01-25 10:27:54.776580
>>> from hypergbm.feature_generators import FeatureGenerationTransformer
>>> ft = FeatureGenerationTransformer(trans_primitives=["year", "month", "week", "minute", "day", "hour", "second", "weekday", "is_weekend"])
>>> ft.fit(df)
<hypergbm.feature_generators.FeatureGenerationTransformer object at 0x1a29624dd0>
>>> ft.transform(df)
x1 YEAR(x1) MONTH(x1) WEEK(x1) MINUTE(x1) DAY(x1) HOUR(x1) SECOND(x1) WEEKDAY(x1) IS_WEEKEND(x1)
e_hypernets_ft_index
0 2021-01-25 10:27:54.776580 2021 1 4 27 25 10 54 0 False
1 2021-01-25 10:27:54.776580 2021 1 4 27 25 10 54 0 False
2 2021-01-25 10:27:54.776580 2021 1 4 27 25 10 54 0 False
3 2021-01-25 10:27:54.776580 2021 1 4 27 25 10 54 0 False
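The extracted fields correspond to standard datetime attributes. The sketch below reproduces the values in the output above with Python's datetime module alone. The helper `datetime_features` is hypothetical, and mapping the week primitive to the ISO week number is an assumption that happens to agree with the output shown.

```python
from datetime import datetime


def datetime_features(value):
    """Extract calendar fields from a single datetime value."""
    return {
        "year": value.year,
        "month": value.month,
        "week": value.isocalendar()[1],  # ISO week number (assumed)
        "day": value.day,
        "hour": value.hour,
        "minute": value.minute,
        "second": value.second,
        "weekday": value.weekday(),          # Monday is 0
        "is_weekend": value.weekday() >= 5,  # Saturday or Sunday
    }


features = datetime_features(datetime(2021, 1, 25, 10, 27, 54, 776580))
print(features)
```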
Using feature generation in search space:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> from hypergbm.estimators import XGBoostEstimator
>>> from hypergbm.pipeline import Pipeline
>>> from hypergbm.sklearn.transformers import FeatureGenerationTransformer
>>> from hypernets.core.ops import ModuleChoice, HyperInput
>>> from hypernets.core.search_space import HyperSpace
>>> from tabular_toolbox.column_selector import column_exclude_datetime
>>>
>>> def search_space(task=None):  # Define a search space that includes feature generation
...     space = HyperSpace()
...     with space.as_default():
...         input = HyperInput(name='input1')
...         feature_gen = FeatureGenerationTransformer(task=task,  # Add feature generation to the search space
...                                                    trans_primitives=["add_numeric", "subtract_numeric", "divide_numeric", "multiply_numeric"])
...         full_pipeline = Pipeline([feature_gen], name=f'feature_gen_and_preprocess', columns=column_exclude_datetime)(input)
...         xgb_est = XGBoostEstimator(fit_kwargs={})
...         ModuleChoice([xgb_est], name='estimator_options')(full_pipeline)
...         space.set_inputs(input)
...     return space
>>>
>>> from hypergbm import HyperGBM
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>>
>>> rs = EvolutionSearcher(search_space, 200, 100, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> hk.search(X_train, y_train, X_eval=X_test, y_eval=y_test)
Trial No. Reward Elapsed Space Vector
0 1 1.0 0.376869 []
>>> estimator = hk.load_estimator(hk.get_best_trial().model_file)
>>> y_pred = estimator.predict(X_test)
>>>
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test, y_pred)
1.0
Using GBM estimators¶
The GBM algorithms (wrapper class) supported by HyperGBM are:
XGBoost (hypergbm.estimators.XGBoostEstimator)
HistGB (hypergbm.estimators.HistGBEstimator)
LightGBM (hypergbm.estimators.LightGBMEstimator)
CatBoost (hypergbm.estimators.CatBoostEstimator)
The hyperparameters are defined in the search space used for training. Here is an example that uses XGBoost to train on the iris dataset:
# Load dataset
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X[:3]
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
>>> y[:3]
0 0
1 0
2 0
Name: target, dtype: int64
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> from hypergbm.estimators import XGBoostEstimator
>>> from hypergbm.pipeline import Pipeline, DataFrameMapper
>>> from hypergbm.sklearn.transformers import MinMaxScaler, StandardScaler
>>> from hypernets.core import OptimizeDirection
>>> from hypernets.core.ops import ModuleChoice, HyperInput
>>> from hypernets.core.search_space import HyperSpace
>>> from tabular_toolbox.column_selector import column_number_exclude_timedelta
# Define a search space that includes XGBoost
>>> def search_space():
...     space = HyperSpace()
...     with space.as_default():
...         input = HyperInput(name='input1')
...         scaler_choice = ModuleChoice(
...             [
...                 StandardScaler(name=f'numeric_standard_scaler'),
...                 MinMaxScaler(name=f'numeric_minmax_scaler')
...             ], name=f'numeric_or_scaler'
...         )
...         num_pipeline = Pipeline([scaler_choice], name='numeric_pipeline', columns=column_number_exclude_timedelta)(input)
...         union_pipeline = DataFrameMapper(default=None, input_df=True, df_out=True)([num_pipeline])
...         xgb_est = XGBoostEstimator(fit_kwargs={})
...         ModuleChoice([xgb_est], name='estimator_options')(union_pipeline)  # Make XGBoost an estimator choice
...         space.set_inputs(input)
...     return space
# Search
>>> from hypergbm import HyperGBM
>>> from hypernets.searchers import MCTSSearcher
>>> rs = MCTSSearcher(search_space, max_node_space=10, optimize_direction=OptimizeDirection.Maximize)
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>>
>>> hk.search(X_train, y_train, X_eval=X_test, y_eval=y_test)
Trial No. Reward Elapsed Space Vector
0 1 1.0 0.206926 [0]
1 2 1.0 0.069099 [1]
Class balancing¶
HyperGBM supports several strategies for handling imbalanced data:
Class weight: ClassWeight
Over sampling: RandomOverSampling, SMOTE, ADASYN
Down sampling: RandomUnderSampling, NearMiss, TomeksLinks
Configure class balancing policies in estimator:
...
xgb_est = XGBoostEstimator(fit_kwargs={}, class_balancing='ClassWeight') # Use class balancing
...
Here is an example of training with the ClassWeight strategy:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> from hypergbm.estimators import XGBoostEstimator
>>> from hypergbm.pipeline import Pipeline, DataFrameMapper
>>> from hypergbm.sklearn.transformers import MinMaxScaler, StandardScaler
>>> from hypernets.core.ops import ModuleChoice, HyperInput
>>> from hypernets.core.search_space import HyperSpace
>>> from tabular_toolbox.column_selector import column_number_exclude_timedelta
>>>
>>> def search_space():
...     space = HyperSpace()
...     with space.as_default():
...         input = HyperInput(name='input1')
...         scaler_choice = ModuleChoice(
...             [
...                 StandardScaler(name=f'numeric_standard_scaler'),
...                 MinMaxScaler(name=f'numeric_minmax_scaler')
...             ], name='numeric_or_scaler'
...         )
...         num_pipeline = Pipeline([scaler_choice], name='numeric_pipeline', columns=column_number_exclude_timedelta)(input)
...         union_pipeline = DataFrameMapper(default=None, input_df=True, df_out=True)([num_pipeline])
...         xgb_est = XGBoostEstimator(fit_kwargs={}, class_balancing='ClassWeight')  # Use class balancing
...         ModuleChoice([xgb_est], name='estimator_options')(union_pipeline)
...         space.set_inputs(input)
...     return space
>>> from hypergbm import HyperGBM
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>>
>>> rs = EvolutionSearcher(search_space, 200, 100, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> hk.search(X_train, y_train, X_eval=X_test, y_eval=y_test)
Trial No. Reward Elapsed Space Vector
0 1 1.0 0.100520 [0]
1 2 1.0 0.083927 [1]
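For intuition, a class weight strategy typically gives rare classes a larger weight so that each class contributes equally to the training loss. The sketch below uses the widely used "balanced" formula n_samples / (n_classes * class_count). Whether HyperGBM's ClassWeight policy uses exactly this formula is an assumption, and the helper `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 8 + [1] * 2  # imbalanced toy labels: 8 negatives, 2 positives
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}

# Per-row weights that could be passed to an estimator's fit method
sample_weight = [weights[y] for y in labels]
print(sum(sample_weight))  # 10.0: each class contributes 5.0 in total
```

The sampling strategies listed above pursue the same goal by changing the data instead: over sampling adds minority rows, down sampling removes majority rows.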
How-To¶
How to install shap on CentOS 7?¶
Install system dependencies
yum install epel-release centos-release-scl -y && yum clean all && yum makecache  # llvm9.0 is in epel, gcc9 in scl
yum install -y llvm9.0 llvm9.0-devel python36-devel devtoolset-9-gcc devtoolset-9-gcc-c++ make cmake
Configure install environment
whereis llvm-config-9.0-64  # find your `llvm-config` path
# llvm-config-9: /usr/bin/llvm-config-9.0-64
export LLVM_CONFIG=/usr/bin/llvm-config-9.0-64  # set to your path
scl enable devtoolset-9 bash
Install shap
pip3 -v install numpy==1.19.1  # prepare shap dependency
pip3 -v install scikit-learn==0.23.1  # prepare shap dependency
pip3 -v install shap==0.28.5
If downloading shap's dependencies is very slow, consider using faster pip and setuptools mirrors. Taking the mirror provided by aliyun as an example, create the file ~/.pip/pip.conf with content:
[global]
index-url = https://mirrors.aliyun.com/pypi/simple
Then create the file ~/.pydistutils.cfg with content:
[easy_install]
index_url = https://mirrors.aliyun.com/pypi/simple
Release Note¶
Version 0.2.0¶
This release adds the following new features:
- Feature engineering
Feature generation
Feature selection
- Data cleaning
Special empty value handling
Correct data type
Id-ness features cleanup
Duplicate features cleanup
Empty label rows cleanup
Illegal values replacement
Constant features cleanup
Collinearity features cleanup
- Data set split
Adversarial validation
- Modeling algorithms
XGBoost
Catboost
LightGBM
HistGradientBoosting
- Training
Task inference
Command-line tools
- Evaluation strategies:
Cross-validation
Train-Validation-Holdout
- Search strategies
Monte Carlo Tree Search
Evolution
Random search
- Imbalanced data
Class Weight
Under-Sampling - Near miss - Tomeks links - Random
Over-Sampling - SMOTE - ADASYN - Random
- Early stopping strategies
max_no_improvement_trials
time_limit
expected_reward
- Advanced features:
Two-stage search - Pseudo label - Feature selection
Concept drift handling
Ensemble