Basic Usages

This section provides an example showing how to train a model with the tool make_experiment. The example uses the blood dataset, which is loaded from hypernets.tabular. The first rows of this dataset look as follows:

Recency,Frequency,Monetary,Time,Class
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
2,20,5000,45,1
1,24,6000,77,0
4,4,1000,4,0

...

Create and Run an Experiment

The tool make_experiment creates an executable experiment object; its only required parameter is train_data. Calling the method run of the created experiment object starts training and returns a model. Note that if the target column of the data is not named y, it must be set explicitly through the parameter target.

An example code:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class')
estimator = experiment.run()
print(estimator)

Output:

Pipeline(steps=[('data_clean',
                 DataCleanStep(...)),
                ('estimator',
                 GreedyEnsemble(...))])

As the output shows, the trained model is a Pipeline whose final step is an ensemble of multiple models.
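
The returned estimator is a scikit-learn style Pipeline, so it can be used for prediction in the usual way. The following sketch reuses the blood dataset loaded above and, purely for illustration, predicts on the training features; it assumes the standard scikit-learn predict interface:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class')
estimator = experiment.run()

# drop the target column and predict with the trained pipeline
X = train_data.drop(columns=['Class'])
predictions = estimator.predict(X)
print(predictions[:10])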

For training data stored in a .csv or .parquet file, the experiment can be created from the file path directly; make_experiment loads the data into a DataFrame automatically. For example:

from hypergbm import make_experiment

train_data = '/path/to/mydata.csv'
experiment = make_experiment(train_data, target='my_target')
estimator = experiment.run()
print(estimator)

Set the Number of Search Trials

One can set the maximum number of search trials through max_trials. The following code sets the maximum number of trials to 100:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', max_trials=100)
estimator = experiment.run()
print(estimator)

Use Cross Validation

Cross validation is controlled by the parameter cv. With cv=False the experiment does not use cross validation and splits the data with train_test_split instead; with cv=True the experiment uses cross validation, and the number of folds can be adjusted through the parameter num_folds (default 3).

Example code when cv=True:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=True, num_folds=5)
estimator = experiment.run()
print(estimator)

Evaluation Dataset

When cv=False, the experiment additionally needs an evaluation dataset to assess model performance. It can be supplied through the parameter eval_data when calling make_experiment. For example:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
from sklearn.model_selection import train_test_split

train_data = dsutils.load_blood()
train_data,eval_data=train_test_split(train_data,test_size=0.3)
experiment = make_experiment(train_data, target='Class', eval_data=eval_data, cv=False)
estimator = experiment.run()
print(estimator)

If eval_data is not given, the experiment splits an evaluation dataset off from train_data; its size can be adjusted by setting eval_size:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', cv=False, eval_size=0.2)
estimator = experiment.run()
print(estimator)

Set the Evaluation Criterion

The default evaluation metric of the experiment is accuracy for classification tasks and rmse for regression tasks. Other metrics can be set through reward_metric. For example:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', reward_metric='auc')
estimator = experiment.run()
print(estimator)

Set the Early Stopping

The early stopping strategy can be configured with the parameters early_stopping_rounds, early_stopping_time_limit and early_stopping_reward.

The following code limits the total search time to 3 hours:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', max_trials=300, early_stopping_time_limit=3600 * 3)
estimator = experiment.run()
print(estimator)
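
Similarly, early_stopping_rounds stops the search after a given number of trials without improvement, and early_stopping_reward stops it once the reward metric reaches a given value. A sketch combining the two (the thresholds below are only illustrative):

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

# stop after 20 consecutive trials without improvement,
# or as soon as the reward metric reaches 0.95
experiment = make_experiment(train_data, target='Class', max_trials=300,
                             early_stopping_rounds=20,
                             early_stopping_reward=0.95)
estimator = experiment.run()
print(estimator)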

Choose a Searcher

HyperGBM performs hyperparameter search using the search algorithms provided by Hypernets, which include EvolutionSearcher, MCTSSearcher and RandomSearcher. A specific searcher can be chosen by setting the parameter searcher of make_experiment.

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', searcher='random')
estimator = experiment.run()
print(estimator)

Furthermore, you can create a searcher object yourself and pass it to the experiment, for example:

from hypergbm import make_experiment
from hypergbm.search_space import search_space_general
from hypernets.searchers import MCTSSearcher
from hypernets.tabular.datasets import dsutils

my_searcher = MCTSSearcher(lambda: search_space_general(n_estimators=100),
                           max_node_space=20,
                           optimize_direction='max')

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', searcher=my_searcher)
estimator = experiment.run()
print(estimator)

Enable TrialStore

In HyperGBM experiments, trial information can be saved into a trial store. For the same dataset, HyperGBM reuses a trial result if it is found in the trial store. Enable the trial store with the option trial_store.

train_data = ...
experiment = make_experiment(train_data, trial_store='/tmp/trial_store', ...)
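
A sketch of the complete workflow, assuming a writable local directory for the store: when the experiment is run again on the same dataset with the same trial_store path, trials found in the store should be reused rather than evaluated again.

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

# first run: trial results are written to the trial store directory
experiment = make_experiment(train_data, target='Class', max_trials=10,
                             trial_store='/tmp/trial_store')
estimator = experiment.run()

# a later run on the same dataset reuses matching trials from the store
experiment = make_experiment(train_data, target='Class', max_trials=10,
                             trial_store='/tmp/trial_store')
estimator = experiment.run()
print(estimator)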

Ensemble Models

make_experiment turns on model ensembling automatically to obtain a better model. By default it ensembles the best 20 models; the number can be changed by setting ensemble_size, as in the following code. Setting ensemble_size=0 disables ensembling.

train_data = ...
experiment = make_experiment(train_data, ensemble_size=10, ...)
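
For instance, the following sketch disables ensembling entirely; based on the description above, the final step of the returned Pipeline should then be a single best model rather than a GreedyEnsemble:

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()

# ensemble_size=0 disables ensembling
experiment = make_experiment(train_data, target='Class', ensemble_size=0)
estimator = experiment.run()
print(estimator)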

Set Parallelism

By default, HyperGBM uses all available CPUs for parallel computing. The number of threads or processes can be controlled with the option n_jobs.

train_data = ...
experiment = make_experiment(train_data, n_jobs=10, ...)

Set Log Levels

Progress messages during training can be printed by setting log_level (str or int); refer to the Python logging package for the accepted values. In addition, more detailed messages are printed when verbose is set to 1.

The following code sets the log level to 'INFO':

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', log_level='INFO', verbose=1)
estimator = experiment.run()
print(estimator)

Output:

14:24:33 I hypernets.tabular.u._common.py 30 - 2 class detected, {0, 1}, so inferred as a [binary classification] task
14:24:33 I hypergbm.experiment.py 699 - create experiment with ['data_clean', 'drift_detection', 'space_search', 'final_ensemble']
14:24:33 I hypergbm.experiment.py 1262 - make_experiment with train data:(748, 4), test data:None, eval data:None, target:Class
14:24:33 I hypergbm.experiment.py 716 - fit_transform data_clean
14:24:33 I hypergbm.experiment.py 716 - fit_transform drift_detection
14:24:33 I hypergbm.experiment.py 716 - fit_transform space_search
14:24:33 I hypernets.c.meta_learner.py 22 - Initialize Meta Learner: dataset_id:7123e0d8c8bbbac8797ed9e42352dc59
14:24:33 I hypernets.c.callbacks.py 192 - 
Trial No:1
--------------------------------------------------------------
(0) estimator_options.hp_or:                                0
(1) numeric_imputer_0.strategy:                 most_frequent
(2) numeric_scaler_optional_0.hp_opt:                    True


...

14:24:35 I hypergbm.experiment.py 716 - fit_transform final_ensemble
14:24:35 I hypergbm.experiment.py 737 - trained experiment pipeline: ['data_clean', 'estimator']
Pipeline(steps=[('data_clean',
                 DataCleanStep(...)),
                ('estimator',
                 GreedyEnsemble(...))])

Experiment Visualization

HyperGBM provides a web-based user interface, enabled with the argument webui=True, which displays the experiment's processing steps and parameter information in a dashboard.

Note: This feature requires installing hypergbm with the command:

pip install hypergbm[board]

Example code enabling web-based experiment visualization is shown below:

from sklearn.model_selection import train_test_split

from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils

df = dsutils.load_bank()
df_train, df_test = train_test_split(df, test_size=0.8, random_state=42)

experiment = make_experiment(df_train, target='y', webui=True)
estimator = experiment.run(max_trials=10)

print(estimator)

The output is:

02-17 19:08:48 I hypernets.t.estimator_detector.py 85 - EstimatorDetector error: GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1
...
server is running at: 0.0.0.0:8888 
...

02-17 19:08:55 I hypernets.t.metrics.py 153 - calc_score ['auc', 'accuracy'], task=binary, pos_label=yes, classes=['no' 'yes'], average=None
final result:{'auc': 0.8913467492260062, 'accuracy': 0.8910699474702792}

Then you can view the experiment progress dashboard by opening http://localhost:8888 in a browser. A screenshot is shown below:

_images/experiment-web-visualization.png

Other options are available to configure the web UI: event_file_dir sets the directory for event files, server_port sets the HTTP server port, and exit_web_server_on_finish controls whether the web server exits after the experiment finishes. See the example:

...
webui_options = {
    'event_file_dir': "./events",  # persist experiment running events log to './events'
    'server_port': 8888, # http server port
    'exit_web_server_on_finish': False  # exit http server after experiment finished
}
experiment = make_experiment(df_train, target='y', webui=True, webui_options=webui_options)
...