Basic Applications
In this section, we are going to provide an example to show how to train a model using the experiment. In this example, we use the blood
dataset, which is loaded from hypernets.tabular
. The columns of this dataset can be shown as follows:
Recency,Frequency,Monetary,Time,Class
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
2,20,5000,45,1
1,24,6000,77,0
4,4,1000,4,0
...
Create and Run an Experiment
Using the tool make_experiment
can create an executable experiment object. The only necessary parameter when using this tool is train_data
. Then simply calling the method run
of the created experiment object will start training and return a model. Note that if the target column of the data is not y
, one needs to manually set it through the parameter target
.
An example code:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class')
estimator = experiment.run()
print(estimator)
output:
Pipeline(steps=[('data_clean',
DataCleanStep(...),
('estimator',
GreedyEnsemble(...)])
Training will return a Pipeline while the final returned model is a collection of multiple models.
For training data with file extension .csv or .parquet, the experiment can be created through using the data file path directly and make_experiment
will load data as DataFrame automatically. For an example:
from hypergbm import make_experiment
train_data = '/path/to/mydata.csv'
experiment = make_experiment(train_data, target='my_target')
estimator = experiment.run()
print(estimator)
Set the Number of Search Trials
One can set the max search trial number by adjusting max_trials
.
The following code sets the max searching time as 3 hours:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', max_trials=100)
estimator = experiment.run()
print(estimator)
Use Cross Validation
Users can apply cross validation in the experiment by manually setting parameter cv
. Setting cv
as ‘False’ will lead the experiment to avoid using cross validation and apply train_test_split instead. On the other hand, when cv
is True
, the experiment will use cross validation where the number of folds can be adjusted through the parameter num_folds
. The default value of num_folds
is 3.
Example code when cv=True
:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=True, num_folds=5)
estimator = experiment.run()
print(estimator)
Evaluation dataset
When cv=False
, training model will require evaluating its perfomance additionally on evaluation dataset. This can be done by setting eval_data
when creating make_experiment
. For example:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
from sklearn.model_selection import train_test_split
train_data = dsutils.load_blood()
train_data,eval_data=train_test_split(train_data,test_size=0.3)
experiment = make_experiment(train_data, target='Class', eval_data=eval_data, cv=False)
estimator = experiment.run()
print(estimator)
If the eval_data
is not given, the experiment object will split the train_data
to get an evaluation dataset, whose size can be adjusted by setting eval_size
:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=False, eval_size=0.2)
estimator = experiment.run()
print(estimator)
Set the Evaluation Criterion
The default evaluation criterion of the model when creating an experiment with make_experiment
for classification task is accuracy
, while the criterion for regression task is rmse
. Other criterions can be used by setting reward_metric
. For example:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', reward_metric='auc')
estimator = experiment.run()
print(estimator)
Set the Early Stopping
One can set the early stopping strategy with settings of early_stopping_round
, early_stopping_time_limit
and early_stopping_reward
.
The following code sets the max searching time as 3 hours:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', max_trials=300, early_stopping_time_limit=3600 * 3)
estimator = experiment.run()
print(estimator)
Choose a Searcher
HyperGBM performs hyperparameter search with the search algorithms provided by Hypernets, which includes EvolutionSearch, MCTSSearcher, RandomSearcher. One can choose a specific searcher when using make_experiment
by setting the parameter searcher
.
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', searcher='random')
estimator = experiment.run()
print(estimator)
Furthermore, you can make a new searcher object for experiment, for an example:
from hypergbm import make_experiment
from hypergbm.search_space import search_space_general
from hypernets.searchers import MCTSSearcher
from hypernets.tabular.datasets import dsutils
my_searcher = MCTSSearcher(lambda: search_space_general(n_estimators=100),
max_node_space=20,
optimize_direction='max')
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', searcher=my_searcher)
estimator = experiment.run()
print(estimator)
Ensemble Models
make_experiment
automatically turns on the model ensemble function to get a better model when created. It will ensemble the best 20 models while the number for ensembling can be changed by setting ensemble_size
as the following code, where ensemble_size=0
means no ensembling wii be made.
train_data = ...
experiment = make_experiment(train_data, ensemble_size=10, ...)
Change the log level
The progress messages during training can be shown by setting log_level
(str
or int
) to change the log level. Please refer the logging
package of python for further details. Besides, more thorough messages will show when verobs
is set as 1
.
The following codes sets the log level to ‘INFO’:
from hypergbm import make_experiment
from hypernets.tabular.datasets import dsutils
train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', log_level='INFO', verbose=1)
estimator = experiment.run()
print(estimator)
Output:
14:24:33 I hypernets.tabular.u._common.py 30 - 2 class detected, {0, 1}, so inferred as a [binary classification] task
14:24:33 I hypergbm.experiment.py 699 - create experiment with ['data_clean', 'drift_detection', 'space_search', 'final_ensemble']
14:24:33 I hypergbm.experiment.py 1262 - make_experiment with train data:(748, 4), test data:None, eval data:None, target:Class
14:24:33 I hypergbm.experiment.py 716 - fit_transform data_clean
14:24:33 I hypergbm.experiment.py 716 - fit_transform drift_detection
14:24:33 I hypergbm.experiment.py 716 - fit_transform space_search
14:24:33 I hypernets.c.meta_learner.py 22 - Initialize Meta Learner: dataset_id:7123e0d8c8bbbac8797ed9e42352dc59
14:24:33 I hypernets.c.callbacks.py 192 -
Trial No:1
--------------------------------------------------------------
(0) estimator_options.hp_or: 0
(1) numeric_imputer_0.strategy: most_frequent
(2) numeric_scaler_optional_0.hp_opt: True
...
14:24:35 I hypergbm.experiment.py 716 - fit_transform final_ensemble
14:24:35 I hypergbm.experiment.py 737 - trained experiment pipeline: ['data_clean', 'estimator']
Pipeline(steps=[('data_clean',
DataCleanStep(...),
('estimator',
GreedyEnsemble(...)