Experiment Examples

Basic Usage

In this chapter we'll show how to train models with a HyperGBM experiment. The following examples use the blood dataset, in which Class is the target feature.

Recency,Frequency,Monetary,Time,Class
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
2,20,5000,45,1
1,24,6000,77,0
4,4,1000,4,0

...

Use an experiment with default settings

Users can create an experiment instance with the utility function make_experiment and run it quickly. train_data is the only required parameter; all others are optional. target is also required if your target feature is not named y.

Code:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class')
estimator = experiment.run()
print(estimator)

Output:

Pipeline(steps=[('data_clean',
                 DataCleanStep(...)),
                ('estimator',
                 GreedyEnsemble(...))])

Process finished with exit code 0

As the console output shows, the trained model is a pipeline object, and the final estimator is an ensemble of several models.
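
Since the returned estimator is a scikit-learn compatible pipeline, it can be used for prediction directly. A minimal sketch, continuing the example above (the stand-in data below is for illustration only):

from tabular_toolbox.datasets import dsutils

# Reuse the blood features (without the target) as stand-in new data.
X_new = dsutils.load_blood().drop(columns=['Class'])
predictions = estimator.predict(X_new)
print(predictions[:10])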

If your training data is stored as a .csv or .parquet file, you can call make_experiment with the file path directly, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = '/path/to/mydata.csv'
experiment = make_experiment(train_data, target='my_target')
estimator = experiment.run()
print(estimator)

Cross Validation

make_experiment enables cross validation by default; users can disable it by setting cv=False. The number of folds can be changed with num_folds, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=True, num_folds=5)
estimator = experiment.run()
print(estimator)

Set up evaluation data (eval_data)

If cross validation is disabled, the experiment splits evaluation data from train_data by default; users can supply their own with eval_data, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
from sklearn.model_selection import train_test_split

train_data = dsutils.load_blood()
train_data, eval_data = train_test_split(train_data, test_size=0.3)
experiment = make_experiment(train_data, target='Class', eval_data=eval_data, cv=False)
estimator = experiment.run()
print(estimator)

If eval_data is None and cv is False, the experiment will split evaluation data from train_data; users can change the evaluation data size with eval_size, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', cv=False, eval_size=0.2)
estimator = experiment.run()
print(estimator)

Set up the search reward metric

The default search reward metric is accuracy; users can change it with reward_metric, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', reward_metric='auc')
estimator = experiment.run()
print(estimator)

Change the search trial number and set up early stopping

Users can limit the number of search trials with max_trials, and set up search early stopping with early_stopping_round, early_stopping_time_limit, and early_stopping_reward, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', max_trials=30, early_stopping_time_limit=3600 * 3)
estimator = experiment.run()
print(estimator)
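
These early stopping conditions can also be combined; the search stops as soon as any one of them is met. A minimal sketch (the threshold values below are illustrative assumptions, not recommendations):

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()

# Stop when the reward does not improve for 10 consecutive trials,
# or once it reaches 0.95, whichever happens first (illustrative values).
experiment = make_experiment(train_data, target='Class', max_trials=100,
                             early_stopping_round=10, early_stopping_reward=0.95)
estimator = experiment.run()
print(estimator)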

Drift detection

To enable feature drift detection, set drift_detection=True and pass the testing data with test_data, like this:

from io import StringIO
import pandas as pd
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

test_data = """
Recency,Frequency,Monetary,Time
2,10,2500,64
4,5,1250,23
4,9,2250,46
4,5,1250,23
4,8,2000,40
2,12,3000,82
11,24,6000,64
2,7,1750,46
4,11,2750,61
1,7,1750,57
2,11,2750,79
2,3,750,16
4,5,1250,26
2,6,1500,41
"""

train_data = dsutils.load_blood()
test_df = pd.read_csv(StringIO(test_data))
experiment = make_experiment(train_data, test_data=test_df, target='Class', drift_detection=True)
estimator = experiment.run()
print(estimator)

Multicollinearity detection

To enable multicollinearity detection, set collinearity_detection=True, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', collinearity_detection=True)
estimator = experiment.run()
print(estimator)

Pseudo labeling

To enable pseudo labeling with two-stage searching, set pseudo_labeling=True, like this:

train_data=...
experiment = make_experiment(train_data, pseudo_labeling=True, ...)
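
Pseudo labeling derives labels for unlabeled test data, so test_data usually needs to be provided as well. A complete, runnable sketch (the unlabeled data below is a stand-in for illustration):

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
# Stand-in unlabeled data for illustration: the blood features without the target.
test_data = dsutils.load_blood().drop(columns=['Class'])
experiment = make_experiment(train_data, target='Class', test_data=test_data, pseudo_labeling=True)
estimator = experiment.run()
print(estimator)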

Permutation importance feature selection

To enable feature selection by permutation importance with two-stage searching, set feature_reselection=True, like this:

train_data=...
experiment = make_experiment(train_data, feature_reselection=True, ...)
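
A complete, runnable sketch with the blood dataset:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
# Enable the second search stage that re-selects features by permutation importance.
experiment = make_experiment(train_data, target='Class', feature_reselection=True)
estimator = experiment.run()
print(estimator)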

Ensemble

To change the number of estimators used in the ensemble, set ensemble_size to the expected number, or set ensemble_size=0 to disable ensembling, like this:

train_data = ...
experiment = make_experiment(train_data, ensemble_size=10, ...)
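
A complete, runnable sketch with the blood dataset:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
# Ensemble the top 10 models found during search; set ensemble_size=0 to disable.
experiment = make_experiment(train_data, target='Class', ensemble_size=10)
estimator = experiment.run()
print(estimator)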

Logging settings

To change the logging level, set log_level to a level defined in the Python logging module, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', log_level='INFO', verbose=1)
estimator = experiment.run()
print(estimator)

Output:

14:24:33 I tabular_toolbox.u._common.py 30 - 2 class detected, {0, 1}, so inferred as a [binary classification] task
14:24:33 I hypergbm.experiment.py 699 - create experiment with ['data_clean', 'drift_detection', 'space_search', 'final_ensemble']
14:24:33 I hypergbm.experiment.py 1262 - make_experiment with train data:(748, 4), test data:None, eval data:None, target:Class
14:24:33 I hypergbm.experiment.py 716 - fit_transform data_clean
14:24:33 I hypergbm.experiment.py 716 - fit_transform drift_detection
14:24:33 I hypergbm.experiment.py 716 - fit_transform space_search
14:24:33 I hypernets.c.meta_learner.py 22 - Initialize Meta Learner: dataset_id:7123e0d8c8bbbac8797ed9e42352dc59
14:24:33 I hypernets.c.callbacks.py 192 - 
Trial No:1
--------------------------------------------------------------
(0) estimator_options.hp_or:                                0
(1) numeric_imputer_0.strategy:                 most_frequent
(2) numeric_scaler_optional_0.hp_opt:                    True


...

14:24:35 I hypergbm.experiment.py 716 - fit_transform final_ensemble
14:24:35 I hypergbm.experiment.py 737 - trained experiment pipeline: ['data_clean', 'estimator']
Pipeline(steps=[('data_clean',
                 DataCleanStep(...)),
                ('estimator',
                 GreedyEnsemble(...))])

Process finished with exit code 0

Advanced Usage

Customize Searcher and Search Space

Users can customize the searcher and search space with the searcher and search_space parameters, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
from hypergbm.search_space import search_space_general


def my_search_space():
    return search_space_general(n_estimators=100)


train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', searcher='random', search_space=my_search_space)
estimator = experiment.run()
print(estimator)

Or like this:

from hypergbm import make_experiment
from hypergbm.search_space import search_space_general
from hypernets.searchers import MCTSSearcher
from tabular_toolbox.datasets import dsutils

my_searcher = MCTSSearcher(lambda: search_space_general(n_estimators=100),
                           max_node_space=20,
                           optimize_direction='max')

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', searcher=my_searcher)
estimator = experiment.run()
print(estimator)

Use CompeteExperiment

Users can create an experiment with the class CompeteExperiment for finer-grained control, like this:

from hypergbm import HyperGBM, CompeteExperiment
from hypergbm.search_space import search_space_general
from hypernets.core.callbacks import EarlyStoppingCallback, SummaryCallback
from hypernets.searchers import EvolutionSearcher
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()


def my_search_space():
    return search_space_general(early_stopping_rounds=10, verbose=0, cat_pipeline_mode='complex')


searcher = EvolutionSearcher(my_search_space,
                             optimize_direction='max', population_size=30, sample_size=10,
                             regularized=True, candidates_size=10)

es = EarlyStoppingCallback(time_limit=3600 * 3, mode='max')
hm = HyperGBM(searcher, reward_metric='auc', cache_dir='hypergbm_cache', clear_cache=True,
              callbacks=[es, SummaryCallback()])

X = train_data
y = train_data.pop('Class')
experiment = CompeteExperiment(hm, X, y, eval_size=0.2,
                               cv=True, pseudo_labeling=False,
                               max_trials=20, use_cache=True)
estimator = experiment.run()
print(estimator)

Distributed training with Dask

Quick Start

To run a HyperGBM experiment on a Dask cluster, users need to set up the default Dask client before calling make_experiment, like this:

from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment


def train():
    cluster = LocalCluster(processes=True)
    client = Client(cluster)

    train_data = '/opt/data/my_data.csv'

    experiment = make_experiment(train_data, target='...')
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()

Users can also use dask.dataframe to load the training data as a Dask DataFrame and create the experiment with it:

from dask import dataframe as dd
from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils


def train():
    cluster = LocalCluster(processes=False)
    client = Client(cluster)

    train_data = dd.from_pandas(dsutils.load_blood(), npartitions=1)

    experiment = make_experiment(train_data, target='Class')
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()

See the Dask documentation on Create DataFrames for more details.
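
For example, a CSV file can be loaded as a Dask DataFrame directly (a minimal sketch; the file path is a placeholder):

from dask import dataframe as dd

# Load a CSV file as a Dask DataFrame; pass the result to make_experiment as usual.
train_data = dd.read_csv('/path/to/mydata.csv')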

Customize Search Space

To run an experiment on a Dask cluster, all transformers and estimators must support Dask objects; refer to hypergbm.dask.search_space.search_space_general for details on customizing the search space.

from dask import dataframe as dd
from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment
from hypergbm.dask.search_space import search_space_general
from tabular_toolbox.datasets import dsutils


def my_search_space():
    return search_space_general(n_estimators=100)


def train():
    cluster = LocalCluster(processes=False)
    client = Client(cluster)

    train_data = dd.from_pandas(dsutils.load_blood(), npartitions=1)

    experiment = make_experiment(train_data, target='Class', searcher='mcts', search_space=my_search_space)
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()