Quick Start
The purpose of this guide is to illustrate some of the main features that HyperGBM provides. It assumes a basic working knowledge of machine learning practices (datasets, model fitting, predicting, cross-validation, etc.). Please refer to the installation instructions for installing HyperGBM. You can use HyperGBM through the Python API or the command-line tools.
Using the Python API
This section shows how to train a binary classification model with HyperGBM.
You can use the utilities provided by tabular_toolbox to load the Bank Marketing dataset:
>>> from tabular_toolbox.datasets import dsutils
>>> df = dsutils.load_bank()
>>> df[:3]
id age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
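As an optional sanity check (not part of the original walkthrough), you can inspect the class balance of the target column y with pandas before modeling:

>>> df['y'].value_counts()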
Then we split the data into a training set and a test set to train and evaluate the model:
>>> from sklearn.model_selection import train_test_split
>>> y = df.pop('y') # target col is "y"
>>> X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=9527)
HyperGBM provides a variety of search strategies. Here, the random search strategy is used to search the built-in search space:
>>> from hypernets.searchers import RandomSearcher
>>> from hypernets.core import OptimizeDirection
>>> from hypergbm.search_space import search_space_general
>>> rs = RandomSearcher(space_fn=search_space_general,
... optimize_direction=OptimizeDirection.Maximize)
>>> rs
<hypernets.searchers.random_searcher.RandomSearcher object at 0x10e5b9850>
The parameter space_fn is used to specify the search space. The metric AUC is used here, so optimize_direction=OptimizeDirection.Maximize is set, which means the larger the metric value, the better.
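Note that space_fn may be any callable that returns a search space. As a minimal, hypothetical sketch, you could wrap the built-in search_space_general in your own function before passing it to the searcher:

>>> def my_space_fn():
...     # delegate to the built-in general search space;
...     # a custom space could be assembled here instead
...     return search_space_general()
>>> rs = RandomSearcher(space_fn=my_space_fn,
...                     optimize_direction=OptimizeDirection.Maximize)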
Then use the Experiment API to train the model:
>>> from hypergbm import HyperGBM, CompeteExperiment
>>> hk = HyperGBM(rs, reward_metric='auc', cache_dir=f'hypergbm_cache', callbacks=[])
>>> experiment = CompeteExperiment(hk, X_train, y_train, X_test=X_test)
19:19:31 I hypergbm.experiment.py 714 - create experiment with ['data_clean', 'drift_detected', 'base_search_and_train']
>>> pipeline = experiment.run(use_cache=True, max_trials=2)
...
Trial No. Reward Elapsed Space Vector
0 1 0.994731 8.490173 [2, 0, 1, 2, 0, 0, 0]
1 2 0.983054 4.980630 [1, 2, 1, 2, 215, 3, 0, 0, 4, 3]
>>> pipeline
Pipeline(steps=[('data_clean',
DataCleanStep(data_cleaner_args={}, name='data_clean',
random_state=9527)),
('drift_detected', DriftDetectStep(name='drift_detected')),
('base_search_and_train',
BaseSearchAndTrainStep(name='base_search_and_train',
scorer=make_scorer(log_loss, greater_is_better=False, needs_proba=True))),
('estimator',
<tabular_toolbox.ensemble.voting.GreedyEnsemble object at 0x1a24ca00d0>)])
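The returned pipeline appears to follow the scikit-learn Pipeline interface (note the steps list in its repr); assuming so, you can list its steps by name:

>>> [name for name, step in pipeline.steps]
['data_clean', 'drift_detected', 'base_search_and_train', 'estimator']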
After the experiment finishes, let's evaluate the model:
>>> from sklearn import metrics
>>> y_proba = pipeline.predict_proba(X_test)
>>> metrics.roc_auc_score(y_test, y_proba[:, 1])
0.9956872713648863
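If you want to reuse the trained pipeline later without re-running the experiment, a minimal sketch (assuming the pipeline is picklable, as scikit-learn-style pipelines usually are; the file name pipeline.pkl is arbitrary) is:

>>> import pickle
>>> # persist the trained pipeline to disk
>>> with open('pipeline.pkl', 'wb') as f:
...     pickle.dump(pipeline, f)
>>> # reload it and predict as before
>>> with open('pipeline.pkl', 'rb') as f:
...     loaded_pipeline = pickle.load(f)
>>> y_proba = loaded_pipeline.predict_proba(X_test)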
Using command-line tools
HyperGBM also provides command-line tools to train models and make predictions. View the help doc:
hypergbm -h
usage: hypergbm [-h] --train_file TRAIN_FILE [--eval_file EVAL_FILE]
[--eval_size EVAL_SIZE] [--test_file TEST_FILE] --target
TARGET [--pos_label POS_LABEL] [--max_trials MAX_TRIALS]
[--model_output MODEL_OUTPUT]
[--prediction_output PREDICTION_OUTPUT] [--searcher SEARCHER]
...
Similarly, taking the Bank Marketing dataset as an example, we first split the data into a training set and a test set and generate CSV files for the command-line tools:
>>> from tabular_toolbox.datasets import dsutils
>>> from sklearn.model_selection import train_test_split
>>> df = dsutils.load_bank()
>>> df_train, df_test = train_test_split(df, test_size=0.3, random_state=9527)
>>> df_train.to_csv('bank_train.csv', index=None)
>>> df_test.to_csv('bank_test.csv', index=None)
The generated CSV files are used as parameters of the training command; then execute the command:
hypergbm --train_file=bank_train.csv --test_file=bank_test.csv --target=y --pos_label=yes --model_output=model.pkl --prediction_output=bank_predict.csv
...
Trial No. Reward Elapsed Space Vector
0 10 1.000000 64.206514 [0, 0, 1, 3, 2, 1, 2, 1, 2, 2, 3]
1 7 0.999990 2.433192 [1, 1, 1, 2, 215, 0, 2, 3, 0, 4]
2 4 0.999950 37.057761 [0, 3, 1, 0, 2, 1, 3, 1, 3, 4, 3]
3 9 0.967292 9.977973 [1, 0, 1, 1, 485, 2, 2, 5, 3, 0]
4 1 0.965844 4.304114 [1, 2, 1, 1, 60, 2, 2, 5, 0, 1]
After training, the model is persisted to the file model.pkl and the prediction results are saved to bank_predict.csv.
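If you want to work with these artifacts from Python afterwards, a minimal sketch (assuming model.pkl is a standard pickle file and bank_predict.csv a plain CSV, as the file names suggest) is:

>>> import pickle
>>> import pandas as pd
>>> # load the persisted model
>>> with open('model.pkl', 'rb') as f:
...     model = pickle.load(f)
>>> # inspect the predictions written by the command-line tool
>>> predictions = pd.read_csv('bank_predict.csv')
>>> predictions.head()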