DataCanvas

HyperGBM is an open source project created by DataCanvas.

Overview

What is HyperGBM

HyperGBM is a library that supports full-pipeline AutoML. It covers the end-to-end stages of data cleaning, preprocessing, feature generation and selection, model selection and hyperparameter optimization. It is a real AutoML tool for tabular data.

Unlike most AutoML approaches, which focus on the hyperparameter optimization problem of machine learning algorithms, HyperGBM puts the entire process from data cleaning to algorithm selection into one search space for optimization. End-to-end pipeline optimization is more like a sequential decision process, so HyperGBM uses reinforcement learning, Monte Carlo Tree Search and evolutionary algorithms, combined with a meta-learner, to solve such problems efficiently. As the name implies, the ML algorithms used in HyperGBM are all GBM models, more precisely gradient boosting tree models, which currently include XGBoost, LightGBM and CatBoost. The underlying search space representation and search algorithms in HyperGBM are powered by the Hypernets project, a general AutoML framework.

Main components

This section briefly covers the main components of HyperGBM, as shown below:

_images/hypergbm-main-components.png
  • HyperGBM(HyperModel)

    HyperGBM is a specific implementation of HyperModel (for HyperModel, please refer to the Hypernets project). It is the core interface of the HyperGBM project. Its search method explores the specified Search Space with the specified Searcher to find the best model.

  • Search Space

    Search spaces are constructed by arranging ModelSpace (transformers and estimators), ConnectionSpace (pipelines) and ParameterSpace (hyperparameters). The transformers are chained together by pipelines, and pipelines can be nested. The last node of a search space must be an estimator. Each transformer and estimator can define a set of hyperparameters.

_images/hypergbm-search-space.png

The code example of Numeric Pipeline is as follows:

import numpy as np
from hypergbm.pipeline import Pipeline
from hypergbm.sklearn.transformers import SimpleImputer, StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, LogStandardScaler
from hypernets.core.ops import ModuleChoice, Optional, Choice
from tabular_toolbox.column_selector import  column_number_exclude_timedelta


def numeric_pipeline_complex(impute_strategy=None, seq_no=0):
    if impute_strategy is None:
        impute_strategy = Choice(['mean', 'median', 'constant', 'most_frequent'])
    elif isinstance(impute_strategy, list):
        impute_strategy = Choice(impute_strategy)

    imputer = SimpleImputer(missing_values=np.nan, strategy=impute_strategy, name=f'numeric_imputer_{seq_no}',
                            force_output_as_float=True)
    scaler_options = ModuleChoice(
        [
            LogStandardScaler(name=f'numeric_log_standard_scaler_{seq_no}'),
            StandardScaler(name=f'numeric_standard_scaler_{seq_no}'),
            MinMaxScaler(name=f'numeric_minmax_scaler_{seq_no}'),
            MaxAbsScaler(name=f'numeric_maxabs_scaler_{seq_no}'),
            RobustScaler(name=f'numeric_robust_scaler_{seq_no}')
        ], name=f'numeric_or_scaler_{seq_no}'
    )
    scaler_optional = Optional(scaler_options, keep_link=True, name=f'numeric_scaler_optional_{seq_no}')
    pipeline = Pipeline([imputer, scaler_optional],
                        name=f'numeric_pipeline_complex_{seq_no}',
                        columns=column_number_exclude_timedelta)
    return pipeline
  • Searcher

    Searcher is an algorithm used to explore a search space. It encompasses the classical exploration-exploitation trade-off: on the one hand, it is desirable to find well-performing models quickly, while on the other hand, premature convergence to a region of suboptimal solutions should be avoided. Three algorithms are provided in HyperGBM: MCTSSearcher (Monte-Carlo tree search), EvolutionSearcher and RandomSearcher.

  • HyperGBMEstimator

    HyperGBMEstimator is an object built from a sample in the search space, including the full preprocessing pipeline and a GBM model. It can be used to fit on training data, evaluate on evaluation data, and predict on new data.

  • CompeteExperiment

    CompeteExperiment is a powerful tool provided in HyperGBM. It not only performs pipeline search, but also contains some advanced features to further improve the model performance such as data drift handling, pseudo-labeling, ensemble, etc.

Feature matrix

There are 3 training modes:

  • Standalone

  • Distributed on single machine

  • Distributed on multiple machines

Here is the feature matrix of the training modes. The matrix columns are #, Feature, Standalone, Distributed on single machine and Distributed on multiple machines; the features are grouped by category:

  • Feature engineering: Feature generation, Feature dimension reduction

  • Data clean: Correct data type, Special empty value handling, Id-ness features cleanup, Duplicate features cleanup, Empty label rows cleanup, Illegal values replacement, Constant features cleanup, Collinearity features cleanup

  • Data set split: Adversarial validation

  • Modeling algorithms: XGBoost, Catboost, LightGBM, HistGradientBoosting

  • Training: Task inference, Command-line tools

  • Evaluation strategies: Cross-validation, Train-Validation-Holdout

  • Search strategies: Monte Carlo Tree Search, Evolution, Random search

  • Class balancing: Class Weight, Under-Sampling (Near miss, Tomek links, Random), Over-Sampling (SMOTE, ADASYN, Random)

  • Early stopping strategies: max_no_improvement_trials, time_limit, expected_reward

  • Advanced features: Two stage search (Pseudo label, Feature selection), Concept drift handling, Ensemble

Installation

You can use pip or docker to install HyperGBM.

Using pip

HyperGBM requires Python 3.6 or above. Install it with pip:

pip install --upgrade pip setuptools # (optional)
pip install hypergbm

Install shap (optional)

HyperGBM provides model interpretation based on shap. If you need it, install shap by following this guide.

Using Docker

You can also use HyperGBM through our prebuilt Jupyter Docker image with the following command:

docker run -ti -e NotebookToken="your-token" -p 8888:8888 datacanvas/hypergbm:0.2.0

Then open http://<your-ip>:8888 in a browser. The token is the value you set for NotebookToken (“your-token” above); it can be empty if you did not specify one.

Quick Start

The purpose of this guide is to illustrate some of the main features that hypergbm provides. It assumes a basic working knowledge of machine learning practices (dataset, model fitting, predicting, cross-validation, etc.). Please refer to the installation instructions to install hypergbm. You can use hypergbm through the Python API and the command-line tools.

Using Python API

This section shows how to train a binary classification model using hypergbm.

You can use the utility provided by tabular_toolbox to read the Bank Marketing dataset:

>>> from tabular_toolbox.datasets import dsutils
>>> df = dsutils.load_bank()
>>> df[:3]
   id  age         job  marital  education default  balance housing loan   contact  day month  duration  campaign  pdays  previous poutcome   y
0   0   30  unemployed  married    primary      no     1787      no   no  cellular   19   oct        79         1     -1         0  unknown  no
1   1   33    services  married  secondary      no     4789     yes  yes  cellular   11   may       220         1    339         4  failure  no
2   2   35  management   single   tertiary      no     1350     yes   no  cellular   16   apr       185         1    330         1  failure  no

Then we split the data into training set and test set to train and evaluate the model:

>>> from sklearn.model_selection import train_test_split
>>> y = df.pop('y')  # target col is "y"
>>> X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=9527)

Hypergbm provides a variety of search strategies. Here, the random search strategy is used to train in the built-in search space:

>>> from hypernets.searchers import RandomSearcher
>>> from hypernets.core import OptimizeDirection
>>> from hypergbm.search_space import search_space_general
>>> rs = RandomSearcher(space_fn=search_space_general,
...                     optimize_direction=OptimizeDirection.Maximize)
>>> rs
<hypernets.searchers.random_searcher.RandomSearcher object at 0x10e5b9850>

The space_fn parameter specifies the search space. The AUC metric is used here, so optimize_direction=OptimizeDirection.Maximize is set, meaning that the larger the metric value, the better.

Then use the Experiment API to train the model:

>>> from hypergbm import HyperGBM, CompeteExperiment
>>> hk = HyperGBM(rs, reward_metric='auc', cache_dir=f'hypergbm_cache', callbacks=[])
>>> experiment = CompeteExperiment(hk, X_train, y_train, X_test=X_test)
19:19:31 I hypergbm.experiment.py 714 - create experiment with ['data_clean', 'drift_detected', 'base_search_and_train']
>>> pipeline = experiment.run(use_cache=True, max_trials=2)
...
   Trial No.    Reward   Elapsed                      Space Vector
0          1  0.994731  8.490173             [2, 0, 1, 2, 0, 0, 0]
1          2  0.983054  4.980630  [1, 2, 1, 2, 215, 3, 0, 0, 4, 3]
>>> pipeline
Pipeline(steps=[('data_clean',
                 DataCleanStep(data_cleaner_args={}, name='data_clean',
                               random_state=9527)),
                ('drift_detected', DriftDetectStep(name='drift_detected')),
                ('base_search_and_train',
                 BaseSearchAndTrainStep(name='base_search_and_train',
                                        scorer=make_scorer(log_loss, greater_is_better=False, needs_proba=True))),
                ('estimator',
                 <tabular_toolbox.ensemble.voting.GreedyEnsemble object at 0x1a24ca00d0>)])

After the training experiment, let’s evaluate the model:

>>> from sklearn import metrics
>>> y_proba = pipeline.predict_proba(X_test)
>>> metrics.roc_auc_score(y_test, y_proba[:, 1])
0.9956872713648863

Using command-line tools

HyperGBM also provides command-line tools to train models and predict on data. To view the help documentation:

hypergbm -h

usage: hypergbm [-h] --train_file TRAIN_FILE [--eval_file EVAL_FILE]
                [--eval_size EVAL_SIZE] [--test_file TEST_FILE] --target
                TARGET [--pos_label POS_LABEL] [--max_trials MAX_TRIALS]
                [--model_output MODEL_OUTPUT]
                [--prediction_output PREDICTION_OUTPUT] [--searcher SEARCHER]
...

Similarly, taking the Bank Marketing dataset as an example, we first split the data into a training set and a test set and generate the CSV files for the command-line tools:

>>> from tabular_toolbox.datasets import dsutils
>>> from sklearn.model_selection import train_test_split
>>> df = dsutils.load_bank()
>>> df_train, df_test = train_test_split(df, test_size=0.3, random_state=9527)
>>> df_train.to_csv('bank_train.csv', index=None)
>>> df_test.to_csv('bank_test.csv', index=None)

The generated CSV files are passed as parameters to the training command:

hypergbm --train_file=bank_train.csv --test_file=bank_test.csv --target=y --pos_label=yes --model_output=model.pkl --prediction_output=bank_predict.csv

...
   Trial No.    Reward    Elapsed                       Space Vector
0         10  1.000000  64.206514  [0, 0, 1, 3, 2, 1, 2, 1, 2, 2, 3]
1          7  0.999990   2.433192   [1, 1, 1, 2, 215, 0, 2, 3, 0, 4]
2          4  0.999950  37.057761  [0, 3, 1, 0, 2, 1, 3, 1, 3, 4, 3]
3          9  0.967292   9.977973   [1, 0, 1, 1, 485, 2, 2, 5, 3, 0]
4          1  0.965844   4.304114    [1, 2, 1, 1, 60, 2, 2, 5, 0, 1]

After the training, the model will be persisted to file model.pkl and the prediction results will be saved to bank_predict.csv.

HyperGBM

HyperGBM is a specific implementation of HyperModel (for HyperModel, please refer to the Hypernets project). It is the core interface of the HyperGBM project. Its search method explores the specified Search Space with the specified Searcher to find the best model.

Required Parameters

  • searcher: hypernets.searcher.Searcher, A Searcher instance, for example hypernets.searchers.RandomSearcher, hypernets.searchers.MCTSSearcher or hypernets.searchers.EvolutionSearcher.

Optional Parameters

  • dispatcher: hypernets.core.Dispatcher, Dispatcher is used to provide different execution modes for search trials, such as in-process mode (InProcessDispatcher), distributed parallel mode (DaskDispatcher), etc. InProcessDispatcher is used by default.

  • callbacks: list of callback functions or None, optional (default=None), List of callback functions that are applied at each trial. See hypernets.callbacks for more information.

  • reward_metric: str or None, optional (default=accuracy), The metric, chosen according to the task type, that guides the search direction of the searcher.

  • task: str or None, optional (default=None), Task type (binary, multiclass or regression). If None, the task type is inferred automatically.

  • data_cleaner_params: dict, (default=None), Dictionary of parameters to initialize the DataCleaner instance. If None, DataCleaner will be initialized with default values.

  • cache_dir: str or None, (default=None), Path of the data cache. If None, ‘working directory/tmp/cache’ is used as the cache directory.

  • clear_cache: bool, (default=True), Whether to clear the cache directory before searching.

Use case

# import HyperGBM, Search Space and Searcher
from hypergbm import HyperGBM
from hypergbm.search_space import search_space_general
from hypernets.searchers.random_searcher import RandomSearcher
import pandas as pd
from sklearn.model_selection import train_test_split

# instantiate related objects
searcher = RandomSearcher(search_space_general, optimize_direction='max')
hypergbm = HyperGBM(searcher, task='binary', reward_metric='accuracy')

# load data into Pandas DataFrame
df = pd.read_csv('[train_data_file]')
y = df.pop('target')

# split data into train set and eval set
# The evaluation set is used to evaluate the reward of the model fitted with the training set
X_train, X_eval, y_train, y_eval = train_test_split(df, y, test_size=0.3)

# search
hypergbm.search(X_train, y_train, X_eval, y_eval, max_trials=30)

# load best model
best_trial = hypergbm.get_best_trial()
estimator = hypergbm.load_estimator(best_trial.model_file)

# predict on real data
# (X_real is a placeholder for a DataFrame of new samples with the same feature columns)
pred = estimator.predict(X_real)

Searchers

MCTSSearcher

Monte-Carlo Tree Search (MCTS) extends the celebrated Multi-armed Bandit algorithm to tree-structured search spaces. The MCTS algorithm iterates over four phases: selection, expansion, playout and backpropagation.

  • Selection: In each node of the tree, the child node is selected according to a Multi-armed Bandit strategy, e.g. the UCT (Upper Confidence bound applied to Trees) algorithm.

  • Expansion: The algorithm adds one or more nodes to the tree. This node corresponds to the first encountered position that was not added in the tree.

  • Playout: When reaching the limits of the visited tree, a roll-out strategy is used to select the options until reaching a terminal node and computing the associated reward.

  • Backpropagation: The reward value is propagated back, i.e. it is used to update the value associated to all nodes along the visited path up to the root node.
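The selection and backpropagation phases can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the Hypernets implementation: the node dictionaries, the function names and the exploration constant `c` are assumptions made for the example, and the score is the standard UCB1 formula that UCT applies at each tree node.

```python
import math

def uct_score(reward_sum, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: mean reward plus an exploration bonus for rarely visited nodes."""
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    mean_reward = reward_sum / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_reward + exploration

def select_child(children, parent_visits):
    """Selection phase: pick the child with the highest UCT score."""
    return max(children,
               key=lambda ch: uct_score(ch["reward_sum"], ch["visits"], parent_visits))

def backpropagate(path, reward):
    """Backpropagation phase: update every node on the visited path."""
    for node in path:
        node["visits"] += 1
        node["reward_sum"] += reward

# Hypothetical children of one tree node.
children = [
    {"name": "a", "reward_sum": 9.0, "visits": 10},
    {"name": "b", "reward_sum": 1.6, "visits": 2},
    {"name": "c", "reward_sum": 0.0, "visits": 0},
]
first = select_child(children, parent_visits=12)
backpropagate([first], reward=0.0)
```

The exploration bonus shrinks as a child is visited more often, which is how the search balances trying new branches against refining promising ones.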

Code example

from hypernets.searchers import MCTSSearcher

searcher = MCTSSearcher(search_space_fn, use_meta_learner=False, max_node_space=10, candidates_size=10, optimize_direction='max')

Required Parameters

  • space_fn: callable, A search space function which when called returns a HyperSpace instance.

Optional Parameters

  • policy: hypernets.searchers.mcts_core.BasePolicy, (default=None), The policy for Selection and Backpropagation phases, UCT by default.

  • max_node_space: int, (default=10), Maximum space for node expansion

  • use_meta_learner: bool, (default=True), Meta-learner aims to evaluate the performance of unseen samples based on previously evaluated samples. It provides a practical solution to accurately estimate a search branch with many simulations without involving the actual training.

  • candidates_size: int, (default=10), The number of samples the meta-learner uses to evaluate candidate paths during roll-out

  • optimize_direction: ‘min’ or ‘max’, (default=’min’), Whether the search process is approaching the maximum or minimum reward value.

  • space_sample_validation_fn: callable or None, (default=None), Used to verify the validity of samples from the search space, and can be used to add specific constraint rules to the search space to reduce the size of the space.

EvolutionSearcher

An evolutionary algorithm (EA) is a subset of evolutionary computation, a generic population-based metaheuristic optimization algorithm. An EA uses mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the quality of the solutions (see also loss function). The population evolves through the repeated application of these operators.
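As a rough illustration of these mechanisms, the sketch below evolves a population over a toy search space using only selection and mutation (recombination is omitted for brevity). Everything in it is hypothetical: `SPACE`, `toy_reward` and `evolve` are invented for the example and are not part of Hypernets; the `population_size` and `sample_size` arguments merely mirror the roles described for EvolutionSearcher.

```python
import random

SPACE = {  # hypothetical toy search space, not a Hypernets HyperSpace
    "scaler": ["none", "standard", "minmax"],
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
}

def random_individual(rng):
    return {name: rng.choice(options) for name, options in SPACE.items()}

def mutate(parent, rng):
    """Mutation: copy the parent and re-draw one randomly chosen hyperparameter."""
    child = dict(parent)
    name = rng.choice(list(SPACE))
    child[name] = rng.choice(SPACE[name])
    return child

def toy_reward(ind):
    """Stand-in fitness; a real search would train and score a model here."""
    return (-abs(ind["max_depth"] - 7)
            - abs(ind["learning_rate"] - 0.1)
            + (ind["scaler"] == "standard"))

def evolve(population_size=20, sample_size=5, cycles=200, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    best = max(population, key=toy_reward)
    for _ in range(cycles):
        candidates = rng.sample(population, sample_size)     # selection
        parent = max(candidates, key=toy_reward)
        child = mutate(parent, rng)                          # reproduction + mutation
        population.append(child)
        population.remove(min(population, key=toy_reward))   # drop the weakest
        best = max(best, child, key=toy_reward)
    return best

best = evolve()
```

Because the best individual is only ever replaced by a better child, the reward of the returned individual never falls below that of the initial population.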

Code example

from hypernets.searchers import EvolutionSearcher

searcher = EvolutionSearcher(search_space_fn, population_size=20, sample_size=5, optimize_direction='min')

Required Parameters

  • space_fn: callable, A search space function which when called returns a HyperSpace instance

  • population_size: int, Size of population

  • sample_size: int, The number of parent candidates selected in each cycle of evolution

Optional Parameters

  • regularized: bool, (default=False), Whether to enable regularized evolution

  • use_meta_learner: bool, (default=True), Meta-learner aims to evaluate the performance of unseen samples based on previously evaluated samples. It provides a practical solution to accurately estimate a search branch with many simulations without involving the actual training.

  • candidates_size: int, (default=10), The number of samples the meta-learner uses to evaluate candidate paths during roll-out

  • optimize_direction: ‘min’ or ‘max’, (default=’min’), Whether the search process is approaching the maximum or minimum reward value.

  • space_sample_validation_fn: callable or None, (default=None), Used to verify the validity of samples from the search space, and can be used to add specific constraint rules to the search space to reduce the size of the space.

RandomSearcher

As its name suggests, Random Search uses random combinations of hyperparameters.

Code example

from hypernets.searchers import RandomSearcher

searcher = RandomSearcher(search_space_fn, optimize_direction='min')

Required Parameters

  • space_fn: callable, A search space function which when called returns a HyperSpace instance

Optional Parameters

  • optimize_direction: ‘min’ or ‘max’, (default=’min’), Whether the search process is approaching the maximum or minimum reward value.

  • space_sample_validation_fn: callable or None, (default=None), Used to verify the validity of samples from the search space, and can be used to add specific constraint rules to the search space to reduce the size of the space.
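To make the roles of optimize_direction and a sample validation function concrete, here is a minimal stand-alone sketch of random search. It does not use Hypernets: `sample_space`, `is_valid` and `random_search` are hypothetical names, and the toy space and constraint are assumptions chosen only for illustration.

```python
import random

def sample_space(rng):
    """Draw one random hyperparameter combination from a hypothetical toy space."""
    return {
        "booster": rng.choice(["gbtree", "dart"]),
        "max_depth": rng.choice([3, 5, 7, 9]),
        "n_estimators": rng.choice([100, 200, 300]),
    }

def is_valid(sample):
    """Example constraint: forbid the most expensive combination."""
    return not (sample["max_depth"] == 9 and sample["n_estimators"] == 300)

def random_search(reward_fn, max_trials, optimize_direction="min",
                  validation_fn=None, seed=0):
    rng = random.Random(seed)
    sign = 1 if optimize_direction == "max" else -1
    best_sample, best_reward = None, None
    trials = 0
    while trials < max_trials:
        sample = sample_space(rng)
        if validation_fn is not None and not validation_fn(sample):
            continue  # rejected samples do not consume a trial
        trials += 1
        reward = reward_fn(sample)
        if best_reward is None or sign * reward > sign * best_reward:
            best_sample, best_reward = sample, reward
    return best_sample, best_reward

# Toy reward: pretend that larger models score higher.
best, reward = random_search(lambda s: s["max_depth"] * s["n_estimators"],
                             max_trials=50, optimize_direction="max",
                             validation_fn=is_valid)
```

The validation function shrinks the effective search space by rejecting samples before any training cost is paid.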

Search Space

Built-in Search Space

Code example

from hypergbm.search_space import search_space_general

searcher = RandomSearcher(search_space_general, optimize_direction='min')
# or 
searcher = RandomSearcher(lambda: search_space_general(n_estimators=300, early_stopping_rounds=10, verbose=0), optimize_direction='min')

Custom Search Space

Code example


CompeteExperiment

There are still many challenges in the machine learning modeling process for tabular data, such as imbalanced data, data drift and poor generalization ability. These challenges cannot be completely solved by pipeline search, so HyperGBM introduces a more powerful tool, CompeteExperiment.

CompeteExperiment is composed of a series of steps, and pipeline search is just one of them. It also includes advanced steps such as data cleaning, data drift handling, two-stage search and ensemble, as shown in the figure below:

_images/hypergbm-competeexperiment.png

Code example

from hypergbm import make_experiment
from hypergbm.search_space import search_space_general
import pandas as pd
import logging

# load data into Pandas DataFrame
df = pd.read_csv('[train_data_file]')
target = 'target'

#create an experiment
experiment = make_experiment(df, target=target, 
                 search_space=lambda: search_space_general(class_balancing='SMOTE',n_estimators=300, early_stopping_rounds=10, verbose=0),
                 collinearity_detection=False,
                 drift_detection=True,
                 feature_reselection=False,
                 feature_reselection_estimator_size=10,
                 feature_reselection_threshold=1e-5,
                 ensemble_size=20,
                 pseudo_labeling=False,
                 pseudo_labeling_proba_threshold=0.8,
                 pseudo_labeling_resplit=False,
                 retrain_on_wholedata=False,
                 log_level=logging.ERROR,)

#run experiment
estimator = experiment.run()

# predict on real data
pred = estimator.predict(X_real)

Required Parameters

  • hyper_model: hypergbm.HyperGBM, A HyperGBM instance

  • X_train: Pandas or Dask DataFrame, Feature data for training

  • y_train: Pandas or Dask Series, Target values for training

Optional Parameters

  • X_eval: (Pandas or Dask DataFrame) or None, (default=None), Feature data for evaluation

  • y_eval: (Pandas or Dask Series) or None, (default=None), Target values for evaluation

  • X_test: (Pandas or Dask DataFrame) or None, (default=None), Unseen data without target values for semi-supervised learning

  • eval_size: float or int, (default=None), Only valid when X_eval or y_eval is None. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the eval split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size.

  • train_test_split_strategy: ‘adversarial_validation’ or None, (default=None), Only valid when X_eval or y_eval is None. If None, use eval_size to split the dataset, otherwise use adversarial validation approach.

  • cv: bool, (default=False), If True, use cross-validation instead of evaluation set reward to guide the search process

  • num_folds: int, (default=3), Number of cross-validated folds, only valid when cv is true

  • task: str or None, optional (default=None), Task type (binary, multiclass or regression). If None, the task type is inferred automatically.

  • callbacks: list of callback functions or None, (default=None), List of callback functions that are applied at each experiment step. See hypernets.experiment.ExperimentCallback for more information.

  • random_state: int or RandomState instance, (default=9527), Controls the shuffling applied to the data before applying the split.

  • scorer: str, callable or None, (default=None), Scorer used for feature importance evaluation and ensemble. It can be a single string (see get_scorer) or a callable (see make_scorer). If None, an exception will be raised.

  • data_cleaner_args: dict, (default=None), Dictionary of parameters to initialize the DataCleaner instance. If None, DataCleaner will be initialized with default values.

  • collinearity_detection: bool, (default=False), Whether to remove multicollinear features

  • drift_detection: bool,(default=True), Whether to enable data drift detection and processing. Only valid when X_test is provided. Concept drift in the input data is one of the main challenges. Over time, it will worsen the performance of model on new data. We introduce an adversarial validation approach to concept drift problems in HyperGBM. This approach will detect concept drift and identify the drifted features and process them automatically.

  • feature_reselection: bool, (default=True), Whether to enable two stage feature selection and searching

  • feature_reselection_estimator_size: int, (default=10), The number of estimators used to evaluate feature importance. Only valid when feature_reselection is True.

  • feature_reselection_threshold: float, (default=1e-5), The threshold for feature selection. Features with importance below the threshold will be dropped. Only valid when feature_reselection is True.

  • ensemble_size: int, (default=20), The number of estimators to ensemble. During the AutoML process, many models are generated with different preprocessing pipelines, different algorithms and different hyperparameters. Ensembling some of the models that perform well usually generalizes better than selecting only the single best model.

  • pseudo_labeling: bool, (default=False), Whether to enable pseudo labeling. Pseudo labeling is a semi-supervised learning technique: instead of manually labeling the unlabelled data, approximate labels are assigned on the basis of the labelled data. Pseudo-labeling can sometimes improve the generalization capabilities of the model.

  • pseudo_labeling_proba_threshold: float, (default=0.8), Confidence threshold of pseudo-label samples. Only valid when pseudo_labeling is True.

  • pseudo_labeling_resplit: bool, (default=False), Whether to re-split the training set and evaluation set after adding pseudo-labeled data. If False, the pseudo-labeled data is only appended to the training set. Only valid when pseudo_labeling is True.

  • retrain_on_wholedata: bool, (default=False), Whether to retrain the model with whole data after the search is completed.

  • log_level: int or None, (default=None), Level of logging, possible values:[logging.CRITICAL, logging.FATAL, logging.ERROR, logging.WARNING, logging.WARN, logging.INFO, logging.DEBUG, logging.NOTSET]

Examples

Experiment Examples

Basic Usages

This chapter shows how to train models with a HyperGBM experiment. The following examples use the blood dataset, in which Class is the target feature.

Recency,Frequency,Monetary,Time,Class
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
2,20,5000,45,1
1,24,6000,77,0
4,4,1000,4,0

...
Use experiment with default settings

Users can create an experiment instance with the Python utility make_experiment and run it quickly. train_data is the only required parameter; all others are optional. target is also required if the name of your target feature isn’t y.

Code:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class')
estimator = experiment.run()
print(estimator)

Outputs:

Pipeline(steps=[('data_clean',
                 DataCleanStep(...),
                ('estimator',
                 GreedyEnsemble(...)])

Process finished with exit code 0

As the console output shows, the trained model is a pipeline object whose estimator is an ensemble of several other models.

If your training data is a .csv or .parquet file, you can call make_experiment with the file path directly, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = '/path/to/mydata.csv'
experiment = make_experiment(train_data, target='my_target')
estimator = experiment.run()
print(estimator)
Cross Validation

make_experiment enables cross validation by default; users can disable it by setting cv=False. Users can change the number of folds with num_folds, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', cv=True, num_folds=5)
estimator = experiment.run()
print(estimator)
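For readers unfamiliar with how a cross-validated reward is formed, the following plain-Python sketch splits row indices into folds and averages a per-fold score. It is a conceptual illustration under stated assumptions, not HyperGBM's internal code: `k_fold_indices`, `cross_val_reward` and the `score_fold` callback are hypothetical names.

```python
import random

def k_fold_indices(n_samples, num_folds=3, seed=9527):
    """Yield (train_idx, valid_idx) pairs so that every row is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, valid_idx

def cross_val_reward(score_fold, n_samples, num_folds=3):
    """Average a per-fold score; score_fold(train_idx, valid_idx) is supplied by the caller."""
    scores = [score_fold(tr, va) for tr, va in k_fold_indices(n_samples, num_folds)]
    return sum(scores) / len(scores)

# The blood dataset used above has 748 rows; split them into 5 folds.
splits = list(k_fold_indices(748, num_folds=5))
```

Each trial in a search would then be scored by fitting on `train_idx` and evaluating on `valid_idx` for every fold, with the mean used as the reward.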
Set up evaluation data (eval_data)

If cross validation is disabled, the experiment splits evaluation data from train_data by default; users can supply their own with eval_data, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
from sklearn.model_selection import train_test_split

train_data = dsutils.load_blood()
train_data,eval_data=train_test_split(train_data,test_size=0.3)
experiment = make_experiment(train_data, target='Class', eval_data=eval_data, cv=False)
estimator = experiment.run()
print(estimator)

If eval_data is None and cv is False, the experiment splits evaluation data from train_data; users can change the size of the evaluation data with eval_size, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', cv=False, eval_size=0.2)
estimator = experiment.run()
print(estimator)
Set the search reward metric

The default search reward metric is accuracy; users can change it with reward_metric, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', reward_metric='auc')
estimator = experiment.run()
print(estimator)
Change the number of search trials and set up early stopping

Users can limit the number of search trials with max_trials, and set up search early stopping with early_stopping_round, early_stopping_time_limit and early_stopping_reward, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', max_trials=30, early_stopping_time_limit=3600 * 3)
estimator = experiment.run()
print(estimator)
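The three stopping criteria (no improvement for a number of trials, a time limit, and an expected reward) can be combined as in the sketch below. This is a hypothetical helper written for illustration; the `EarlyStopper` class and its method names are assumptions and do not reproduce the Hypernets early-stopping callback.

```python
import time

class EarlyStopper:
    """Hypothetical helper combining three stopping criteria; not the Hypernets class."""

    def __init__(self, max_no_improvement_trials=None, time_limit=None,
                 expected_reward=None, maximize=True):
        self.max_no_improvement_trials = max_no_improvement_trials
        self.time_limit = time_limit            # seconds
        self.expected_reward = expected_reward
        self.maximize = maximize
        self.best_reward = None
        self.no_improvement = 0
        self.start = time.monotonic()

    def _better(self, a, b):
        return a > b if self.maximize else a < b

    def should_stop(self, reward):
        """Record one trial's reward and report whether the search should stop."""
        if self.best_reward is None or self._better(reward, self.best_reward):
            self.best_reward, self.no_improvement = reward, 0
        else:
            self.no_improvement += 1
        if (self.max_no_improvement_trials is not None
                and self.no_improvement >= self.max_no_improvement_trials):
            return True
        if (self.time_limit is not None
                and time.monotonic() - self.start >= self.time_limit):
            return True
        if self.expected_reward is not None:
            if self.maximize:
                return reward >= self.expected_reward
            return reward <= self.expected_reward
        return False

stopper = EarlyStopper(max_no_improvement_trials=2)
decisions = [stopper.should_stop(r) for r in [0.70, 0.80, 0.75, 0.79]]
```

In this run the search stops on the fourth trial, because two consecutive trials failed to beat the best reward of 0.80.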
Drift detection

To enable feature drift detection, set drift_detection=True and pass the testing data as test_data, like this:

from io import StringIO
import pandas as pd
from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

test_data = """
Recency,Frequency,Monetary,Time
2,10,2500,64
4,5,1250,23
4,9,2250,46
4,5,1250,23
4,8,2000,40
2,12,3000,82
11,24,6000,64
2,7,1750,46
4,11,2750,61
1,7,1750,57
2,11,2750,79
2,3,750,16
4,5,1250,26
2,6,1500,41
"""

train_data = dsutils.load_blood()
test_df = pd.read_csv(StringIO(test_data))
experiment = make_experiment(train_data, test_data=test_df, target='Class', drift_detection=True)
estimator = experiment.run()
print(estimator)
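The idea behind drift detection is to ask how easily train rows can be told apart from test rows. HyperGBM does this with adversarial validation, i.e. by training a classifier on that task. The sketch below is a deliberately simplified, per-feature stand-in that needs no model: it computes, for each column, the AUC obtained when the raw feature value is used as the score. The function names and the 0.7 threshold are assumptions made for this example.

```python
def auc_train_vs_test(train_values, test_values):
    """Probability that a random test value exceeds a random train value (ties count half)."""
    wins = 0.0
    for t in test_values:
        for s in train_values:
            if t > s:
                wins += 1.0
            elif t == s:
                wins += 0.5
    return wins / (len(train_values) * len(test_values))

def drifted_features(train_rows, test_rows, threshold=0.7):
    """Flag columns whose values alone separate train rows from test rows."""
    flagged = {}
    for name in test_rows[0]:
        auc = auc_train_vs_test([r[name] for r in train_rows],
                                [r[name] for r in test_rows])
        separability = max(auc, 1.0 - auc)  # the direction of the shift does not matter
        if separability >= threshold:
            flagged[name] = round(separability, 3)
    return flagged

# Hypothetical rows: "Time" is shifted in the test data, "Recency" is not.
train_rows = [{"Recency": 2, "Time": 10}, {"Recency": 4, "Time": 12},
              {"Recency": 1, "Time": 11}, {"Recency": 3, "Time": 13}]
test_rows = [{"Recency": 2, "Time": 40}, {"Recency": 3, "Time": 45},
             {"Recency": 4, "Time": 50}]
flagged = drifted_features(train_rows, test_rows)
```

A value near 0.5 means the two sets are indistinguishable on that feature, while a value near 1.0 indicates a strong shift. A classifier-based check additionally captures drift that only shows up in combinations of features.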

Multicollinearity detection

To enable multicollinearity detection, set collinearity_detection=True, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', collinearity_detection=True)
estimator = experiment.run()
print(estimator)
Pseudo labeling

To enable pseudo labeling with two stage searching, set pseudo_labeling=True, like this:

train_data=...
experiment = make_experiment(train_data, pseudo_labeling=True, ...)
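Conceptually, pseudo labeling keeps only the unlabeled rows on which the first-stage model is confident and adds them, with their predicted labels, to the training data. A minimal sketch for the binary case follows; `select_pseudo_labels` is a hypothetical helper, and the 0.8 threshold mirrors the documented default of pseudo_labeling_proba_threshold.

```python
def select_pseudo_labels(probabilities, proba_threshold=0.8):
    """Return (row_index, label) pairs for confident binary predictions.

    probabilities holds P(class == 1) for each unlabeled row.
    """
    selected = []
    for i, p in enumerate(probabilities):
        if p >= proba_threshold:
            selected.append((i, 1))        # confidently positive
        elif 1.0 - p >= proba_threshold:
            selected.append((i, 0))        # confidently negative
    return selected

# Predicted probabilities for five unlabeled rows (made-up values).
pseudo = select_pseudo_labels([0.95, 0.50, 0.10, 0.81, 0.25])
```

Rows with probabilities between the two confidence bands are left out, so only confident predictions are used in the second search stage.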
Permutation importance feature selection

To enable feature selection by permutation importance with two stage searching, set feature_reselection=True, like this:

train_data=...
experiment = make_experiment(train_data, feature_reselection=True, ...)
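Permutation importance measures how much a model's score drops when the values of one feature are shuffled. Features whose importance falls below a threshold can then be dropped before the second search stage. The sketch below shows the idea with plain Python; `permutation_importance` here is a hypothetical helper (not the scikit-learn or HyperGBM function), and the toy model and data are assumptions.

```python
import random

def permutation_importance(predict, rows, y_true, feature, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    def accuracy(data):
        return sum(predict(r) == y for r, y in zip(data, y_true)) / len(y_true)

    rng = random.Random(seed)
    baseline = accuracy(rows)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(permuted))
    return sum(drops) / n_repeats

# Toy data: the label depends on "a" only, so "b" carries no information.
rows = [{"a": i, "b": i % 3} for i in range(20)]
y_true = [int(r["a"] >= 10) for r in rows]

def toy_model(row):
    return int(row["a"] >= 10)

importances = {f: permutation_importance(toy_model, rows, y_true, f) for f in ("a", "b")}
kept = [f for f, imp in importances.items() if imp >= 1e-5]
```

Here "b" receives an importance of zero and is dropped, while "a" is kept.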
Ensemble

To change the number of estimators in the ensemble, set ensemble_size to the desired number, or set ensemble_size=0 to disable ensembling.

train_data = ...
experiment = make_experiment(train_data, ensemble_size=10, ...)
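The estimator in the outputs shown earlier is named GreedyEnsemble. As an assumption about what such a step does, the sketch below illustrates the well-known greedy ensemble selection idea: models are added one at a time, with replacement, choosing at each step the model whose inclusion gives the best score for the averaged prediction. It is not HyperGBM's implementation; `greedy_ensemble`, `accuracy` and the probabilities are made up for the example.

```python
def accuracy(y_true, proba):
    """Share of rows where thresholding the probability at 0.5 matches the label."""
    return sum((p >= 0.5) == bool(y) for y, p in zip(y_true, proba)) / len(y_true)

def greedy_ensemble(model_probas, y_true, ensemble_size=3):
    """Pick models one at a time (with replacement) to maximize accuracy of the averaged prediction."""
    chosen = []
    total = [0.0] * len(y_true)
    best_score = None
    for _ in range(ensemble_size):
        best_name, best_score = None, -1.0
        for name, proba in model_probas.items():
            avg = [(t + p) / (len(chosen) + 1) for t, p in zip(total, proba)]
            score = accuracy(y_true, avg)
            if score > best_score:
                best_name, best_score = name, score
        chosen.append(best_name)
        total = [t + p for t, p in zip(total, model_probas[best_name])]
    return chosen, best_score

# Validation-set probabilities from three hypothetical models.
y_true = [1, 0, 1, 0]
model_probas = {
    "m1": [0.9, 0.6, 0.4, 0.1],
    "m2": [0.6, 0.2, 0.7, 0.6],
    "m3": [0.3, 0.1, 0.9, 0.2],
}
chosen, score = greedy_ensemble(model_probas, y_true, ensemble_size=2)
```

In this toy case no single model classifies every row correctly, but averaging m2 and m1 does, which is the effect the ensemble step aims for.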
Logging settings

To change the logging level, set log_level to a log level defined in the Python logging module.

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()
experiment = make_experiment(train_data, target='Class', log_level='INFO', verbose=1)
estimator = experiment.run()
print(estimator)

Outputs:

14:24:33 I tabular_toolbox.u._common.py 30 - 2 class detected, {0, 1}, so inferred as a [binary classification] task
14:24:33 I hypergbm.experiment.py 699 - create experiment with ['data_clean', 'drift_detection', 'space_search', 'final_ensemble']
14:24:33 I hypergbm.experiment.py 1262 - make_experiment with train data:(748, 4), test data:None, eval data:None, target:Class
14:24:33 I hypergbm.experiment.py 716 - fit_transform data_clean
14:24:33 I hypergbm.experiment.py 716 - fit_transform drift_detection
14:24:33 I hypergbm.experiment.py 716 - fit_transform space_search
14:24:33 I hypernets.c.meta_learner.py 22 - Initialize Meta Learner: dataset_id:7123e0d8c8bbbac8797ed9e42352dc59
14:24:33 I hypernets.c.callbacks.py 192 - 
Trial No:1
--------------------------------------------------------------
(0) estimator_options.hp_or:                                0
(1) numeric_imputer_0.strategy:                 most_frequent
(2) numeric_scaler_optional_0.hp_opt:                    True


...

14:24:35 I hypergbm.experiment.py 716 - fit_transform final_ensemble
14:24:35 I hypergbm.experiment.py 737 - trained experiment pipeline: ['data_clean', 'estimator']
Pipeline(steps=[('data_clean',
                 DataCleanStep(...),
                ('estimator',
                 GreedyEnsemble(...)

Process finished with exit code 0
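
For reference, the standard level names defined by Python's logging module can be listed as in the sketch below. The example above passes the level as a string ('INFO'); whether log_level also accepts the numeric constants is an assumption that is not confirmed here.

```python
import logging

# Standard level names and their numeric values in Python's logging module.
level_names = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
levels = {name: getattr(logging, name) for name in level_names}

for name, value in levels.items():
    print(f"{name:<8} {value}")
```

Lower numeric values are more verbose, so 'DEBUG' prints the most and 'CRITICAL' the least.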

Advanced Usages

Customize Searcher and Search Space

Users can customize the searcher and the search space with the searcher and search_space arguments, like this:

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils
from hypergbm.search_space import search_space_general


def my_search_space():
    return search_space_general(n_esitimators=100)


train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', searcher='random', search_space=my_search_space)
estimator = experiment.run()
print(estimator)

Or like this:

from hypergbm import make_experiment
from hypergbm.search_space import search_space_general
from hypernets.searchers import MCTSSearcher
from tabular_toolbox.datasets import dsutils

my_searcher = MCTSSearcher(lambda: search_space_general(n_esitimators=100),
                           max_node_space=20,
                           optimize_direction='max')

train_data = dsutils.load_blood()

experiment = make_experiment(train_data, target='Class', searcher=my_searcher)
estimator = experiment.run()
print(estimator)
Use CompeteExperiment

Users can create an experiment directly with the CompeteExperiment class to control more of the details.

from hypergbm import HyperGBM, CompeteExperiment
from hypergbm.search_space import search_space_general
from hypernets.core.callbacks import EarlyStoppingCallback, SummaryCallback
from hypernets.searchers import EvolutionSearcher
from tabular_toolbox.datasets import dsutils

train_data = dsutils.load_blood()


def my_search_space():
    return search_space_general(early_stopping_rounds=10, verbose=0, cat_pipeline_mode='complex')


searcher = EvolutionSearcher(my_search_space,
                             optimize_direction='max', population_size=30, sample_size=10,
                             regularized=True, candidates_size=10)

es = EarlyStoppingCallback(time_limit=3600 * 3, mode='max')
hm = HyperGBM(searcher, reward_metric='auc', cache_dir=f'hypergbm_cache', clear_cache=True,
              callbacks=[es, SummaryCallback()])

X = train_data
y = train_data.pop('Class')
experiment = CompeteExperiment(hm, X, y, eval_size=0.2,
                               cv=True, pseudo_labeling=False,
                               max_trials=20, use_cache=True)
estimator = experiment.run()
print(estimator)

Distribution with Dask

Quick Start

To run a HyperGBM experiment on a Dask cluster, users need to set up the default Dask client before calling make_experiment, like this:

from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils


def train():
    cluster = LocalCluster(processes=True)
    client = Client(cluster)

    train_data = '/opt/data/my_data.csv'

    experiment = make_experiment(train_data, target='...')
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()

Users can also load the training data set as a Dask DataFrame with dask.dataframe and use it to create the experiment:

from dask import dataframe as dd
from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment
from tabular_toolbox.datasets import dsutils


def train():
    cluster = LocalCluster(processes=False)
    client = Client(cluster)

    train_data = dd.from_pandas(dsutils.load_blood(), npartitions=1)

    experiment = make_experiment(train_data, target='Class')
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()

Refer to Create DataFrames in the Dask documentation for more details.

Customize Search Space

To run an experiment on a Dask cluster, all transformers and estimators must support Dask objects. Refer to hypergbm.dask.search_space.search_space_general for details on customizing the search space.

from dask import dataframe as dd
from dask.distributed import LocalCluster, Client

from hypergbm import make_experiment
from hypergbm.dask.search_space import search_space_general
from tabular_toolbox.datasets import dsutils


def my_search_space():
    return search_space_general(n_esitimators=100)


def train():
    cluster = LocalCluster(processes=False)
    client = Client(cluster)

    train_data = dd.from_pandas(dsutils.load_blood(), npartitions=1)

    experiment = make_experiment(train_data, target='Class', searcher='mcts', search_space=my_search_space)
    estimator = experiment.run()
    print(estimator)


if __name__ == '__main__':
    train()

Using HyperGBM

Basic examples

Cross-validation

HyperGBM supports cross-validation for model evaluation. Specify cv=True to enable it, and use the num_folds parameter to set the number of folds:

...
hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
hk.search(X_train, y_train, X_eval=None, y_eval=None, cv=True, num_folds=3)  # 3 folds
...

With cross-validation, the evaluation data is a fold of X_train and y_train, so set X_eval=None and y_eval=None. Here is an example:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> 
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> 
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM
>>> from hypernets.searchers import MCTSSearcher
>>> 
>>> rs = MCTSSearcher(search_space_general, max_node_space=10, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> hk.search(X_train, y_train, X_eval=None, y_eval=None, cv=True, num_folds=3)  # using Cross Validation
   Trial No.    Reward   Elapsed                       Space Vector
0          4  0.941667  0.331012   [1, 3, 1, 1, 370, 3, 2, 3, 4, 0]
1          7  0.933333  0.290077  [0, 0, 1, 0, 3, 1, 1, 2, 1, 2, 3]
2          1  0.925000  0.472835     [0, 0, 0, 3, 0, 1, 0, 2, 0, 4]
3          3  0.925000  0.422006     [0, 1, 0, 1, 1, 1, 1, 0, 0, 1]
4          8  0.925000  0.228165     [0, 1, 0, 3, 2, 0, 2, 0, 2, 0]
>>> estimator = hk.load_estimator(hk.get_best_trial().model_file)
>>> 
>>> estimator.cv_gbm_models_
[LGBMClassifierWrapper(boosting_type='dart', learning_rate=0.5, max_depth=10,
                      n_estimators=200, num_leaves=370, reg_alpha=1,
                      reg_lambda=1), LGBMClassifierWrapper(boosting_type='dart', learning_rate=0.5, max_depth=10,
                      n_estimators=200, num_leaves=370, reg_alpha=1,
                      reg_lambda=1), LGBMClassifierWrapper(boosting_type='dart', learning_rate=0.5, max_depth=10,
                      n_estimators=200, num_leaves=370, reg_alpha=1,
                      reg_lambda=1)]
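
To make the fold mechanics concrete, the following sketch splits sample indices into num_folds contiguous folds so that every sample is used for evaluation exactly once. It is a plain-Python illustration only: the helper name is hypothetical, and HyperGBM's own splitter may shuffle or stratify the data.

```python
def k_fold_indices(n_samples, num_folds):
    """Yield (train_idx, eval_idx) pairs; each sample is evaluated exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // num_folds + (1 if i < n_samples % num_folds else 0)
                  for i in range(num_folds)]
    start = 0
    for size in fold_sizes:
        eval_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, eval_idx
        start += size


folds = list(k_fold_indices(n_samples=10, num_folds=3))
for train_idx, eval_idx in folds:
    print(len(train_idx), eval_idx)
```

One model is trained per fold, which is consistent with the three entries in cv_gbm_models_ shown above for num_folds=3.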
Search strategies

HyperGBM provides the following search strategies (implementation class in parentheses):

  • Evolution search (hypernets.searchers.evolution_searcher.EvolutionSearcher)

  • Monte Carlo Tree Search (hypernets.searchers.mcts_searcher.MCTSSearcher)

  • Random search (hypernets.searchers.random_searcher.RandomSearcher)

Here is an example that uses the evolution search strategy:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> 
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> 
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> rs = EvolutionSearcher(search_space_general,  200, 100, optimize_direction='max')  # using EvolutionSearcher
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> hk.search(X_train, y_train, X_eval=X_test, y_eval=y_test)
   Trial No.  Reward   Elapsed                      Space Vector
0          1     1.0  0.187103     [1, 2, 0, 1, 160, 3, 0, 1, 2]
1          2     1.0  0.358584             [2, 3, 1, 3, 2, 0, 0]
2          3     1.0  0.127980  [1, 1, 1, 0, 125, 0, 0, 3, 3, 0]
3          4     1.0  0.084272     [1, 1, 0, 2, 115, 1, 2, 3, 0]
4          7     1.0  0.152720     [1, 0, 0, 1, 215, 3, 3, 1, 2]
>>> estimator = hk.load_estimator(hk.get_best_trial().model_file)
>>> y_pred = estimator.predict(X_test)
>>> 
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test, y_pred)
1.0
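
The idea behind an evolution-style search can be sketched on a toy search space. The code below follows the general pattern of regularized (aging) evolution: fill a population with random candidates, then repeatedly pick the best of a random sample, mutate it, and retire the oldest individual. The toy space, the reward, and all function names are assumptions for illustration; this is not the EvolutionSearcher implementation.

```python
import random
from collections import deque

TARGET = [3, 1, 2, 0]  # hypothetical optimum of the toy search space


def sample(rng):
    """Draw a random candidate: four choices, each in 0..3."""
    return [rng.randrange(4) for _ in TARGET]


def mutate(parent, rng):
    """Copy the parent and re-draw one randomly chosen position."""
    child = list(parent)
    child[rng.randrange(len(child))] = rng.randrange(4)
    return child


def reward(candidate):
    """Number of positions that match the target (higher is better)."""
    return sum(c == t for c, t in zip(candidate, TARGET))


def evolution_search(population_size, sample_size, max_trials, seed=0):
    rng = random.Random(seed)
    population = deque()  # oldest individual sits at the left end
    history = []          # every (reward, candidate) evaluated so far
    while len(history) < max_trials:
        if len(population) < population_size:
            candidate = sample(rng)  # warm-up phase: random candidates
        else:
            contestants = rng.sample(list(population), sample_size)
            parent = max(contestants, key=lambda item: item[0])[1]
            candidate = mutate(parent, rng)
            population.popleft()     # aging: retire the oldest individual
        item = (reward(candidate), candidate)
        population.append(item)
        history.append(item)
    return max(history, key=lambda item: item[0]), history, population


best, history, population = evolution_search(population_size=10, sample_size=3, max_trials=50)
print(best)
```

Retiring the oldest individual rather than the worst one keeps the population moving and reduces the chance that one lucky early candidate dominates the search.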
Early stopping

When the model's performance can no longer be improved, or when certain conditions are met, the search can be terminated early to release computing resources. The currently supported strategies are:

  • max_no_improvement_trials

  • time_limit

  • expected_reward

When multiple conditions are set, the search stops as soon as any one of them is reached. The early stopping strategy is implemented by the class hypernets.core.callbacks.EarlyStoppingCallback.

Here is an example in which the search stops once the reward reaches 0.95 or above:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from hypernets.core.callbacks import EarlyStoppingCallback
>>> 
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> rs = EvolutionSearcher(search_space_general,  200, 100, optimize_direction='max')
>>> es = EarlyStoppingCallback(expected_reward=0.95, mode='max')  # Parameter `mode` is the direction of parameter `expected_reward` optimization, the reward metric is accuracy, so set mode to `max`
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[es])
>>> hk.search(X_train, y_train, X_eval=X_test, y_eval=y_test)

Early stopping on trial : 1, best reward: None, best_trial: None
   Trial No.  Reward   Elapsed                       Space Vector
0          1     1.0  0.189758  [0, 1, 1, 3, 2, 1, 1, 2, 3, 0, 0]
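
The "stop when any condition is met" logic can be sketched as a small class. The class below is a hypothetical stand-in written for this example, not the EarlyStoppingCallback source; it reuses the three condition names from the list above and takes an injectable clock so the time limit can be tested.

```python
import time


class EarlyStopper:
    """Stops a search when any configured condition is met (illustrative only)."""

    def __init__(self, max_no_improvement_trials=None, time_limit=None,
                 expected_reward=None, mode="max", clock=time.monotonic):
        self.max_no_improvement_trials = max_no_improvement_trials
        self.time_limit = time_limit
        self.expected_reward = expected_reward
        self.sign = 1 if mode == "max" else -1  # flip rewards when minimizing
        self.clock = clock
        self.start = clock()
        self.best = None
        self.stale_trials = 0

    def should_stop(self, reward):
        """Record one finished trial and report whether the search should end."""
        score = self.sign * reward
        if self.best is None or score > self.best:
            self.best, self.stale_trials = score, 0
        else:
            self.stale_trials += 1

        if self.expected_reward is not None and score >= self.sign * self.expected_reward:
            return True
        if (self.max_no_improvement_trials is not None
                and self.stale_trials >= self.max_no_improvement_trials):
            return True
        if self.time_limit is not None and self.clock() - self.start >= self.time_limit:
            return True
        return False


stopper = EarlyStopper(expected_reward=0.95, max_no_improvement_trials=3)
rewards = [0.90, 0.92, 0.91, 0.97, 0.99]
stopped_at = next(i for i, r in enumerate(rewards, start=1) if stopper.should_stop(r))
print(stopped_at)
```

With these rewards the search ends at the fourth trial, the first one whose reward reaches 0.95, and the fifth trial is never run.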

Advanced examples

Pseudo label

HyperGBM can use the test set for training in a semi-supervised way to improve model performance. Usage:

...
experiment = CompeteExperiment(hk, X_train, y_train, X_test=X_test, callbacks=[], scorer=get_scorer('accuracy'),
                               pseudo_labeling=True,  # Enable pseudo label
                               pseudo_labeling_proba_threshold=0.9)
...

Here is an example:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> 
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM, CompeteExperiment
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> from sklearn.metrics import get_scorer
>>> 
>>> rs = EvolutionSearcher(search_space_general,  200, 100, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> experiment = CompeteExperiment(hk, X_train, y_train, X_test=X_test, callbacks=[], scorer=get_scorer('accuracy'),
...                                pseudo_labeling=True,  # enable pseudo
...                                pseudo_labeling_proba_threshold=0.9)
>>> 
>>> pipeline = experiment.run(use_cache=True, max_trials=10)  # first stage trains a model to label the test dataset; second stage trains on the pseudo-labeled test dataset plus the train dataset
   Trial No.    Reward   Elapsed                       Space Vector
0          3  0.972222  0.194367  [0, 3, 1, 2, 3, 1, 3, 0, 0, 1, 0]
1          5  0.972222  0.130711  [0, 2, 1, 0, 2, 0, 3, 0, 1, 4, 3]
2          8  0.972222  0.113038     [0, 1, 0, 0, 1, 0, 2, 0, 2, 3]
3         10  0.972222  0.134826      [1, 2, 0, 0, 500, 3, 2, 3, 4]
4          1  0.944444  0.251970                 [2, 2, 0, 3, 1, 2]

   Trial No.    Reward   Elapsed           Space Vector
0          1  0.972222  0.338019  [2, 0, 1, 0, 2, 4, 1]
1          2  0.972222  0.232059  [2, 3, 1, 1, 0, 4, 1]
2          3  0.972222  0.207254     [2, 3, 0, 3, 0, 2]
3          4  0.972222  0.262670  [2, 1, 1, 2, 1, 1, 0]
4          6  0.972222  0.246977     [2, 3, 0, 3, 1, 1]
>>> pipeline
Pipeline(steps=[('data_clean',
                 DataCleanStep(data_cleaner_args={}, name='data_clean',
                               random_state=9527)),
                ('estimator',
                 GreedyEnsemble(weight=[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.], estimators=[<hypergbm.estimators.CatBoostClassifierWrapper object at 0x1a38139110>, None, None, None, None, None, None, None, None, None]))])
>>> import numpy as np
>>> y_pred = pipeline.predict(X_test).astype(np.float64)
>>> 
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_pred, y_test)
1.0
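
The role of pseudo_labeling_proba_threshold can be illustrated with a short sketch: only test rows whose most likely class has a predicted probability at or above the threshold receive a pseudo label and join the second training stage. The helper name and the sample probabilities are made up for this example, and the exact selection rule inside HyperGBM is an assumption.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (row_index, class_index) for rows predicted with high confidence."""
    selected = []
    for row_index, row in enumerate(probabilities):
        best_class = max(range(len(row)), key=row.__getitem__)
        if row[best_class] >= threshold:
            selected.append((row_index, best_class))
    return selected


# Hypothetical class probabilities predicted for four unlabeled test rows.
test_proba = [
    [0.97, 0.02, 0.01],
    [0.40, 0.35, 0.25],
    [0.05, 0.93, 0.02],
    [0.10, 0.10, 0.80],
]
pseudo = select_pseudo_labels(test_proba, threshold=0.9)
print(pseudo)
```

Rows 1 and 3 are left out because the model is not confident enough about them, which limits the amount of label noise added to the training set.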
Feature selection

HyperGBM turns the features into noise one at a time and measures how much the model's performance degrades: the larger the degradation, the more important the feature. Based on these importances, a subset of the features is selected and the model is retrained, which saves computing resources and time. Here is an example:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> 
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM, CompeteExperiment
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> from sklearn.metrics import get_scorer
>>> 
>>> rs = EvolutionSearcher(search_space_general,  200, 100, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> 
>>> experiment = CompeteExperiment(hk, X_train, y_train, X_test, y_test, callbacks=[], scorer=get_scorer('accuracy'),
...                                feature_reselection=True,  # enable feature importance selection
...                                feature_reselection_estimator_size=3,  # use 3 estimators to evaluate feature importance
...                                feature_reselection_threshold=0.01)  # importance less than the threshold will not be selected
>>> pipeline = experiment.run(use_cache=True, max_trials=10)
   Trial No.  Reward   Elapsed                      Space Vector
0          2     1.0  0.373262                [2, 3, 0, 2, 2, 1]
1          3     1.0  0.194120  [1, 3, 1, 1, 365, 1, 3, 1, 0, 3]
2          4     1.0  0.109643  [1, 0, 1, 2, 140, 0, 2, 3, 4, 1]
3          6     1.0  0.107316    [0, 3, 0, 2, 2, 0, 1, 2, 2, 2]
4          7     1.0  0.117224   [1, 0, 1, 2, 40, 2, 1, 2, 4, 0]

             feature  importance       std
0  sepal length (cm)    0.000000  0.000000
1   sepal width (cm)    0.011111  0.015713
2  petal length (cm)    0.495556  0.199580
3   petal width (cm)    0.171111  0.112787

   Trial No.  Reward   Elapsed                      Space Vector
0          1     1.0  0.204705    [0, 1, 0, 2, 0, 1, 3, 0, 4, 3]
1          2     1.0  0.109204   [1, 1, 1, 2, 90, 1, 2, 0, 0, 1]
2          3     1.0  0.160209  [1, 2, 1, 0, 305, 3, 0, 0, 1, 1]
3          4     1.0  1.062759             [2, 1, 1, 2, 3, 1, 0]
4          6     1.0  0.218692    [0, 0, 0, 1, 0, 1, 2, 0, 0, 3]
>>> 
>>> import numpy as np
>>> y_pred = pipeline.predict(X_test).astype(np.float64)
>>> 
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_pred, y_test)
1.0
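
The importance measure described in this section can be sketched in plain Python. Shuffling a column is one common way to turn a feature into noise; the toy model, the data, and the helper names below are assumptions for illustration and do not reproduce HyperGBM's scoring code.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Importance of feature j = baseline accuracy minus accuracy with column j shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # destroy the link between feature j and the label
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(model, permuted, labels))
    return importances


# Toy "fitted model": it only looks at feature 0 and ignores feature 1.
def model(row):
    return int(row[0] > 0.5)


rows = [[i / 10, (i * 7) % 10 / 10] for i in range(10)]
labels = [model(r) for r in rows]
importances = permutation_importance(model, rows, labels, n_features=2)
print(importances)
```

Feature 1 always gets an importance of exactly 0 here because the toy model never reads it, so a threshold such as feature_reselection_threshold would discard it.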
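Concept drift

To enable drift detection, set drift_detection=True when creating the experiment. Here is an example:
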
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> 
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> 
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM, CompeteExperiment
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> from sklearn.metrics import get_scorer
>>> 
>>> rs = EvolutionSearcher(search_space_general,  200, 100, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> 
>>> experiment = CompeteExperiment(hk, X_train, y_train, X_test, y_test, callbacks=[], scorer=get_scorer('accuracy'),
...                                drift_detection=True)  # enable drift detection
>>> pipeline = experiment.run(use_cache=True, max_trials=10)
   Trial No.  Reward   Elapsed                       Space Vector
0          1     1.0  0.236796              [2, 2, 1, 3, 0, 4, 2]
1          3     1.0  0.207033     [0, 0, 0, 4, 1, 1, 2, 2, 1, 3]
2          4     1.0  0.106351      [1, 2, 0, 2, 240, 3, 2, 1, 2]
3          5     1.0  0.110495     [0, 0, 0, 2, 2, 0, 2, 1, 2, 2]
4          6     1.0  0.175838  [0, 3, 1, 3, 2, 1, 3, 1, 1, 4, 1]
>>> import numpy as np
>>> y_pred = pipeline.predict(X_test).astype(np.float64)
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_pred, y_test)
1.0
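
One way to think about drift detection is to ask how easily train rows can be told apart from test rows. The sketch below scores each feature by the AUC obtained when its raw value is used to separate the two samples; a value near 0.5 means the distributions look alike. This is a simplified single-feature stand-in for adversarial validation, with a hypothetical 0.7 threshold and made-up data, and it should not be read as HyperGBM's actual procedure.

```python
def separation_auc(train_values, test_values):
    """AUC of using the raw feature value to tell test rows from train rows."""
    wins = 0.0
    for t in test_values:
        for s in train_values:
            if t > s:
                wins += 1.0
            elif t == s:
                wins += 0.5  # ties count as half a win
    return wins / (len(train_values) * len(test_values))


def drifted_features(train, test, threshold=0.7):
    """Names of features whose train and test values are easy to separate."""
    flagged = []
    for name in train:
        auc = separation_auc(train[name], test[name])
        # An AUC near 0.5 means the samples look alike; far from 0.5 suggests drift.
        if max(auc, 1.0 - auc) >= threshold:
            flagged.append(name)
    return flagged


train = {"age": [20, 30, 40, 50], "year": [2018, 2018, 2019, 2019]}
test = {"age": [20, 30, 40, 50], "year": [2020, 2020, 2021, 2021]}
flagged = drifted_features(train, test)
print(flagged)
```

In this toy data the age column is identical in both samples, while year separates them perfectly and is flagged.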
Ensemble

HyperGBM can combine the better models found during the search into an ensemble with better generalization ability. Example:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> 
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> 
>>> from hypergbm.search_space import search_space_general
>>> from hypergbm import HyperGBM, CompeteExperiment
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> from sklearn.metrics import get_scorer
>>> 
>>> rs = EvolutionSearcher(search_space_general,  200, 100, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> experiment = CompeteExperiment(hk, X_train, y_train, X_test, y_test, callbacks=[], scorer=get_scorer('accuracy'),
...                                ensemble_size=5)  # set ensemble
>>> pipeline = experiment.run(use_cache=True, max_trials=10)
   Trial No.  Reward   Elapsed                     Space Vector
0          1     1.0  0.856545            [2, 1, 1, 1, 3, 4, 1]
1          2     1.0  0.271147               [2, 0, 0, 1, 0, 2]
2          3     1.0  0.160234  [1, 0, 1, 0, 45, 2, 1, 3, 4, 0]
3          4     1.0  0.279989            [2, 0, 1, 0, 0, 1, 4]
4          5     1.0  0.262032            [2, 3, 1, 1, 0, 3, 2]
>>> import numpy as np
>>> y_pred = pipeline.predict(X_test).astype(np.float64)
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_pred, y_test)
1.0
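
The pipelines printed in this document end in a GreedyEnsemble step with a weight per estimator. A common way to obtain such weights is greedy forward selection with replacement: at each step, add the model that most improves the blended score. The sketch below assumes that scheme, a binary task, and accuracy as the score; the function names and probabilities are illustrative and are not taken from HyperGBM.

```python
def blend_accuracy(y_true, proba):
    """Accuracy of positive-class probabilities thresholded at 0.5."""
    return sum((p >= 0.5) == bool(y) for y, p in zip(y_true, proba)) / len(y_true)


def greedy_ensemble(model_probas, y_true, ensemble_size):
    """Return one weight per model, chosen by greedy selection with replacement."""
    running_sum = [0.0] * len(y_true)
    counts = [0] * len(model_probas)
    for step in range(1, ensemble_size + 1):
        best_index, best_score = None, -1.0
        for i, proba in enumerate(model_probas):
            # Score the blend that would result from adding model i at this step.
            blended = [(s + p) / step for s, p in zip(running_sum, proba)]
            score = blend_accuracy(y_true, blended)
            if score > best_score:
                best_index, best_score = i, score
        counts[best_index] += 1
        running_sum = [s + p for s, p in zip(running_sum, model_probas[best_index])]
    return [c / ensemble_size for c in counts]


y_true = [1, 0, 1, 0]
model_probas = [
    [0.9, 0.2, 0.4, 0.1],  # model A: wrong on the third sample
    [0.6, 0.6, 0.9, 0.3],  # model B: wrong on the second sample
    [0.4, 0.6, 0.4, 0.6],  # model C: wrong everywhere
]
weights = greedy_ensemble(model_probas, y_true, ensemble_size=4)
print(weights)
```

Models A and B make different mistakes, so their blend is better than either alone, and model C ends up with a weight of zero.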

Distributed

Custom search space

Feature generation

More features can be generated based on continuous features, such as the difference between two columns:

>>> import pandas as pd
>>> df = pd.DataFrame(data={"x1": [1, 2, 4], "x2": [9, 8, 7]})
>>> df
   x1  x2
0   1   9
1   2   8
2   4   7
>>> from hypergbm.feature_generators import FeatureGenerationTransformer
>>> ft = FeatureGenerationTransformer(trans_primitives=['subtract_numeric'])
>>> ft.fit(df)
<hypergbm.feature_generators.FeatureGenerationTransformer object at 0x101839d10>
>>> ft.transform(df)
                      x1  x2  x1 - x2
e_hypernets_ft_index                 
0                      1   9       -8
1                      2   8       -6
2                      4   7       -3

The supported numeric primitives are:

  • add_numeric

  • subtract_numeric

  • divide_numeric

  • multiply_numeric

  • negate

  • modulo_numeric

  • modulo_by_feature

  • cum_mean

  • cum_sum

  • cum_min

  • cum_max

  • percentile

  • absolute
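
To show what a pairwise primitive computes, the sketch below applies a few of the operations to every pair of columns in a plain dictionary. It is a stand-in for illustration only: the helper and the mapping are hypothetical, and the column names for primitives other than subtract_numeric are assumed to follow the same "a op b" pattern as the output above.

```python
from itertools import combinations
import operator

# Hypothetical mapping from primitive name to (symbol, function).
PAIRWISE = {
    "add_numeric": ("+", operator.add),
    "subtract_numeric": ("-", operator.sub),
    "multiply_numeric": ("*", operator.mul),
}


def generate_features(columns, primitives):
    """Return the original columns plus one new column per column pair and primitive."""
    out = dict(columns)
    for a, b in combinations(columns, 2):
        for name in primitives:
            symbol, fn = PAIRWISE[name]
            out[f"{a} {symbol} {b}"] = [fn(x, y) for x, y in zip(columns[a], columns[b])]
    return out


df = {"x1": [1, 2, 4], "x2": [9, 8, 7]}
features = generate_features(df, ["subtract_numeric"])
print(features["x1 - x2"])
```

The generated column matches the x1 - x2 values in the transformer output above.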

It can also extract fields such as year, month, and day from a datetime feature:

>>> import pandas as pd
>>> from datetime import datetime
>>> df = pd.DataFrame(data={"x1":  pd.to_datetime([datetime.now()] * 10)})
>>> df[:3]
                          x1
0 2021-01-25 10:27:54.776580
1 2021-01-25 10:27:54.776580
2 2021-01-25 10:27:54.776580

>>> from hypergbm.feature_generators import FeatureGenerationTransformer
>>> ft = FeatureGenerationTransformer(trans_primitives=["year", "month", "week", "minute", "day", "hour", "second", "weekday", "is_weekend"])
>>> ft.fit(df)
<hypergbm.feature_generators.FeatureGenerationTransformer object at 0x1a29624dd0>
>>> ft.transform(df)
                                             x1  YEAR(x1)  MONTH(x1)  WEEK(x1)  MINUTE(x1)  DAY(x1)  HOUR(x1)  SECOND(x1)  WEEKDAY(x1)  IS_WEEKEND(x1)
e_hypernets_ft_index                                                                                                                                  
0                    2021-01-25 10:27:54.776580      2021          1         4          27       25        10          54            0           False
1                    2021-01-25 10:27:54.776580      2021          1         4          27       25        10          54            0           False
2                    2021-01-25 10:27:54.776580      2021          1         4          27       25        10          54            0           False
3                    2021-01-25 10:27:54.776580      2021          1         4          27       25        10          54            0           False
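
The same fields can be reproduced with the standard library, which helps when checking the generated columns by hand. The sketch below uses the timestamp from the output above; treating "week" as the ISO week number and "weekday" as 0 for Monday is an assumption that happens to agree with the values shown.

```python
from datetime import datetime


def datetime_fields(ts):
    """Extract calendar fields from a datetime, mirroring the primitive names."""
    return {
        "year": ts.year,
        "month": ts.month,
        "week": ts.isocalendar()[1],  # ISO week number (assumed convention)
        "day": ts.day,
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
        "weekday": ts.weekday(),      # Monday is 0
        "is_weekend": ts.weekday() >= 5,
    }


fields = datetime_fields(datetime(2021, 1, 25, 10, 27, 54, 776580))
print(fields)
```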

Using feature generation in search space:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> from hypergbm.estimators import XGBoostEstimator
>>> from hypergbm.pipeline import Pipeline
>>> from hypergbm.sklearn.transformers import FeatureGenerationTransformer
>>> from hypernets.core.ops import ModuleChoice, HyperInput
>>> from hypernets.core.search_space import HyperSpace
>>> from tabular_toolbox.column_selector import column_exclude_datetime
>>> 
>>> def search_space(task=None):  # Define a search space that includes feature generation
...     space = HyperSpace()
...     with space.as_default():
...         input = HyperInput(name='input1')
...         feature_gen = FeatureGenerationTransformer(task=task,  # Add feature generation to search space
...                                                    trans_primitives=["add_numeric", "subtract_numeric", "divide_numeric", "multiply_numeric"]) 
...         full_pipeline = Pipeline([feature_gen], name=f'feature_gen_and_preprocess', columns=column_exclude_datetime)(input)
...         xgb_est = XGBoostEstimator(fit_kwargs={})
...         ModuleChoice([xgb_est], name='estimator_options')(full_pipeline)
...         space.set_inputs(input)
...     return space
>>> 
>>> from hypergbm import HyperGBM
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> 
>>> rs = EvolutionSearcher(search_space,  200, 100, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> hk.search(X_train, y_train, X_eval=X_test, y_eval=y_test)
   Trial No.  Reward   Elapsed Space Vector
0          1     1.0  0.376869           []
>>> estimator = hk.load_estimator(hk.get_best_trial().model_file)
>>> y_pred = estimator.predict(X_test)
>>> 
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test, y_pred)
1.0
Using GBM estimators

The GBM algorithms (wrapper class) supported by HyperGBM are:

  • XGBoost (hypergbm.estimators.XGBoostEstimator)

  • HistGB (hypergbm.estimators.HistGBEstimator)

  • LightGBM (hypergbm.estimators.LightGBMEstimator)

  • CatBoost (hypergbm.estimators.CatBoostEstimator)

The hyperparameters are defined in the search space used for training. Here is an example that uses XGBoost to train on the iris dataset:

# Load dataset
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X[:3]
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
>>> y[:3]
0    0
1    0
2    0
Name: target, dtype: int64
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

>>> from hypergbm.estimators import XGBoostEstimator
>>> from hypergbm.pipeline import Pipeline, DataFrameMapper
>>> from hypergbm.sklearn.transformers import MinMaxScaler, StandardScaler
>>> from hypernets.core import OptimizeDirection
>>> from hypernets.core.ops import ModuleChoice, HyperInput
>>> from hypernets.core.search_space import HyperSpace
>>> from tabular_toolbox.column_selector import column_number_exclude_timedelta
# Define a search space that includes XGBoost
>>> def search_space():
...     space = HyperSpace()
...     with space.as_default():
...         input = HyperInput(name='input1')
...         scaler_choice = ModuleChoice(
...             [
...                 StandardScaler(name=f'numeric_standard_scaler'),
...                 MinMaxScaler(name=f'numeric_minmax_scaler')
...             ], name=f'numeric_or_scaler'
...         )
...         num_pipeline = Pipeline([scaler_choice], name='numeric_pipeline', columns=column_number_exclude_timedelta)(input)
...         union_pipeline = DataFrameMapper(default=None, input_df=True, df_out=True)([num_pipeline])
...         xgb_est = XGBoostEstimator(fit_kwargs={})
...         ModuleChoice([xgb_est], name='estimator_options')(union_pipeline)  # Make XGBoost an estimator choice
...         space.set_inputs(input)
...     return space

# Search
>>> from hypergbm import HyperGBM
>>> from hypernets.searchers import MCTSSearcher
>>> rs = MCTSSearcher(search_space, max_node_space=10, optimize_direction=OptimizeDirection.Maximize)
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> 
>>> hk.search(X_train, y_train, X_eval=X_test, y_eval=y_test)
   Trial No.  Reward   Elapsed Space Vector
0          1     1.0  0.206926          [0]
1          2     1.0  0.069099          [1]
Class balancing

HyperGBM supports several strategies for unbalanced data sampling:

Class weight

  • ClassWeight

Over sampling

  • RandomOverSampling

  • SMOTE

  • ADASYN

Under sampling

  • RandomUnderSampling

  • NearMiss

  • TomeksLinks

Configure class balancing policies in estimator:

...
xgb_est = XGBoostEstimator(fit_kwargs={}, class_balancing='ClassWeight')  # Use class balancing
...

Here is an example of training with the ClassWeight strategy:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split 
>>> X, y = load_iris(return_X_y=True, as_frame=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> from hypergbm.estimators import XGBoostEstimator
>>> from hypergbm.pipeline import Pipeline, DataFrameMapper
>>> from hypergbm.sklearn.transformers import MinMaxScaler, StandardScaler
>>> from hypernets.core.ops import ModuleChoice, HyperInput
>>> from hypernets.core.search_space import HyperSpace
>>> from tabular_toolbox.column_selector import column_number_exclude_timedelta
>>> 
>>> def search_space():
...     space = HyperSpace()
...     with space.as_default():
...         input = HyperInput(name='input1')
...         scaler_choice = ModuleChoice(
...             [
...                 StandardScaler(name=f'numeric_standard_scaler'),
...                 MinMaxScaler(name=f'numeric_minmax_scaler')
...             ], name='numeric_or_scaler'
...         )
...         num_pipeline = Pipeline([scaler_choice], name='numeric_pipeline', columns=column_number_exclude_timedelta)(input)
...         union_pipeline = DataFrameMapper(default=None, input_df=True, df_out=True)([num_pipeline])
...         xgb_est = XGBoostEstimator(fit_kwargs={}, class_balancing='ClassWeight')  # Use class balancing
...         ModuleChoice([xgb_est], name='estimator_options')(union_pipeline)
...         space.set_inputs(input)
...     return space
>>> from hypergbm import HyperGBM
>>> from hypernets.searchers.evolution_searcher import EvolutionSearcher
>>> 
>>> rs = EvolutionSearcher(search_space,  200, 100, optimize_direction='max')
>>> hk = HyperGBM(rs, task='multiclass', reward_metric='accuracy', callbacks=[])
>>> hk.search(X_train, y_train, X_eval=X_test, y_eval=y_test)
   Trial No.  Reward   Elapsed Space Vector
0          1     1.0  0.100520          [0]
1          2     1.0  0.083927          [1]
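
Two of the strategies listed above can be sketched in a few lines. The class-weight helper uses the widely used "balanced" heuristic, n_samples / (n_classes * class_count), and the over-sampler duplicates random minority rows until all classes have the same size. Whether HyperGBM's ClassWeight and RandomOverSampling options compute exactly this is an assumption; the function names are illustrative.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def random_over_sample(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


labels = [0] * 8 + [1] * 2
rows = [[i] for i in range(10)]
weights = balanced_class_weights(labels)
resampled_rows, resampled_labels = random_over_sample(rows, labels)
print(weights, Counter(resampled_labels))
```

Class weighting leaves the data untouched and changes only the loss contribution of each sample, whereas over-sampling changes the training set itself.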

How-To

How to install shap on CentOS 7?

  1. Install system dependencies

    yum install epel-release centos-release-scl -y && yum clean all && yum makecache  # llvm9.0 is in epel, gcc9 in scl
    yum install -y llvm9.0 llvm9.0-devel python36-devel devtoolset-9-gcc devtoolset-9-gcc-c++ make cmake 
    
  2. Configure install environment

    whereis llvm-config-9.0-64  # find your `llvm-config` path
    # llvm-config-9: /usr/bin/llvm-config-9.0-64
    
    export LLVM_CONFIG=/usr/bin/llvm-config-9.0-64  # set to your path
    scl enable devtoolset-9 bash
    
  3. Install shap

    pip3 -v install numpy==1.19.1  # prepare shap dependency
    pip3 -v install scikit-learn==0.23.1  # prepare shap dependency
    pip3 -v install shap==0.28.5
    

If downloading the dependencies of shap is very slow, consider using faster pip and setuptools mirrors. Taking the mirror provided by Aliyun as an example, create the file ~/.pip/pip.conf with the following content:

[global]
index-url = https://mirrors.aliyun.com/pypi/simple

Then create the file ~/.pydistutils.cfg with the following content:

[easy_install]
index_url = https://mirrors.aliyun.com/pypi/simple

Release Note

Version 0.2.0

This release adds the following new features:

Feature engineering
  • Feature generation

  • Feature selection

Data clean
  • Special empty value handling

  • Correct data type

  • Id-ness features cleanup

  • Duplicate features cleanup

  • Empty label rows cleanup

  • Illegal values replacement

  • Constant features cleanup

  • Collinearity features cleanup

Data set split
  • Adversarial validation

Modeling algorithms
  • XGBoost

  • Catboost

  • LightGBM

  • HistGradientBoosting

Training
  • Task inference

  • Command-line tools

Evaluation strategies:
  • Cross-validation

  • Train-Validation-Holdout

Search strategies
  • Monte Carlo Tree Search

  • Evolution

  • Random search

Imbalance data
  • Class Weight

  • Under-Sampling: Near miss, Tomeks links, Random

  • Over-Sampling: SMOTE, ADASYN, Random

Early stopping strategies
  • max_no_improvement_trials

  • time_limit

  • expected_reward

Advanced features:
  • Two stage search: Pseudo label, Feature selection

  • Concept drift handling

  • Ensemble
