What is HyperGBM

HyperGBM is a library that supports full-pipeline AutoML, which completely covers the end-to-end stages of data cleaning, preprocessing, feature generation and selection, model selection and hyperparameter optimization.It is a real-AutoML tool for tabular data.

Unlike most AutoML approaches that focus on tackling the hyperparameter optimization problem of machine learning algorithms, HyperGBM can put the entire process from data cleaning to algorithm selection in one search space for optimization. End-to-end pipeline optimization is more like a sequential decision process, thereby HyperGBM uses reinforcement learning, Monte Carlo Tree Search, evolution algorithm combined with a meta-learner to efficiently solve such problems. As the name implies, the ML algorithms used in HyperGBM are all GBM models, and more precisely the gradient boosting tree model, which currently includes XGBoost, LightGBM and Catboost. The underlying search space representation and search algorithm in HyperGBM are powered by the Hypernets project a general AutoML framework.

Main components

In this section, we briefly cover the main components in HyperGBM. As shown below:

  • HyperGBM(HyperModel)

    HyperGBM is a specific implementation of HyperModel (for HyperModel, please refer to the Hypernets project). It is the core interface of the HyperGBM project. By calling the search method to explore and return the best model in the specified Search Space with the specified Searcher.

  • Search Space

    Search spaces are constructed by arranging ModelSpace(transformer and estimator), ConnectionSpace(pipeline) and ParameterSpace(hyperparameter). The transformers are chained together by pipelines while the pipelines can be nested. The last node of a search space must be an estimator. Each transformer and estimator can define a set of hyperparameterss.


The code example of Numeric Pipeline is as follows:

import numpy as np
from hypergbm.pipeline import Pipeline
from hypergbm.sklearn.transformers import SimpleImputer, StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, LogStandardScaler
from hypernets.core.ops import ModuleChoice, Optional, Choice
from tabular_toolbox.column_selector import  column_number_exclude_timedelta

def numeric_pipeline_complex(impute_strategy=None, seq_no=0):
    if impute_strategy is None:
        impute_strategy = Choice(['mean', 'median', 'constant', 'most_frequent'])
    elif isinstance(impute_strategy, list):
        impute_strategy = Choice(impute_strategy)

    imputer = SimpleImputer(missing_values=np.nan, strategy=impute_strategy, name=f'numeric_imputer_{seq_no}',
    scaler_options = ModuleChoice(
        ], name=f'numeric_or_scaler_{seq_no}'
    scaler_optional = Optional(scaler_options, keep_link=True, name=f'numeric_scaler_optional_{seq_no}')
    pipeline = Pipeline([imputer, scaler_optional],
    return pipeline
  • Searcher

    Searcher is an algorithm used to explore a search space.It encompasses the classical exploration-exploitation trade-off since, on the one hand, it is desirable to find well-performing model quickly, while on the other hand, premature convergence to a region of suboptimal solutions should be avoided. Three algorithms are provided in HyperGBM: MCTSSearcher (Monte-Carlo tree search), EvolutionarySearcher and RandomSearcher.

  • HyperGBMEstimator

    HyperGBMEstimator is an object built from a sample in the search space, including the full preprocessing pipeline and a GBM model. It can be used to fit on training data, evaluate on evaluation data, and predict on new data.

  • CompeteExperiment

    CompeteExperiment is a powerful tool provided in HyperGBM. It not only performs pipeline search, but also contains some advanced features to further improve the model performance such as data drift handling, pseudo-labeling, ensemble, etc.

Feature matrix

There are 3 training modes:

  • Standalone

  • Distributed on single machine

  • Distributed on multiple machines

Here is feature matrix of training modes:




Distributed on single machine

Distributed on multiple machines

Feature engineering

Feature generation
Feature dimension reduction

Data clean

Correct data type
Special empty value handing
Id-ness features cleanup
Duplicate features cleanup
Empty label rows cleanup
Illegal values replacement
Constant features cleanup
Collinearity features cleanup

Data set split

Adversarial validation

Modeling algorithms



Task inference
Command-line tools

Evaluation strategies


Search strategies

Monte Carlo Tree Search
Random search

Class balancing

Class Weight
Under-Samping(Near miss,Tomeks links,Random)

Early stopping strategies


Advance features

Two stage search(Pseudo label,Feature selection)
Concept drift handling