Features

HyperGBM can run in three modes:

  • Single node: runs on a single machine and works with Pandas and NumPy data types

  • Distributed with single node: runs on a single machine and works with Dask data types, which requires creating Dask collections before using HyperGBM

  • Distributed with multiple nodes: runs on multiple machines and works with Dask data types, which requires creating a Dask cluster to manage the resources of all machines before using HyperGBM (see the sketch after this list)
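
For the two distributed modes, the Dask cluster and the Dask data collections must exist before the experiment is created. The sketch below shows a minimal setup for both the single-node and the distributed case; it assumes the make_experiment entry point from the hypergbm package, and the file name train.csv and target column y are placeholders.

```python
# Single-node mode: Pandas/NumPy data, no extra setup required.
import pandas as pd
from hypergbm import make_experiment

train_data = pd.read_csv('train.csv')                  # placeholder file name
experiment = make_experiment(train_data, target='y')   # 'y' is a placeholder target column
estimator = experiment.run()

# Distributed mode: create a Dask cluster first, then load the data as Dask collections.
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

client = Client(LocalCluster(n_workers=2))   # or Client('scheduler-address:8786') for multiple nodes
train_data = dd.read_csv('train.csv')        # Dask DataFrame instead of a Pandas DataFrame
experiment = make_experiment(train_data, target='y')
estimator = experiment.run()
```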

The set of supported features varies with the running mode. The full feature set, grouped by category, is listed below; a short usage sketch follows the list.

Data Cleaning

  • Empty character handling
  • Automatic column type recognition
  • Column type correction
  • Constant column cleaning
  • Duplicate column cleaning
  • Deleting samples without targets
  • Illegal character replacement
  • ID column cleaning

Dataset splitting

  • Splitting by ratio
  • Adversarial validation

Feature engineering

  • Feature generation
  • Feature dimension reduction

Data preprocessing

  • SimpleImputer
  • SafeOrdinalEncoder
  • SafeOneHotEncoder
  • TruncatedSVD
  • StandardScaler
  • MinMaxScaler
  • MaxAbsScaler
  • RobustScaler

Imbalanced data handling

  • ClassWeight
  • UnderSampling (NearMiss, TomekLinks, Random)
  • OverSampling (SMOTE, ADASYN, Random)

Search algorithms

  • MCTS
  • Evolution
  • Random search
  • Playback

Early stopping

  • Time limit
  • No improvement after n trials
  • expected_reward reached
  • Trial discriminator

Modeling algorithms

  • XGBoost
  • LightGBM
  • CatBoost
  • HistGradientBoosting

Evaluation

  • Cross-validation
  • Train-Validation-Holdout

Advanced features

  • Automatic task type inference
  • Collinearity detection
  • Data drift detection
  • Feature selection
  • Feature selection (two-stage)
  • Pseudo labeling (two-stage)
  • Pre-searching with UnderSampling
  • Model ensemble
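
Many of the features above are enabled through keyword arguments of make_experiment. The sketch below is only illustrative: the keyword names (feature_generation, collinearity_detection, drift_detection, pseudo_labeling, class_balancing, early_stopping_time_limit, ensemble_size) mirror the options listed above but are assumptions that may differ between HyperGBM versions, so check the API reference of your installed release.

```python
# Illustrative only: keyword names follow the feature categories listed above
# and may differ between HyperGBM versions.
import pandas as pd
from hypergbm import make_experiment

train_data = pd.read_csv('train.csv')       # placeholder file name
experiment = make_experiment(
    train_data,
    target='y',                             # placeholder target column
    cv=True,                                # evaluation: cross-validation instead of holdout
    feature_generation=True,                # feature engineering: feature generation
    collinearity_detection=True,            # advanced: collinearity detection
    drift_detection=True,                   # advanced: data drift detection
    pseudo_labeling=True,                   # advanced: pseudo labeling (two-stage)
    class_balancing='ClassWeight',          # imbalanced data handling
    early_stopping_time_limit=3600,         # early stopping: time limit in seconds
    ensemble_size=20,                       # advanced: model ensemble
)
estimator = experiment.run()
```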