Features
HyperGBM can run in three modes:

- Single node: runs on a single machine and uses pandas and NumPy data types.
- Distributed with single node: runs on a single machine and uses Dask data types; Dask collections must be created before using HyperGBM.
- Distributed with multi nodes: runs on multiple machines and uses Dask data types; a Dask cluster must be set up to manage the resources of the machines before using HyperGBM.

The supported features vary across the three running modes, as shown in the following table:
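As a minimal sketch of the data-type difference between the modes: the single-node mode works directly on pandas/NumPy objects, while both distributed modes expect Dask collections. The Dask setup below is shown as comments (hypothetical, assuming `dask` and `dask.distributed` are installed) to avoid a hard dependency:

```python
import pandas as pd

# Single-node mode: plain pandas / NumPy data structures.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [0, 1, 0, 1]})

# Distributed modes: the same data must be wrapped in Dask collections
# before being passed to HyperGBM, e.g. (hypothetical setup):
#
#   from dask.distributed import Client
#   import dask.dataframe as dd
#
#   client = Client()                     # single machine: local Dask cluster
#   # client = Client("scheduler:8786")   # multi node: connect to a scheduler
#   ddf = dd.from_pandas(df, npartitions=2)

print(df.shape)
```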
| Category | Feature | Single node | Distributed with single node | Distributed with multi nodes |
|---|---|---|---|---|
| Data cleaning | Empty characters handling | √ | √ | √ |
| | Recognizing column types automatically | √ | √ | √ |
| | Column type correction | √ | √ | √ |
| | Constant column cleaning | √ | √ | √ |
| | Duplicate column cleaning | √ | √ | √ |
| | Deleting samples without targets | √ | √ | √ |
| | Illegal character replacement | √ | √ | √ |
| | ID column cleaning | √ | √ | √ |
| Dataset splitting | Splitting by ratio | √ | √ | √ |
| | Adversarial validation | √ | √ | √ |
| Feature engineering | Feature generation | √ | √ | √ |
| | Feature dimension reduction | √ | √ | √ |
| Data preprocessing | SimpleImputer | √ | √ | √ |
| | SafeOrdinalEncoder | √ | √ | √ |
| | SafeOneHotEncoder | √ | √ | √ |
| | TruncatedSVD | √ | √ | √ |
| | StandardScaler | √ | √ | √ |
| | MinMaxScaler | √ | √ | √ |
| | MaxAbsScaler | √ | √ | √ |
| | RobustScaler | √ | √ | √ |
| Imbalanced data handling | ClassWeight | √ | √ | √ |
| | UnderSampling (NearMiss, TomekLinks, Random) | √ | | |
| | OverSampling (SMOTE, ADASYN, Random) | √ | | |
| Search algorithms | MCTS | √ | √ | √ |
| | Evolution | √ | √ | √ |
| | Random search | √ | √ | √ |
| | Playback | √ | √ | √ |
| Early stopping | Time limit | √ | √ | √ |
| | No improvement after n trials | √ | √ | √ |
| | expected_reward | √ | √ | √ |
| | Trial discriminator | √ | √ | √ |
| Modeling algorithms | XGBoost | √ | √ | √ |
| | LightGBM | √ | √ | √ |
| | CatBoost | √ | √ | |
| | HistGradientBoosting | √ | | |
| Evaluation | Cross-validation | √ | √ | √ |
| | Train-validation-holdout | √ | √ | √ |
| Advanced features | Automatic task type inference | √ | √ | √ |
| | Collinearity detection | √ | √ | √ |
| | Data drift detection | √ | √ | √ |
| | Feature selection | √ | √ | √ |
| | Feature selection (two-stage) | √ | √ | √ |
| | Pseudo labeling (two-stage) | √ | √ | √ |
| | Pre-searching with undersampling | √ | √ | √ |
| | Model ensemble | √ | √ | √ |
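Note that under- and over-sampling for imbalanced data are only available in single-node mode, since they operate on in-memory pandas/NumPy data. As a simplified illustration of the "Random" undersampling strategy (not HyperGBM's own implementation), one can balance classes by randomly dropping majority-class rows:

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Randomly drop majority-class rows until all classes are balanced.

    A simplified sketch of random undersampling; HyperGBM's own
    implementation may differ in details.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()  # size of the smallest class
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        # Sample n_min row indices from each class without replacement.
        keep.append(rng.choice(idx, size=n_min, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

# 8 majority-class rows vs 2 minority-class rows.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
Xb, yb = random_undersample(X, y)
print(np.bincount(yb))
```

After undersampling, both classes contribute the same number of rows.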