mmtfPyspark.ml.datasetBalancer module

datasetBalancer.py:

Creates a balanced dataset for classification problems by either downsampling the majority classes or upsampling the minority classes. Each class is sampled at random, so the returned dataset has approximately the same number of samples in each class.

downsample(data, columnName, seed=7)

Returns a balanced dataset for the given column name by downsampling the majority classes. The classification column must be of type String.

Parameters:

data : DataFrame

columnName : str

column to be balanced by

seed : int

random number seed
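The downsampling idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the module's implementation: it works on a list of dicts rather than a Spark DataFrame, and the function name `downsample_rows` is hypothetical. Each row is kept with probability equal to the smallest class count divided by the count of the row's own class, so the class sizes come out approximately, not exactly, equal.

```python
import random
from collections import Counter


def downsample_rows(rows, column_name, seed=7):
    """Keep each row with probability min_count / count_of_its_class.

    Hypothetical helper for illustration only; it is not part of mmtfPyspark.
    """
    counts = Counter(row[column_name] for row in rows)
    min_count = min(counts.values())
    # Sampling fraction per class: 1.0 for the smallest class, lower for larger ones.
    fractions = {label: min_count / n for label, n in counts.items()}
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [row for row in rows if rng.random() < fractions[row[column_name]]]


# Toy data: 1000 rows of class "a" and 100 rows of class "b".
rows = [{"label": "a"}] * 1000 + [{"label": "b"}] * 100
balanced = downsample_rows(rows, "label")
print(Counter(row["label"] for row in balanced))
```

In the sketch the smallest class is always kept in full, while larger classes shrink to roughly the same size. With the module itself, the corresponding call takes a Spark DataFrame: `downsample(data, columnName, seed)`.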

upsample(data, columnName, seed=7)

Returns a balanced dataset for the given column name by upsampling the minority classes. The classification column must be of type String.

Parameters:

data : DataFrame

columnName : str

column to be balanced by

seed : int

random number seed
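Upsampling can be sketched the same way in plain Python. Again, this is an illustration under stated assumptions and not the module's implementation: it uses a list of dicts instead of a Spark DataFrame, and `upsample_rows` is a hypothetical name. Each row of a smaller class is replicated so that its class grows to roughly the size of the largest class; the fractional part of the replication ratio is handled by a random draw.

```python
import random
from collections import Counter


def upsample_rows(rows, column_name, seed=7):
    """Replicate rows of smaller classes to approach the largest class size.

    Hypothetical helper for illustration only; it is not part of mmtfPyspark.
    """
    counts = Counter(row[column_name] for row in rows)
    max_count = max(counts.values())
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    balanced = []
    for row in rows:
        # Target number of copies per row, e.g. 6.67 for a class 6.67x too small.
        ratio = max_count / counts[row[column_name]]
        whole = int(ratio)
        # Add one extra copy with probability equal to the fractional part.
        extra = 1 if rng.random() < ratio - whole else 0
        balanced.extend([row] * (whole + extra))
    return balanced


# Toy data: 1000 rows of class "a" and 150 rows of class "b".
rows = [{"label": "a"}] * 1000 + [{"label": "b"}] * 150
balanced = upsample_rows(rows, "label")
print(Counter(row["label"] for row in balanced))
```

In the sketch the largest class is left unchanged, and every smaller class ends up with approximately as many rows as the largest one. With the module itself, the corresponding call takes a Spark DataFrame: `upsample(data, columnName, seed)`.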