mmtfPyspark.ml.datasetBalancer module

datasetBalancer.py:

Creates a balanced dataset for classification problems by either downsampling the majority classes or upsampling the minority classes. Each class is sampled randomly, and the returned dataset contains approximately the same number of samples in each class.

downsample(data, columnName, seed=7)

Returns a dataset balanced on the given column by downsampling the majority classes. The classification column must be of type String.

Parameters:

data : DataFrame

dataset to balance

columnName : str

name of the classification column to balance by

seed : int

random number seed (default 7)
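The downsampling idea can be illustrated without Spark. The following is a minimal sketch, not the mmtfPyspark implementation: the helper name `downsample_rows` and the list-of-dicts row format are assumptions made for illustration. Each row is kept with probability (smallest class count / its own class count), so the smallest class is kept whole and the class sizes come out approximately equal.

```python
import random
from collections import Counter


def downsample_rows(rows, column_name, seed=7):
    """Hypothetical pure-Python sketch of majority-class downsampling.

    rows is a list of dicts; column_name is the key holding the class label.
    """
    if not rows:
        return []
    rng = random.Random(seed)
    # Count how many rows belong to each class.
    counts = Counter(row[column_name] for row in rows)
    min_count = min(counts.values())
    # Keep-probability per class; the smallest class gets 1.0.
    fractions = {label: min_count / n for label, n in counts.items()}
    # rng.random() is in [0, 1), so a fraction of 1.0 keeps every row.
    return [row for row in rows if rng.random() < fractions[row[column_name]]]


# Usage: 900 rows of class "a" and 100 rows of class "b".
rows = [{"label": "a"}] * 900 + [{"label": "b"}] * 100
balanced = downsample_rows(rows, "label", seed=7)
print(Counter(row["label"] for row in balanced))
```

Because every row is sampled independently, the majority-class count after sampling varies around the minority-class count rather than matching it exactly, which is consistent with the "approximately the same number of samples" wording above.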

upsample(data, columnName, seed=7)

Returns a dataset balanced on the given column by upsampling the minority classes. The classification column must be of type String.

Parameters:

data : DataFrame

dataset to balance

columnName : str

name of the classification column to balance by

seed : int

random number seed (default 7)
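Upsampling can be sketched the same way. This is again an illustration under stated assumptions, not the mmtfPyspark implementation: `upsample_rows` and the list-of-dicts row format are hypothetical. Each class smaller than the largest one is resampled with replacement until it matches the largest class. This variant produces exactly equal counts, whereas the module documents only approximately equal counts.

```python
import random
from collections import Counter, defaultdict


def upsample_rows(rows, column_name, seed=7):
    """Hypothetical pure-Python sketch of minority-class upsampling.

    rows is a list of dicts; column_name is the key holding the class label.
    """
    rng = random.Random(seed)
    # Group rows by class label.
    groups = defaultdict(list)
    for row in rows:
        groups[row[column_name]].append(row)
    if not groups:
        return []
    max_count = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        if len(group) == max_count:
            # Largest class: keep its rows unchanged.
            balanced.extend(group)
        else:
            # Smaller class: draw max_count rows with replacement.
            balanced.extend(rng.choices(group, k=max_count))
    return balanced


# Usage: 90 rows of class "a" and 10 rows of class "b".
rows = [{"label": "a", "id": i} for i in range(90)]
rows += [{"label": "b", "id": i} for i in range(10)]
balanced = upsample_rows(rows, "label", seed=7)
print(Counter(row["label"] for row in balanced))
```

Sampling with replacement duplicates minority rows, so any train/test split should be made before upsampling to avoid the same row appearing on both sides.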