mmtfPyspark.ml.datasetBalancer module
datasetBalancer.py:
Creates a balanced dataset for classification problems by either
downsampling the majority classes or upsampling the minority classes.
It randomly samples each class and returns a dataset with approximately
the same number of samples in each class.
downsample(data, columnName, seed=7)
Returns a balanced dataset for the given column name by downsampling
the majority classes.
The classification column must be of type String.
Parameters:
    data : DataFrame
    columnName : str
    seed : int
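The idea behind downsampling can be sketched in plain Python. This is an illustration only, not the library's implementation: the function name `downsample_sketch` is hypothetical, the rows are a list of dicts rather than a Spark DataFrame, and keeping each row independently with a per-class probability is an assumption made here to show why class sizes come out only approximately equal.

```python
import random
from collections import defaultdict


def downsample_sketch(rows, column_name, seed=7):
    """Hypothetical helper: balance classes by shrinking the larger ones."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[column_name]].append(row)

    # The smallest class sets the target size for every class.
    min_count = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        fraction = min_count / len(group)
        # Keep each row independently with probability `fraction`
        # (assumed scheme), so the result is approximately balanced.
        balanced.extend(row for row in group if rng.random() < fraction)
    return balanced


# Toy data: 900 rows of class "a" and 100 rows of class "b".
down_rows = [{"label": "a"}] * 900 + [{"label": "b"}] * 100
down_balanced = downsample_sketch(down_rows, "label")
down_counts = {
    label: sum(r["label"] == label for r in down_balanced)
    for label in ("a", "b")
}
print(down_counts)
```

In this sketch the smallest class has a sampling fraction of 1.0 and is kept in full, while the larger class is reduced to roughly the same size.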
upsample(data, columnName, seed=7)
Returns a balanced dataset for the given column name by upsampling
the minority classes.
The classification column must be of type String.
Parameters:
    data : DataFrame
    columnName : str
    seed : int
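Upsampling can be illustrated the same way in plain Python. Again, this is only a sketch under stated assumptions: `upsample_sketch` is a hypothetical name, the rows are a list of dicts rather than a Spark DataFrame, and drawing an exact number of rows with replacement is a simplification chosen here. The library may sample differently and return only approximately equal class sizes.

```python
import random
from collections import defaultdict


def upsample_sketch(rows, column_name, seed=7):
    """Hypothetical helper: balance classes by enlarging the smaller ones."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[column_name]].append(row)

    # The largest class sets the target size for every class.
    max_count = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        if len(group) == max_count:
            # A class that is already at the target size is kept unchanged.
            balanced.extend(group)
        else:
            # Draw with replacement until the class reaches the target size.
            balanced.extend(rng.choices(group, k=max_count))
    return balanced


# Toy data: 900 rows of class "a" and 100 rows of class "b".
up_rows = [{"label": "a"}] * 900 + [{"label": "b"}] * 100
up_balanced = upsample_sketch(up_rows, "label")
up_counts = {
    label: sum(r["label"] == label for r in up_balanced)
    for label in ("a", "b")
}
print(up_counts)
```

Because minority rows are drawn with replacement, the upsampled classes contain duplicates of their original rows; no new information is created.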