automl-toolkit Issue with DataSplitUtility repartition(0)

While following this tutorial, feature selection fails with the following error, thrown by DataSplitUtility: java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.

The only thing I do differently from the tutorial is setting the train/test split method to "chronological", as in:

Map(
  ...
  "tunerTrainSplitMethod" -> "chronological",
  "tunerTrainSplitChronologicalColumn" -> "id",
  "tunerTrainSplitChronologicalRandomPercentage" -> 0.25,
  ...
)

Any ideas on how to fix the issue?

I am using:

Spark 3.2.0
Hadoop 3.3.1
Scala 2.12.15
automl-toolkit 0.8.1

Dec 21 '21 15:12 ottobricks

The relevant logs are:

  java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
  at scala.Predef$.require(Predef.scala:281)
  at org.apache.spark.sql.catalyst.plans.logical.Repartition.<init>(basicLogicalOperators.scala:1372)
  at org.apache.spark.sql.Dataset.repartition(Dataset.scala:3022)
  at com.databricks.labs.automl.model.tools.split.SplitOperators$.optimizeTestTrain(SplitOperators.scala:371)
  at com.databricks.labs.automl.model.tools.split.DataSplitUtility.$anonfun$trainSplitPersist$1(DataSplitUtility.scala:108)
  at com.databricks.labs.automl.model.tools.split.DataSplitUtility.$anonfun$trainSplitPersist$1$adapted(DataSplitUtility.scala:85)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.immutable.Range.foreach(Range.scala:158)
  at scala.collection.TraversableLike.map(TraversableLike.scala:286)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
  at scala.collection.AbstractTraversable.map(Traversable.scala:108)
  at com.databricks.labs.automl.model.tools.split.DataSplitUtility.trainSplitPersist(DataSplitUtility.scala:85)
  at com.databricks.labs.automl.model.tools.split.DataSplitUtility.performSplit(DataSplitUtility.scala:248)
  at com.databricks.labs.automl.model.tools.split.DataSplitUtility$.split(DataSplitUtility.scala:291)
  at com.databricks.labs.automl.exploration.FeatureImportances.getImportances(FeatureImportances.scala:171)
  at com.databricks.labs.automl.exploration.FeatureImportances.generateFeatureImportances(FeatureImportances.scala:382)

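For anyone reading the trace: the top frames show Dataset.repartition being called with 0, which fails a `require` precondition. Where the 0 comes from inside optimizeTestTrain is not visible from the trace. The sketch below is plain Scala (no Spark) and only reproduces the shape of that check. `RepartitionGuard`, `requirePositive` and `safePartitionCount` are hypothetical names used for illustration, not part of Spark or automl-toolkit.

```scala
object RepartitionGuard {
  // Hypothetical helper that mirrors the message seen in the log.
  // Scala's require throws IllegalArgumentException("requirement failed: " + message).
  def requirePositive(numPartitions: Int): Int = {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    numPartitions
  }

  // Hypothetical guard: clamp a computed partition count to at least 1
  // before handing it to a repartition call.
  def safePartitionCount(requested: Int): Int = math.max(1, requested)
}
```

A guard like `safePartitionCount` would only hide the symptom; if the count is 0 because a split is empty, the empty split is still the thing to fix.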
Dec 22 '21 04:12 ottobricks

~~Since yesterday, I tried using FamilyRunner and it works as long as I don't use "chronological" split method.~~ ~~The error I get with FamilyRunner is different from the above. In my understanding, DropColumnsTransformer drops tunerTrainSplitChronologicalColumn despite the fact that I add it to fieldsToIgnoreInVector.~~

~~In my understanding, columns in fieldsToIgnoreInVector should be left untouched by all transformers, but it doesn't seem to be the case. It is possible to spot the problem with the debug flag. In my experiment, tunerTrainSplitChronologicalColumn -> "id_col", but it is not present in the step output dataset:~~

...
~~Output dataset schema:~~
~~root~~
~~ |-- label_col: integer (nullable = true)~~
~~ |-- automl_internal_id: long (nullable = false)~~
~~ |-- features: vector (nullable = true)~~

~~=== End of class com.databricks.labs.automl.pipeline.DropColumnsTransformer Pipeline Stage log <==~~

~~I will look deeper into this and open a PR to fix it.~~

EDIT: I moved the issue above to its own page, since the repartition(0) issue reported here still persists after moving to FamilyRunner with the "stratified" split method.

Dec 22 '21 06:12 ottobricks

The same error persists with FamilyRunner. I am investigating the problem.

Dec 22 '21 10:12 ottobricks

I believe I have found the problem. Because of a large class imbalance in my label column, at some point Ksplit apparently creates an empty train/test set for the minority label. This should be solved by using one of the oversampling split methods (KSampling recommended).
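One way to sanity-check this hypothesis before choosing a split method is to look at the per-label row counts and see whether any label would contribute zero rows to the test side. The sketch below is plain Scala and assumes a simple proportional split, which is an assumption and not necessarily what the toolkit does internally. `SplitCoverageCheck`, `expectedTestRows`, `labelsAtRisk` and `trainPortion` are hypothetical names. In Spark, the counts could be obtained with something like `df.groupBy("label_col").count()`.

```scala
object SplitCoverageCheck {
  // Rows a label is expected to contribute to the test side,
  // ASSUMING a plain proportional split (not the toolkit's exact logic).
  def expectedTestRows(classCount: Long, trainPortion: Double): Long =
    math.floor(classCount * (1.0 - trainPortion)).toLong

  // Labels whose expected test share rounds down to zero rows.
  def labelsAtRisk(countsByLabel: Map[String, Long], trainPortion: Double): Set[String] =
    countsByLabel.collect {
      case (label, n) if expectedTestRows(n, trainPortion) < 1 => label
    }.toSet
}

// Usage: a heavily skewed binary label with only 3 minority rows.
val counts = Map("0" -> 100000L, "1" -> 3L)
val atRisk = SplitCoverageCheck.labelsAtRisk(counts, trainPortion = 0.8)
```

If `atRisk` is non-empty, the minority label is a plausible source of an empty split, which would be consistent with the explanation above.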

I suggest changing the documentation to recommend KSampling, rather than Stratified, as the preferred split method for heavily imbalanced data. The documentation currently says:

Stratified Mode

    Stratified mode will balance all of the values present in the label column of a classification algorithm so that there is adequate coverage of all available labels in both train and test for each kfold split step.

    It is HIGHLY RECOMMENDED to use this mode if there is a large skew in your label column (class imbalance) and there is a need for training on the full, unmanipulated data set.

Will open a PR tomorrow with the suggested change.

Dec 22 '21 20:12 ottobricks
