automl-toolkit
Issue with DataSplitUtility repartition(0)
When following this tutorial, I encounter the following error during feature selection thrown by DataSplitUtility:
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
The only thing I do differently from the tutorial is setting the trainTestSplitMethod to "chronological", as in:
Map(
...
"tunerTrainSplitMethod" -> "chronological",
"tunerTrainSplitChronologicalColumn" -> "id",
"tunerTrainSplitChronologicalRandomPercentage" -> 0.25,
...
)
Any ideas on how to fix the issue?
I am using:
- Spark 3.2.0
- Hadoop 3.3.1
- Scala 2.12.15
- automl-toolkit 0.8.1
The relevant logs are:
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.catalyst.plans.logical.Repartition.<init>(basicLogicalOperators.scala:1372)
at org.apache.spark.sql.Dataset.repartition(Dataset.scala:3022)
at com.databricks.labs.automl.model.tools.split.SplitOperators$.optimizeTestTrain(SplitOperators.scala:371)
at com.databricks.labs.automl.model.tools.split.DataSplitUtility.$anonfun$trainSplitPersist$1(DataSplitUtility.scala:108)
at com.databricks.labs.automl.model.tools.split.DataSplitUtility.$anonfun$trainSplitPersist$1$adapted(DataSplitUtility.scala:85)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at com.databricks.labs.automl.model.tools.split.DataSplitUtility.trainSplitPersist(DataSplitUtility.scala:85)
at com.databricks.labs.automl.model.tools.split.DataSplitUtility.performSplit(DataSplitUtility.scala:248)
at com.databricks.labs.automl.model.tools.split.DataSplitUtility$.split(DataSplitUtility.scala:291)
at com.databricks.labs.automl.exploration.FeatureImportances.getImportances(FeatureImportances.scala:171)
at com.databricks.labs.automl.exploration.FeatureImportances.generateFeatureImportances(FeatureImportances.scala:382)
~~Since yesterday, I have been trying FamilyRunner, and it works as long as I don't use the "chronological" split method.~~
~~The error I get with FamilyRunner is different from the one above. As far as I can tell, DropColumnsTransformer drops the tunerTrainSplitChronologicalColumn even though I added it to fieldsToIgnoreInVector.~~
~~In my understanding, columns in fieldsToIgnoreInVector should be left untouched by all transformers, but that doesn't seem to be the case. The problem can be spotted with the debug flag. In my experiment, tunerTrainSplitChronologicalColumn -> "id_col", but that column is not present in the step's output dataset:~~
...
~~Output dataset schema: root~~
~~ |-- label_col: integer (nullable = true)~~
~~ |-- automl_internal_id: long (nullable = false)~~
~~ |-- features: vector (nullable = true)~~
~~=== End of class com.databricks.labs.automl.pipeline.DropColumnsTransformer Pipeline Stage log <==~~
~~I will look deeper into this and open a PR to fix it.~~
EDIT: I moved the struck-through issue above to its own page, since the repartition(0) issue reported here still persists after moving to FamilyRunner with the "stratified" split method.
The same error persists with FamilyRunner. I am investigating the problem.
I believe I have found the problem. Due to a large imbalance between the classes in my label column, at some point Ksplit apparently creates an empty train/test set for the minority label. This issue should be solved by using one of the oversampling split methods (KSampling recommended).
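As a rough illustration of this hypothesis, the sketch below checks label counts before splitting and shows how a guarded partition count avoids a value of 0. It is plain Scala with no Spark or automl-toolkit dependency. Everything in it is hypothetical: the object name SplitPreCheck, the rounding rule used to divide rows between train and test, and the idea that the partition count is derived from a row count are assumptions made for illustration, not a description of what SplitOperators.optimizeTestTrain actually does.

```scala
object SplitPreCheck {

  // Assumed rounding rule (not taken from the toolkit): round(count * trainPortion)
  // rows of a label go to train, and the remaining rows go to test.
  def trainTestRows(count: Long, trainPortion: Double): (Long, Long) = {
    val train = math.round(count * trainPortion)
    (train, count - train)
  }

  // Labels for which either side of the split would receive no rows at all.
  def labelsWithEmptySide[L](labelCounts: Map[L, Long], trainPortion: Double): Set[L] =
    labelCounts.filter { case (_, count) =>
      val (train, test) = trainTestRows(count, trainPortion)
      train == 0 || test == 0
    }.keySet

  // Hypothetical guard: a partition count derived from a row count never drops
  // below 1, so a call such as repartition(n) is never handed 0.
  def safePartitionCount(rows: Long, rowsPerPartition: Long): Int =
    math.max(1, math.ceil(rows.toDouble / rowsPerPartition).toInt)
}
```

In a Spark job, the label counts could come from something like a groupBy on the label column followed by count, collected to the driver. Any label reported by labelsWithEmptySide would be a candidate for the empty train/test set described above.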
I suggest changing the documentation to recommend KSampling, rather than Stratified, as the preferred split method for heavily imbalanced data. The documentation currently says:
Stratified Mode
Stratified mode will balance all of the values present in the label column of a classification algorithm so that there is adequate coverage of all available labels in both train and test for each kfold split step.
It is HIGHLY RECOMMENDED to use this mode if there is a large skew in your label column (class imbalance) and there is a need for training on the full, unmanipulated data set.
Will open a PR tomorrow with the suggested change.