Where to specify numClasses metadata in LightGBMClassifier?

Open Nitinsiwach opened this issue 2 years ago • 2 comments

My labelCol is named label

After running VectorAssembler, I run

df_modeling = df_modeling.withColumn('label', col('label').alias('label', metadata={'numClasses': self.num_classes}))

and yet I still get this warning:

com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier: com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier inferred 2 classes for labelCol=LightGBMClassifier_4f7ef3fb1833__labelCol since numClasses was not specified in the column metadata.

It seems a pass over the entire labelCol is executed unnecessarily just to infer the number of classes.

How do I properly specify numClasses as metadata?

Nitinsiwach avatar Feb 14 '22 07:02 Nitinsiwach

@Nitinsiwach that warning comes from Spark code, which is called here:

https://github.com/microsoft/SynapseML/blob/master/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/LightGBMClassifier.scala#L47

This is actually standard behavior for Spark ML models and not unique to LightGBM.

You can find the method defined here in the Spark MLlib codebase: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala#L142

Digging into the method implementation: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/util/MetadataUtils.scala#L38

It seems it just reads the categorical (nominal or binary) ML attribute on the label column rather than a custom numClasses key. Hence, if you used StringIndexer (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html#pyspark.ml.feature.StringIndexer), I think it would add this metadata automatically for you.
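To make that lookup concrete, here is a rough Python paraphrase of what the Scala helper appears to do, based on my reading of the linked source. It is a sketch, not the actual implementation: the function name `num_classes_from_metadata` is made up for illustration, and the key names (`ml_attr`, `type`, `num_vals`, `vals`) are the ones Spark ML attributes use as far as I know. It shows why a top-level `numClasses` key would be ignored.

```python
from typing import Any, Dict, Optional


def num_classes_from_metadata(metadata: Dict[str, Any]) -> Optional[int]:
    """Hypothetical paraphrase of the class-count lookup on label column metadata.

    Returns None when the metadata carries no usable categorical attribute,
    which is the case where Spark falls back to scanning the label column.
    """
    ml_attr = metadata.get("ml_attr")
    if ml_attr is None:
        # No ML attribute at all, e.g. {"numClasses": 2} lands here.
        return None

    # Assumption: a missing "type" is treated as a plain numeric attribute.
    attr_type = ml_attr.get("type", "numeric")
    if attr_type == "binary":
        return 2
    if attr_type == "nominal":
        if "num_vals" in ml_attr:
            return int(ml_attr["num_vals"])
        if "vals" in ml_attr:
            return len(ml_attr["vals"])
    return None


# A custom key is not recognized, so the class count would be inferred from data.
print(num_classes_from_metadata({"numClasses": 2}))  # None

# Metadata shaped like what StringIndexer writes for its output column.
indexed = {"ml_attr": {"type": "nominal", "name": "label", "vals": ["no", "yes"]}}
print(num_classes_from_metadata(indexed))  # 2
```

On a real DataFrame, the dictionary to inspect would be `df.schema["label"].metadata`.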

imatiach-msft avatar Feb 16 '22 04:02 imatiach-msft

You can also try to add this metadata yourself, but it is a bit complex to do that. I'm not sure you want to delve into manipulating Spark metadata if you're not familiar with it. There might be a nice way to add this as a parameter and just pass it to the LightGBMClassifier as well, but I don't think other Spark ML models do that.
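For anyone who does want to try the manual route, a minimal sketch might look like the following. It assumes the label column is already numeric and that Spark reads the class count from a nominal ML attribute stored under the `ml_attr` key. The helper names `nominal_label_metadata` and `with_num_classes` are hypothetical, and I have not verified that LightGBMClassifier keeps this metadata intact through its own preprocessing.

```python
from typing import Any, Dict


def nominal_label_metadata(num_classes: int) -> Dict[str, Any]:
    """Build column metadata describing a nominal attribute with num_classes values.

    Assumption: Spark ML looks under "ml_attr" for "type" and "num_vals".
    """
    if num_classes < 2:
        raise ValueError("a classification label needs at least 2 classes")
    return {"ml_attr": {"type": "nominal", "num_vals": int(num_classes)}}


def with_num_classes(df, label_col: str, num_classes: int):
    """Return df with the nominal attribute attached to label_col (hypothetical helper)."""
    # Imported lazily so the metadata helper above can be used without Spark installed.
    from pyspark.sql.functions import col

    # Column.alias accepts a metadata dict; the label values themselves are unchanged.
    labeled = col(label_col).alias(label_col, metadata=nominal_label_metadata(num_classes))
    return df.withColumn(label_col, labeled)


print(nominal_label_metadata(2))
```

With a DataFrame in hand, usage would be something like `df_modeling = with_num_classes(df_modeling, "label", 2)`, after which `df_modeling.schema["label"].metadata` should show the `ml_attr` entry. Labels are expected to be the indices 0 through num_classes - 1.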

imatiach-msft avatar Feb 16 '22 04:02 imatiach-msft