SynapseML
Where to specify numClasses metadata in LightGBMClassifier?
My `labelCol` is named `label`. After running `VectorAssembler`, I run `df_modeling = df_modeling.withColumn('label', col('label').alias('label', metadata={'numClasses': self.num_classes}))`
and yet I get the warning:
com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier: com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier inferred 2 classes for labelCol=LightGBMClassifier_4f7ef3fb1833__labelCol since numClasses was not specified in the column metadata.
It seems an iteration over the entire `labelCol` is executed unnecessarily to infer the number of classes.
How do I properly specify `numClasses` as metadata?
@Nitinsiwach that warning comes from Spark code, which is called here:
https://github.com/microsoft/SynapseML/blob/master/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/LightGBMClassifier.scala#L47
This is actually standard behavior for Spark ML models and is not unique to LightGBM.
You can find the method defined here in the Spark MLlib codebase: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala#L142
Digging into the method implementation: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/util/MetadataUtils.scala#L38
It seems it just reads the categorical (nominal) attribute stored in the label column's metadata, not a custom `numClasses` key. Hence, if you used StringIndexer: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html#pyspark.ml.feature.StringIndexer I think it would add this metadata automatically for you.
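To make that lookup concrete, here is a rough pure-Python sketch of what I understand the Spark helper to do. It is a simplified mirror and not the actual Spark code: the function name `infer_num_classes` is made up for illustration, and the `ml_attr`, `type`, `vals` and `num_vals` keys are my reading of how Spark ML stores attributes in column metadata.

```python
from typing import Optional


def infer_num_classes(column_metadata: dict) -> Optional[int]:
    """Hypothetical, simplified mirror of Spark's class-count lookup.

    Assumes Spark ML keeps its attribute under the "ml_attr" key of the
    column metadata. Returns None when the count cannot be read, which is
    the case where the classifier falls back to scanning the label column.
    """
    attr = column_metadata.get("ml_attr")
    if not isinstance(attr, dict):
        return None  # no ML attribute at all

    attr_type = attr.get("type", "numeric")
    if attr_type == "binary":
        return 2
    if attr_type == "nominal":
        vals = attr.get("vals")
        if vals:
            return len(vals)  # explicit category values win
        return attr.get("num_vals")  # otherwise use the stored count, if any
    return None  # numeric or unknown attribute: nothing to read


# Metadata in the shape used in the question: the key is not recognised.
from_question = infer_num_classes({"numClasses": 2})

# Metadata in the shape I would expect StringIndexer to produce.
from_indexer_style = infer_num_classes(
    {"ml_attr": {"type": "nominal", "vals": ["0", "1"]}}
)
```

Under these assumptions, `from_question` is `None` and `from_indexer_style` is `2`, which would explain why the custom `numClasses` key does not silence the warning.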
You can also try adding this metadata yourself, but it is a bit complex. I'm not sure you want to delve into manipulating Spark metadata if you're not familiar with it. There might be a nice way to add this as a parameter and just pass it to the LightGBMClassifier as well, but I don't think other Spark ML models do that.
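If you do want to try adding the metadata yourself, a minimal sketch might look like the following. It assumes Spark reads the class count from a nominal attribute under the `ml_attr` metadata key, that the label column is numeric, and that labels are already encoded as 0 to numClasses - 1. The helper names `nominal_label_metadata` and `with_num_classes` are hypothetical, and I have not verified that this removes the warning in SynapseML.

```python
def nominal_label_metadata(num_classes: int) -> dict:
    """Build column metadata describing a nominal label with num_classes values.

    The "ml_attr" / "type" / "num_vals" layout is an assumption about how
    Spark ML encodes attributes; it is not a SynapseML-specific format.
    """
    if num_classes < 2:
        raise ValueError("a classification label needs at least 2 classes")
    return {"ml_attr": {"type": "nominal", "num_vals": num_classes}}


def with_num_classes(df, label_col: str, num_classes: int):
    """Return df with the nominal metadata attached to label_col.

    pyspark is imported lazily so the metadata helper above can be used
    and inspected without a Spark installation.
    """
    from pyspark.sql.functions import col

    metadata = nominal_label_metadata(num_classes)
    # Column.alias accepts a `metadata` keyword, as already used in the question.
    return df.withColumn(label_col, col(label_col).alias(label_col, metadata=metadata))
```

With the DataFrame from the question this would be called as `df_modeling = with_num_classes(df_modeling, 'label', self.num_classes)`, and `df_modeling.schema['label'].metadata` can then be inspected to confirm the attribute is present before fitting.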