[SPARK-37178][ML] Add Target Encoding to ml.feature
What changes were proposed in this pull request?
Adds support for target encoding of ML features. Target encoding maps a column of categorical indices to a numerical feature derived from the target. By leveraging the relationship between categorical variables and the target variable, target encoding usually performs better than one-hot encoding, while avoiding the need to add extra columns.
Why are the changes needed?
Target encoding is a well-known encoding technique for categorical features. It is supported in most ML frameworks, e.g.:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html
https://search.r-project.org/CRAN/refmans/dataPreparation/html/target_encode.html
Does this PR introduce any user-facing change?
The Spark API now includes 2 new classes in package org.apache.spark.ml.feature:
- TargetEncoder (estimator)
- TargetEncoderModel (transformer)
How was this patch tested?
- Scala => org.apache.spark.ml.feature.TargetEncoderSuite
- Java => org.apache.spark.ml.feature.JavaTargetEncoderSuite
- Python => python.pyspark.ml.tests.test_feature.FeatureTests (added 2 tests)
Was this patch authored or co-authored using generative AI tooling?
No
Some design notes:
- binary and continuous target types (no multi-label yet)
- available in Scala, Java and Python APIs
- fitting implemented on the RDD API (treeAggregate)
- transformation implemented on the DataFrame API (no UDFs)
- categorical features must be indices (integers) in Double-typed columns (as if StringIndexer were used before)
- categories unseen in training are represented as class -1.0
- Encodings structure
  - Map[String, Map[Double, Double]] => Map[ feature_name, Map[ original_category, encoded_category ] ]
- Parameters
  - inputCol(s) / outputCol(s) / labelCol => as usual
  - targetType
    - binary => encodings calculated as in-category conditional probability (counting)
    - continuous => encodings calculated as in-category target mean (incrementally)
  - handleInvalid
    - error => raises an error when trying to encode an unseen category
    - keep => encodes an unseen category with the overall statistics
  - smoothing => controls how in-category stats and overall stats are weighted when calculating the final encodings (to avoid overfitting)
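To make the design notes concrete, here is a minimal plain-Python sketch of fitting smoothed encodings for one binary-target column. This is not the PR's Scala code; the function name and the exact blending formula (`(n * cat_stat + smoothing * global_stat) / (n + smoothing)`, a common formulation) are illustrative assumptions — see the linked source for the actual implementation.

```python
from collections import defaultdict

def fit_encodings(categories, targets, smoothing=0.0):
    """Illustrative sketch: smoothed target encodings for one column.

    For a binary target, the in-category statistic is the conditional
    probability P(target=1 | category) and the overall statistic is the
    global rate. Smoothing blends the two (assumed formulation):
        encoding = (n * cat_stat + smoothing * global_stat) / (n + smoothing)
    """
    counts = defaultdict(int)
    sums = defaultdict(float)
    for cat, y in zip(categories, targets):
        counts[cat] += 1
        sums[cat] += y
    global_stat = sum(targets) / len(targets)
    encodings = {}
    for cat, n in counts.items():
        cat_stat = sums[cat] / n
        encodings[cat] = (n * cat_stat + smoothing * global_stat) / (n + smoothing)
    # class -1.0 represents unseen categories; with handleInvalid="keep"
    # they are encoded with the overall statistic
    encodings[-1.0] = global_stat
    return encodings

# Toy example: category 1.0 has target rate 1/3, categories 0.0 and 2.0 have rate 1
cats = [0.0, 0.0, 1.0, 1.0, 1.0, 2.0]
ys = [1, 1, 0, 0, 1, 1]
enc = fit_encodings(cats, ys, smoothing=0.0)
```

With a large smoothing value, every encoding shrinks toward the global statistic, which is what protects rare categories from overfitting.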
Let me call in @zhengruifeng for a look at this too. I think it's pretty good.
also cc @WeichenXu123 for visibility
I think we should pass raw estimates to the model and calculate encodings in transform(), so we can apply different smoothing factors without having to re-fit. Makes sense? Will work on this ...
https://github.com/apache/spark/blob/5a67c503ce1fea57de5429ff915783d14ba0f7cf/mllib/src/main/scala/org/apache/spark/ml/feature/TargetEncoder.scala#L253
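The idea can be sketched in plain Python (function names and formula are illustrative, not the PR's actual Scala API): the fitted model keeps only raw sufficient statistics per category, and smoothing is applied lazily at transform time, so changing the smoothing parameter requires no re-fit.

```python
def fit_stats(categories, targets):
    """Fit pass: keep only raw sufficient statistics (count, sum) per category."""
    stats = {}
    for cat, y in zip(categories, targets):
        n, s = stats.get(cat, (0, 0.0))
        stats[cat] = (n + 1, s + y)
    total_n = sum(n for n, _ in stats.values())
    total_s = sum(s for _, s in stats.values())
    return stats, (total_n, total_s)

def transform(categories, stats, overall, smoothing):
    """Transform pass: derive smoothed encodings from raw stats on the fly."""
    total_n, total_s = overall
    global_stat = total_s / total_n
    out = []
    for cat in categories:
        n, s = stats.get(cat, (0, 0.0))
        if n == 0:
            out.append(global_stat)  # handleInvalid="keep" behaviour (assumed)
        else:
            # equivalent to (n * cat_mean + smoothing * global_stat) / (n + smoothing)
            out.append((s + smoothing * global_stat) / (n + smoothing))
    return out
```

Because transform() only reads `(count, sum)` pairs, two transforms with different smoothing values reuse the same fitted statistics.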
done!
@srowen @zhengruifeng
@zhengruifeng
I think it looks good. There are 'failing' tests, but it looks like a timeout. I'll run them again to see if they complete. Does anyone know about issues with the builder at the moment?
@HyukjinKwon @zhengruifeng
Merged to master.