
[SPARK-37178][ML] Add Target Encoding to ml.feature

rebo16v opened this issue 1 year ago

What changes were proposed in this pull request?

Adds support for target encoding of ML features. Target encoding maps a column of categorical indices to a numerical feature derived from the target variable. By leveraging the relationship between the categorical variable and the target, target encoding usually performs better than one-hot encoding, while avoiding the need to add extra columns.

Why are the changes needed?

Target encoding is a well-known encoding technique for categorical features. It's supported in most ML frameworks, e.g.:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html
https://search.r-project.org/CRAN/refmans/dataPreparation/html/target_encode.html

Does this PR introduce any user-facing change?

The Spark API now includes 2 new classes in package org.apache.spark.ml.feature:

  • TargetEncoder (estimator)
  • TargetEncoderModel (transformer)

How was this patch tested?

  • Scala => org.apache.spark.ml.feature.TargetEncoderSuite
  • Java => org.apache.spark.ml.feature.JavaTargetEncoderSuite
  • Python => python.pyspark.ml.tests.test_feature.FeatureTests (added 2 tests)

Was this patch authored or co-authored using generative AI tooling?

No

Some design notes ...

  • binary and continuous target types (no multi-label yet)

  • available in Scala, Java and Python APIs

  • fitting implemented on RDD API (treeAggregate)

  • transformation implemented on Dataframe API (no UDFs)

  • categorical features must be indices (integers) in Double-typed columns (as if StringIndexer were used before)

  • categories unseen during training are represented as class -1.0

  • Encodings structure

    • Map[String, Map[Double, Double]] => Map[ feature_name, Map[ original_category, encoded_category ] ]
  • Parameters

    • inputCol(s) / outputCol(s) / labelCol => as usual
    • targetType
      • binary => encodings calculated as in-category conditional probability (counting)
      • continuous => encodings calculated as in-category target mean (incrementally)
    • handleInvalid
      • error => raises an error if trying to encode an unseen category
      • keep => encodes an unseen category with the overall statistics
    • smoothing => controls how in-category stats and overall stats are weighted to calculate final encodings (to avoid overfitting)
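The smoothing parameter described above is commonly implemented as a count-weighted blend of the in-category mean and the global mean. A hypothetical sketch of one such formula (the exact formula used by the PR may differ):

```python
# Hypothetical sketch of smoothing: blend the in-category mean with the
# overall (global) mean, weighting the category by its observation count
# and the global mean by the `smoothing` parameter. This is NOT the exact
# Spark implementation, just an illustration of the idea.
def smoothed_encoding(cat_sum, cat_count, global_mean, smoothing):
    cat_mean = cat_sum / cat_count
    return (cat_count * cat_mean + smoothing * global_mean) / (cat_count + smoothing)

# smoothing = 0 -> raw in-category mean; large smoothing -> pulled
# toward the global mean, which reduces overfitting on rare categories.
print(smoothed_encoding(3.0, 4, 0.5, 0.0))    # 0.75 (raw category mean)
print(smoothed_encoding(3.0, 4, 0.5, 100.0))  # close to the global mean 0.5
```

This is why smoothing helps against overfitting: categories with few observations contribute little evidence, so their encodings shrink toward the overall statistic.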

rebo16v avatar Oct 04 '24 09:10 rebo16v

Let me call in @zhengruifeng for a look at this too. I think it's pretty good.

srowen avatar Oct 19 '24 03:10 srowen

also cc @WeichenXu123 for visibility

zhengruifeng avatar Oct 20 '24 01:10 zhengruifeng

I think we should pass raw estimates to the model and calculate encodings in transform(), so we can apply different smoothing factors without having to re-fit. Makes sense? Will work on this ... https://github.com/apache/spark/blob/5a67c503ce1fea57de5429ff915783d14ba0f7cf/mllib/src/main/scala/org/apache/spark/ml/feature/TargetEncoder.scala#L253
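The idea, keeping only raw per-category statistics in the fitted model and deriving smoothed encodings lazily at transform time, can be sketched roughly like this (hypothetical names, plain Python, not the actual Scala code):

```python
# Rough sketch of the proposal: the fitted model stores raw per-category
# (sum, count) statistics; smoothed encodings are computed in transform(),
# so changing `smoothing` does not require re-fitting the estimator.
class TargetEncoderModelSketch:
    def __init__(self, stats, global_sum, global_count):
        self.stats = stats  # {category: (target_sum, count)}
        self.global_mean = global_sum / global_count

    def transform(self, categories, smoothing):
        out = []
        for c in categories:
            s, n = self.stats[c]
            # equivalent to (n * cat_mean + smoothing * global_mean) / (n + smoothing)
            out.append((s + smoothing * self.global_mean) / (n + smoothing))
        return out

model = TargetEncoderModelSketch({0.0: (1.0, 2), 1.0: (2.0, 3)}, 3.0, 5)
print(model.transform([0.0, 1.0], smoothing=0.0))   # raw means: [0.5, 2/3]
print(model.transform([0.0, 1.0], smoothing=10.0))  # both pulled toward 0.6
```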

rebo16v avatar Oct 21 '24 07:10 rebo16v

> I think we should pass raw estimates to the model and calculate encodings in transform(), so we can apply different smoothing factors without having to re-fit. Makes sense? Will work on this ...
>
> https://github.com/apache/spark/blob/5a67c503ce1fea57de5429ff915783d14ba0f7cf/mllib/src/main/scala/org/apache/spark/ml/feature/TargetEncoder.scala#L253

done!

rebo16v avatar Oct 23 '24 20:10 rebo16v

@srowen @zhengruifeng

rebo16v avatar Oct 27 '24 19:10 rebo16v

@zhengruifeng

rebo16v avatar Oct 28 '24 21:10 rebo16v

I think it looks good. There are 'failing' tests but it looks like a timeout. I'll run again to see if they complete. Anyone know about issues with the builder at the moment?

srowen avatar Nov 02 '24 13:11 srowen

@HyukjinKwon @zhengruifeng

rebo16v avatar Nov 06 '24 21:11 rebo16v

Merged to master.

HyukjinKwon avatar Nov 06 '24 23:11 HyukjinKwon