
[SPARK-37178][ML] Add Target Encoding to ml.feature

rebo16v opened this issue 1 year ago

What changes were proposed in this pull request?

Adds support for target encoding of ML features. Target encoding maps a column of categorical indices to a numerical feature derived from the target variable. By leveraging the relationship between the categorical variable and the target, target encoding usually performs better than one-hot encoding, while avoiding the need to add extra columns.

Why are the changes needed?

Target encoding is a well-known encoding technique for categorical features. It's supported in most ML frameworks, e.g.:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html
https://search.r-project.org/CRAN/refmans/dataPreparation/html/target_encode.html

Does this PR introduce any user-facing change?

The Spark API now includes 2 new classes in package org.apache.spark.ml.feature:

  • TargetEncoder (estimator)
  • TargetEncoderModel (transformer)

How was this patch tested?

  • Scala => org.apache.spark.ml.feature.TargetEncoderSuite
  • Java => org.apache.spark.ml.feature.JavaTargetEncoderSuite
  • Python => python.pyspark.ml.tests.test_feature.FeatureTests (added 2 tests)

Was this patch authored or co-authored using generative AI tooling?

No

Some design notes ...

  • binary and continuous target types (no multi-label yet)

  • available in Scala, Java and Python APIs

  • fitting implemented on RDD API (treeAggregate)

  • transformation implemented on Dataframe API (no UDFs)

  • categorical features must be indices (integers) in Double-typed columns (as if StringIndexer were used before)

  • categories unseen during training are represented as class -1.0

  • Encodings structure

    • Map[String, Map[Double, Double]] => Map[ feature_name, Map[ original_category, encoded_category ] ]
  • Parameters

    • inputCol(s) / outputCol(s) / labelCol => as usual
    • targetType
      • binary => encodings calculated as in-category conditional probability (counting)
      • continuous => encodings calculated as in-category target mean (incrementally)
    • handleInvalid
      • error => raises an error if trying to encode an unseen category
      • keep => encodes an unseen category with the overall statistics
    • smoothing => controls how in-category stats and overall stats are weighted to calculate final encodings (to avoid overfitting)
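The smoothing parameter described above is commonly implemented as a count-weighted blend of the in-category mean and the global mean. A hypothetical sketch of one such formula (the exact formula used by the PR may differ):

```python
# Hypothetical sketch of smoothing: blend the in-category mean with the
# overall (global) mean, weighting the category by its observation count
# and the global mean by the `smoothing` parameter. This is NOT the exact
# Spark implementation, just an illustration of the idea.
def smoothed_encoding(cat_sum, cat_count, global_mean, smoothing):
    cat_mean = cat_sum / cat_count
    return (cat_count * cat_mean + smoothing * global_mean) / (cat_count + smoothing)

# smoothing = 0 -> raw in-category mean; large smoothing -> pulled
# toward the global mean, which reduces overfitting on rare categories.
print(smoothed_encoding(3.0, 4, 0.5, 0.0))    # 0.75 (raw category mean)
print(smoothed_encoding(3.0, 4, 0.5, 100.0))  # close to the global mean 0.5
```

This is why smoothing helps against overfitting: categories with few observations contribute little evidence, so their encodings shrink toward the overall statistic.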

rebo16v avatar Oct 04 '24 09:10 rebo16v

Let me call in @zhengruifeng for a look at this too. I think it's pretty good.

srowen avatar Oct 19 '24 03:10 srowen

also cc @WeichenXu123 for visibility

zhengruifeng avatar Oct 20 '24 01:10 zhengruifeng

I think we should pass raw estimates to the model and calculate encodings in transform(), so we can apply different smoothing factors without having to re-fit. Makes sense? Will work on this ... https://github.com/apache/spark/blob/5a67c503ce1fea57de5429ff915783d14ba0f7cf/mllib/src/main/scala/org/apache/spark/ml/feature/TargetEncoder.scala#L253
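The idea, keeping only raw per-category statistics in the fitted model and deriving smoothed encodings lazily at transform time, can be sketched roughly like this (hypothetical names, plain Python, not the actual Scala code):

```python
# Rough sketch of the proposal: the fitted model stores raw per-category
# (sum, count) statistics; smoothed encodings are computed in transform(),
# so changing `smoothing` does not require re-fitting the estimator.
class TargetEncoderModelSketch:
    def __init__(self, stats, global_sum, global_count):
        self.stats = stats  # {category: (target_sum, count)}
        self.global_mean = global_sum / global_count

    def transform(self, categories, smoothing):
        out = []
        for c in categories:
            s, n = self.stats[c]
            # equivalent to (n * cat_mean + smoothing * global_mean) / (n + smoothing)
            out.append((s + smoothing * self.global_mean) / (n + smoothing))
        return out

model = TargetEncoderModelSketch({0.0: (1.0, 2), 1.0: (2.0, 3)}, 3.0, 5)
print(model.transform([0.0, 1.0], smoothing=0.0))   # raw means: [0.5, 2/3]
print(model.transform([0.0, 1.0], smoothing=10.0))  # both pulled toward 0.6
```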

rebo16v avatar Oct 21 '24 07:10 rebo16v

> I think we should pass raw estimates to the model and calculate encodings in transform(), so we can apply different smoothing factors without having to re-fit. Makes sense? Will work on this ...
>
> https://github.com/apache/spark/blob/5a67c503ce1fea57de5429ff915783d14ba0f7cf/mllib/src/main/scala/org/apache/spark/ml/feature/TargetEncoder.scala#L253

done!

rebo16v avatar Oct 23 '24 20:10 rebo16v

@srowen @zhengruifeng

rebo16v avatar Oct 27 '24 19:10 rebo16v

@zhengruifeng

rebo16v avatar Oct 28 '24 21:10 rebo16v

I think it looks good. There are 'failing' tests but it looks like a timeout. I'll run again to see if they complete. Anyone know about issues with the builder at the moment?

srowen avatar Nov 02 '24 13:11 srowen

@HyukjinKwon @zhengruifeng

rebo16v avatar Nov 06 '24 21:11 rebo16v

Merged to master.

HyukjinKwon avatar Nov 06 '24 23:11 HyukjinKwon