[FEA] Improve Target Encoding - Providing Leave-One-Out Strategy

Open bschifferer opened this issue 3 years ago • 0 comments

Is your feature request related to a problem? Please describe. Currently, we have an operator, TargetEncoding, which provides a k-fold out-of-fold strategy. The training dataset is split into k chunks, and the Target Encoding values for the k-th chunk are calculated using all chunks except the k-th.
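To make the k-fold out-of-fold idea concrete, here is a minimal pure-Python sketch. It is not NVTabular's implementation: the function name `kfold_target_encode`, the round-robin fold assignment, and the smoothed form (smoothing factor added in the denominator) are assumptions made for illustration. It also uses the global target mean over all rows as a simplification.

```python
def kfold_target_encode(categories, targets, k=5, smoothing_factor=1.0):
    """Encode each row using target statistics from all folds except its own.

    Hypothetical helper for illustration; assumes smoothing_factor > 0.
    """
    n = len(targets)
    target_avg = sum(targets) / n  # global mean, used as the smoothing prior
    fold_of = [i % k for i in range(n)]  # assumed round-robin fold assignment
    encoded = [0.0] * n

    for fold in range(k):
        # Per-category sum and count over every row NOT in the current fold.
        sums, counts = {}, {}
        for cat, y, f in zip(categories, targets, fold_of):
            if f != fold:
                sums[cat] = sums.get(cat, 0.0) + y
                counts[cat] = counts.get(cat, 0) + 1

        # Encode the rows of the current fold with those out-of-fold statistics.
        for i in range(n):
            if fold_of[i] == fold:
                cat = categories[i]
                encoded[i] = (
                    sums.get(cat, 0.0) + target_avg * smoothing_factor
                ) / (counts.get(cat, 0) + smoothing_factor)
    return encoded
```

Each of the k passes rebuilds the per-category statistics, which is where the repeated work on large datasets comes from.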

This calculation can be inefficient for large datasets. With 100M users and a 5-fold strategy, we could end up calculating the statistics for 500M rows. As a user, I want to use a new method: a leave-one-out strategy.

Train

TE_i = ((target_sum - target_i) + target_avg * smoothing_factor) / ((target_count - 1) + smoothing_factor)

Valid

TE_i = (target_sum + target_avg * smoothing_factor) / (target_count + smoothing_factor)

Note that the transform function needs to behave differently depending on whether it is applied to the train or valid dataset.
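A minimal pure-Python sketch of the proposed leave-one-out strategy follows. The function names (`fit_target_stats`, `loo_encode_train`, `encode_valid`) are hypothetical and not part of NVTabular. The sketch assumes the conventional smoothed form, with the smoothing factor added in the denominator, and a smoothing factor greater than zero so that a category with a single training row does not divide by zero.

```python
from collections import defaultdict


def fit_target_stats(categories, targets):
    """Compute per-category target sum and count, plus the global target mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    target_avg = sum(targets) / len(targets)
    return dict(sums), dict(counts), target_avg


def loo_encode_train(categories, targets, stats, smoothing_factor=1.0):
    """Train rows: remove each row's own target from its category statistics."""
    sums, counts, target_avg = stats
    return [
        ((sums[cat] - y) + target_avg * smoothing_factor)
        / ((counts[cat] - 1) + smoothing_factor)
        for cat, y in zip(categories, targets)
    ]


def encode_valid(categories, stats, smoothing_factor=1.0):
    """Valid rows: use the full training statistics; unseen categories get the prior."""
    sums, counts, target_avg = stats
    return [
        (sums.get(cat, 0.0) + target_avg * smoothing_factor)
        / (counts.get(cat, 0) + smoothing_factor)
        for cat in categories
    ]
```

The statistics are computed once in a single pass, and the train and valid paths differ only in whether the row's own target is subtracted, which is why the two cases need separate transform logic.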

bschifferer avatar Sep 18 '22 13:09 bschifferer