NVTabular
[FEA] Improve Target Encoding - Providing Leave-One-Out Strategy
Is your feature request related to a problem? Please describe.
Currently, we have an operator TargetEncoding, which provides a k-fold out-of-fold strategy: the training dataset is split into k chunks, and the target-encoding values for the k-th chunk are calculated from all chunks except the k-th.
This calculation can be inefficient for large datasets. With 100M users and a 5-fold strategy, we would compute statistics over 500M rows. As a user, I want to use a new method: a leave-one-out strategy.
Train
TE_i = ((target_sum - target_i) + target_avg * smoothing_factor) / ((target_count - 1) + smoothing_factor)
Valid
TE_i = (target_sum + target_avg * smoothing_factor) / (target_count + smoothing_factor)
Note that the transform function must behave differently depending on whether it is applied to the train or the valid dataset.
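A minimal sketch of the proposed behavior, assuming the per-category aggregates (target_sum, target_count) and the global mean (target_avg) are fit once on the training data. All function and parameter names here are hypothetical, not NVTabular's API:

```python
import pandas as pd

def fit_stats(train_df, cat_col, target_col):
    """Compute the global prior and per-category sum/count on the train set."""
    prior = train_df[target_col].mean()  # target_avg
    stats = train_df.groupby(cat_col)[target_col].agg(["sum", "count"])
    return prior, stats

def loo_transform(df, cat_col, target_col, prior, stats,
                  smoothing=1.0, is_train=True):
    """Leave-one-out target encoding; train rows exclude their own target."""
    sums = df[cat_col].map(stats["sum"]).fillna(0.0)
    counts = df[cat_col].map(stats["count"]).fillna(0.0)
    if is_train:
        # TE_i = ((sum - y_i) + prior * smoothing) / ((count - 1) + smoothing)
        return (sums - df[target_col] + prior * smoothing) / (counts - 1 + smoothing)
    # TE_i = (sum + prior * smoothing) / (count + smoothing)
    return (sums + prior * smoothing) / (counts + smoothing)

train = pd.DataFrame({"user": ["a", "a", "a", "b"], "clicked": [1, 0, 1, 1]})
prior, stats = fit_stats(train, "user", "clicked")
te_train = loo_transform(train, "user", "clicked", prior, stats)
valid = pd.DataFrame({"user": ["a", "b"], "clicked": [0, 1]})
te_valid = loo_transform(valid, "user", "clicked", prior, stats, is_train=False)
```

This needs only a single pass over the training data to build the group statistics, instead of recomputing them once per fold; unseen categories at validation time fall back to the prior via the `fillna(0.0)` on sums and counts.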