category_encoders
Target encoding a feature where multiple values are allowed?
I don't believe I saw it, but does this library currently handle this sort of data? The simplest example I can think of is items in a shopping cart. The order of items doesn't matter, more than one item can be in the cart, there are many possible values, and there are very many unique combinations.
If support for this doesn't exist, I am curious whether target encoding would be a reasonable approach for representing this data to feed into some scikit-learn algorithms, and whether the array of values would need to be reduced to a single value. Thanks
The library does not currently support that, but PRs are always welcome.
A simple workaround is to sort all items in the cart alphabetically, so [A,B] and [B,A] will always be represented by [A,B], and then treat the sorted cart as a single string value. Then you can use TargetEncoder on that column. The good thing is that the encoding captures interactions between the items, because each combination gets its own value. The disadvantage is that the content of many carts will appear just once in the whole dataset, even if the individual items in the carts are common.
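A minimal sketch of that workaround, in case it helps. The helper names `cart_key` and `target_encode_carts`, the `"|"` separator and the `cart` column name are all made up for illustration; the separator is assumed not to occur inside item names.

```python
def cart_key(items, sep="|"):
    """Order-independent string key for a cart, e.g. ["B", "A"] -> "A|B"."""
    return sep.join(sorted(str(item) for item in items))


def target_encode_carts(carts, y):
    """Target-encode whole carts via their canonical string key."""
    # Third-party imports live here so cart_key itself has no dependencies.
    import pandas as pd
    import category_encoders as ce

    X = pd.DataFrame({"cart": [cart_key(c) for c in carts]})
    encoder = ce.TargetEncoder(cols=["cart"])
    # Returns a DataFrame whose "cart" column holds the encoded values.
    return encoder.fit_transform(X, pd.Series(y))


# [B, A] and [A, B] collapse to the same category.
keys = [cart_key(c) for c in [["B", "A"], ["A", "B"], ["C"]]]
```

Duplicated items are kept here, so a cart with two of the same item gets a different key than a cart with one; wrap `items` in `set(...)` if quantity should not matter.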
If you can spend a while developing, then you may consider so-called "propositionalization":
- Represent the whole dataset as a sparse matrix, where rows are carts and columns are items. If the item appears in the cart, the cell contains 1 (or the count of the purchased items), otherwise it contains 0.
- Use TargetEncoder to train the weights for each item.
- Calculate basic aggregate statistics like min, max, mean and standard deviation of the item weights for each cart, and pass these aggregates as features to the downstream model.
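The steps above could look roughly like the sketch below. To keep it dependency-free it does not call TargetEncoder; the per-item weight is a hand-rolled smoothed mean of the target (simple additive smoothing toward the global mean), which is only a stand-in for what TargetEncoder would learn. The function names and the `smoothing` parameter are hypothetical, and the sketch uses item presence rather than counts.

```python
from collections import defaultdict
from statistics import mean, pstdev


def item_weights(carts, targets, smoothing=1.0):
    """Smoothed mean target per item, plus the global prior."""
    prior = mean(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cart, y in zip(carts, targets):
        for item in set(cart):  # presence only, quantities are ignored
            sums[item] += y
            counts[item] += 1
    # Shrink each item's mean toward the prior; rare items stay close to it.
    weights = {
        item: (sums[item] + smoothing * prior) / (counts[item] + smoothing)
        for item in counts
    }
    return weights, prior


def cart_features(cart, weights, prior):
    """Aggregate the item weights of one cart into fixed-length features."""
    # Unseen items and empty carts fall back to the prior.
    vals = [weights.get(item, prior) for item in set(cart)] or [prior]
    return {"min": min(vals), "max": max(vals),
            "mean": mean(vals), "std": pstdev(vals)}


carts = [["A", "B"], ["B"], ["A", "C"], ["C"]]
targets = [1, 0, 1, 0]
weights, prior = item_weights(carts, targets)
features = [cart_features(c, weights, prior) for c in carts]
```

In real use the weights should be fitted on training folds only, otherwise the target leaks into the features.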
Finally, you may consider using a model that can work directly with sparse data.
Either way, this sounds like a use case that will reappear many times, so we are interested in learning what worked (and what didn't).
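One possible shape of the sparse route, using scikit-learn rather than this library: `MultiLabelBinarizer` with `sparse_output=True` builds the cart-by-item matrix, and `LogisticRegression` accepts it as is. The toy carts and labels are invented, and the choice of model is just an assumption for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer

carts = [["A", "B"], ["B"], ["A", "C"], ["C"], ["A"], ["B", "C"]]
y = [1, 0, 1, 0, 1, 0]

# Rows are carts, columns are items, cells are 0/1 presence flags.
mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(carts)  # scipy sparse matrix, never densified

model = LogisticRegression().fit(X, y)

# Encode a new cart with the same item vocabulary and score it.
X_new = mlb.transform([["A", "C"]])
proba = model.predict_proba(X_new)[0, 1]
```

Note that this encodes presence only; if purchase counts matter, the matrix would have to be built another way.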