Metadata for encoders
It should be possible to programmatically differentiate between encoders that:
- Do not require any target during fitting (like OneHotEncoder).
- Require some target during fitting (like TargetEncoder).
- Require binary target during fitting (like WOE).
This information would be useful for:
- Parameterized tests.
- Users who wonder which encoder they may use.
Proposed implementation: Like in scikit-learn.
How would you do that with scikit-learn / how do you want to implement this?
I thought that there is some standardized way, but I did not find any reference. Options:
- Create a few abstract classes; based on the inheritance, we could infer the properties of the encoders.
- The encoders could implement a couple of methods like 'accepts_continuous_target()' and 'accepts_binary_target()' or implement a single method like supportsCapability in RapidMiner.
- The encoders could have attributes like 'accepts_continuous_target' and 'accepts_binary_target'.
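The attribute-based option could look something like the following. The attribute names and the encoder stubs are purely illustrative, not the current category_encoders API:

```python
# A sketch of the attribute-based option. The attribute names and the
# encoder stubs below are illustrative, not category_encoders' API.
class OneHotEncoder:
    accepts_continuous_target = False  # unsupervised: ignores y
    accepts_binary_target = False

class TargetEncoder:
    accepts_continuous_target = True
    accepts_binary_target = True

class WOEEncoder:
    accepts_continuous_target = False  # weight of evidence needs two classes
    accepts_binary_target = True

def encoders_accepting_binary_target(encoders):
    """Filter encoder classes by the hypothetical attribute."""
    return [e for e in encoders if getattr(e, "accepts_binary_target", False)]
```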
The easy / hackish thing to do would be inspecting the signature and seeing whether y is a required argument. Hopefully there will soon be estimator tags in sklearn, which will allow you to specify this kind of information, but they are not really standardized yet.
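The signature-inspection hack could be sketched like this, using minimal stand-in classes (real encoders would be inspected the same way):

```python
import inspect

# Minimal stand-ins to demonstrate the signature-inspection heuristic;
# real encoder classes would be inspected the same way.
class UnsupervisedEncoder:
    def fit(self, X, y=None):
        return self

class SupervisedEncoder:
    def fit(self, X, y):
        return self

def requires_target(encoder_cls):
    """Return True if fit() declares y without a default value."""
    y = inspect.signature(encoder_cls.fit).parameters.get("y")
    return y is not None and y.default is inspect.Parameter.empty
```

As noted, this is fragile: it only reflects how the signature was written, not what fit() actually does with y.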
I found it: https://scikit-learn.org/stable/developers/contributing.html#estimator-types
Yes, but that doesn't tell you whether a transformer requires y to fit (I probably wrote that section).
In that case it looks like you are the right person to discuss it.
I am not picky about the names used or the exact mechanism by which this is implemented. The required functionality could be implemented with the following tags:
- [ ] supports continuous target
- [ ] supports binomial target
I used the term "target" instead of the more common "label", because "target" works well for both continuous and discrete dependent variables, while "label" is arguably appropriate only for discrete dependent variables.
While some of the encoders require the target to take values {0,1}, I believe that they should be refactored in some distant future to support targets like {'no', 'yes'}, {'negative', 'positive'} or any other set of exactly two values. Hence, I prefer the more general and future-proof term "binomial" instead of "binary", which would suggest that the target takes only values {0,1}.
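The refactoring could boil down to a small normalization step like the one below. The function name and the alphabetical ordering of the two classes are assumptions on my side:

```python
def binarize_target(y):
    """Map a target with exactly two distinct values onto {0, 1}.

    A sketch of the proposed refactoring; the alphabetical ordering
    of the two classes is an assumption, not a library convention.
    """
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("expected a binomial target with exactly two values")
    mapping = {classes[0]: 0, classes[1]: 1}
    return [mapping[v] for v in y]
```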
But of course, the terminology taken by auto-sklearn is nice as well.
I am not sure how to handle non-target encoders. Options:
- Introduce 'is_target_encoder' and 'is_encoder' as an analogy to 'is_classifier' and 'is_regressor'.
- Return false for both tags.
- Something else entirely.
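The first option could mirror sklearn's is_classifier / is_regressor helpers. The '_tags' dict and the 'requires_target' tag name below are illustrative assumptions:

```python
# Sketch of a helper mirroring sklearn's is_classifier / is_regressor.
# The '_tags' dict and the 'requires_target' tag name are assumptions.
class TargetEncoder:
    _tags = {"requires_target": True}

class OneHotEncoder:
    _tags = {"requires_target": False}

def is_target_encoder(estimator):
    """True for supervised (target) encoders, False otherwise."""
    return getattr(estimator, "_tags", {}).get("requires_target", False)
```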
Either way, the following tests in 'test_encoders' should no longer need to hard-code the names of the encoders:
- test_impact_encoders
- test_tmp_column_name
- test_unique_column_is_not_predictive
- test_get_feature_names
- test_get_feature_names_drop_invariant
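Concretely, those tests could discover their parameters by tag instead of by name. Again, the attribute and helper names here are assumptions:

```python
# Sketch: discovering test parameters by tag instead of hard-coding
# encoder names. Attribute and helper names are assumptions.
class OneHotEncoder:
    accepts_binary_target = False

class WOEEncoder:
    accepts_binary_target = True

ALL_ENCODERS = [OneHotEncoder, WOEEncoder]

def encoders_for_target_tests():
    """Encoders that a target-based test should be parameterized over."""
    return [e for e in ALL_ENCODERS
            if getattr(e, "accepts_binary_target", False)]

# With pytest this could back a parameterized test, e.g.:
# @pytest.mark.parametrize("encoder", encoders_for_target_tests())
```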
There is also one more tag that could be useful:
- [ ] supports unknown/new target values in the test set
This tag would be nice in the following test:
- test_handle_unknown_error
Another useful metadata tag could be:
- [ ] supports inverse transform
This could help, for example, in test_inverse_transform.
Sorry for the slow reply. In sklearn we generally use the word "target", and we don't use "binary" to mean 0 and 1; we use it to mean any two-class classification problem. But you can also extend the encoders to multiclass pretty easily, right?
Good to know that we are aligned. The binary encoders can be extended to work on multiclass. But I would hesitate to call it easy.
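One possible multiclass extension is a one-vs-rest decomposition of mean target encoding, producing one encoded value per (category, class) pair. This is a sketch of the idea, not the library's API:

```python
from collections import defaultdict

def multiclass_target_encode(column, y):
    """One-vs-rest extension of mean target encoding: for each class,
    encode each category as the mean of the class-indicator.
    A sketch of the idea, not category_encoders' API."""
    classes = sorted(set(y))
    encoding = {}
    for c in classes:
        sums = defaultdict(float)
        counts = defaultdict(int)
        for value, target in zip(column, y):
            sums[value] += 1.0 if target == c else 0.0
            counts[value] += 1
        encoding[c] = {v: sums[v] / counts[v] for v in counts}
    return encoding
```

Note that this multiplies the number of output columns by the number of classes, which is part of why I would hesitate to call the extension easy.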
And I started to call target encoders "supervised" and non-target encoders "unsupervised". The advantage is that this terminology describes the encoders well and people are already familiar with the terms.