category_encoders

Metadata for encoders

Open janmotl opened this issue 6 years ago • 9 comments

It should be possible to programmatically differentiate between encoders that:

  1. Do not require any target during fitting (like OneHotEncoder).
  2. Require some target during fitting (like TargetEncoder).
  3. Require binary target during fitting (like WOE).

This information would be useful for:

  1. Parameterized tests.
  2. Users who wonder which encoder they may use.

Proposed implementation: Like in scikit-learn.

janmotl avatar Jan 05 '19 20:01 janmotl

How would you do that with scikit-learn / how do you want to implement this?

amueller avatar Feb 05 '19 22:02 amueller

I thought that there was some standardized way, but I did not find any reference. Options:

  1. Create a few abstract classes; the inheritance hierarchy would then tell us the properties of each encoder.
  2. The encoders could implement a couple of methods like 'accepts_continuous_target()' and 'accepts_binary_target()' or implement a single method like supportsCapability in RapidMiner.
  3. The encoders could have attributes like 'accepts_continuous_target' and 'accepts_binary_target'.
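
A minimal sketch of how options 1 and 3 might look side by side. All class and helper names here (`BaseEncoder`, `SupervisedMixin`, the `Toy*` encoders, `requires_target`) are hypothetical and are not part of category_encoders or scikit-learn:

```python
class BaseEncoder:
    # Option 3: plain class attributes that subclasses override.
    accepts_continuous_target = False
    accepts_binary_target = False


class SupervisedMixin:
    """Option 1: marker class. Inheriting from it means fit() needs a target."""


class ToyOneHotEncoder(BaseEncoder):
    """Hypothetical encoder that ignores the target."""


class ToyTargetEncoder(SupervisedMixin, BaseEncoder):
    """Hypothetical encoder that accepts continuous and binary targets."""
    accepts_continuous_target = True
    accepts_binary_target = True


class ToyWOEEncoder(SupervisedMixin, BaseEncoder):
    """Hypothetical encoder that accepts only binary targets."""
    accepts_binary_target = True


def requires_target(encoder_cls):
    # Option 1 in action: learn the property from the inheritance hierarchy.
    return issubclass(encoder_cls, SupervisedMixin)
```

With this layout a test could ask `requires_target(ToyWOEEncoder)` or read `ToyWOEEncoder.accepts_continuous_target` instead of hard-coding encoder names.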

janmotl avatar Feb 06 '19 15:02 janmotl

The easy / hackish thing to do would be to inspect the signature and see whether y is a required argument. Estimator tags are hopefully coming to sklearn soon and will let you specify this kind of information, but they are not really standardized yet.
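
The signature check could be sketched roughly as follows, using only the standard library. The helper name `fit_requires_y` and the two toy classes are assumptions for illustration; the check only reflects how `fit` is declared, so an encoder that declares `y=None` but still rejects a missing target would be misclassified:

```python
import inspect


def fit_requires_y(estimator_cls):
    """Return True if fit() declares a parameter named y without a default."""
    params = inspect.signature(estimator_cls.fit).parameters
    y = params.get("y")
    return y is not None and y.default is inspect.Parameter.empty


class ToyUnsupervised:
    def fit(self, X, y=None):  # y is optional
        return self


class ToySupervised:
    def fit(self, X, y):  # y is required
        return self
```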

amueller avatar Feb 06 '19 16:02 amueller

I found it: https://scikit-learn.org/stable/developers/contributing.html#estimator-types

janmotl avatar Feb 06 '19 16:02 janmotl

Yes, but that doesn't tell you for a transformation whether it requires y to fit (I probably wrote that section).

amueller avatar Feb 06 '19 22:02 amueller

In that case it looks like you are the right person to discuss it.

I am not picky about the names used or the exact mechanism by which it is implemented. The required functionality could be implemented with the following tags:

  • [ ] supports continuous target
  • [ ] supports binomial target

I used the term "target" instead of the more common "label", because "target" works well for both continuous and discrete dependent variables, while "label" is arguably appropriate only for discrete dependent variables.

While some of the encoders require the target to take values {0,1}, I believe that they should be refactored in some distant future to support targets like {'no', 'yes'}, {'negative', 'positive'} or any other set of exactly two values. Hence, I prefer the more general and future-proof term "binomial" instead of "binary", which would suggest that the target takes only values {0,1}.
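
One way such a refactoring might start is a small helper that maps any two-valued target to {0,1} before the existing logic runs. This is only a sketch: the function name `binomial_to_01` and the convention that the larger value in sort order is the positive class are assumptions, not existing category_encoders behavior:

```python
def binomial_to_01(y, positive=None):
    """Map a target with exactly two distinct values to a list of 0/1.

    If `positive` is not given, the larger of the two values (in sort
    order) is treated as the positive class. That default is an assumption.
    """
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError(f"expected exactly two target values, got {len(classes)}")
    if positive is None:
        positive = classes[1]
    elif positive not in classes:
        raise ValueError(f"{positive!r} is not one of the target values")
    return [1 if value == positive else 0 for value in y]
```

For example, `binomial_to_01(['no', 'yes', 'no'])` gives `[0, 1, 0]`, and passing `positive='no'` flips the encoding.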

But of course, the terminology taken by auto-sklearn is nice as well.

I am not sure how to handle non-target encoders. Options:

  1. Introduce 'is_target_encoder' and 'is_encoder' as an analogy to 'is_classifier' and 'is_regressor'.
  2. Return false for both tags.
  3. Something else entirely.

Either way, the following tests in 'test_encoders' should not need to hard-code names of the encoders anymore:

  1. test_impact_encoders
  2. test_tmp_column_name
  3. test_unique_column_is_not_predictive
  4. test_get_feature_names
  5. test_get_feature_names_drop_invariant
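
A rough sketch of how tests could select encoders by tag rather than by name. The tag names follow the checklist above, but the `ENCODER_TAGS` table, its values, and the `select_encoders` helper are illustrative assumptions; non-target encoders use option 2 (false for both tags):

```python
# Illustrative tag table; in practice each encoder would expose its own tags.
ENCODER_TAGS = {
    "OneHotEncoder": {
        "supports_continuous_target": False,
        "supports_binomial_target": False,
    },
    "TargetEncoder": {
        "supports_continuous_target": True,
        "supports_binomial_target": True,
    },
    "WOEEncoder": {
        "supports_continuous_target": False,
        "supports_binomial_target": True,
    },
}


def select_encoders(**required):
    """Return the sorted names of encoders whose tags match every requirement."""
    return sorted(
        name
        for name, tags in ENCODER_TAGS.items()
        if all(tags.get(tag) == value for tag, value in required.items())
    )
```

The resulting list could then be fed to a parameterized test, so that adding a new encoder only means declaring its tags.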

There is also one more tag that could be useful:

  • [ ] supports unknown/new target values in the test set

This tag would be nice in the following test:

  1. test_handle_unknown_error

janmotl avatar Feb 07 '19 08:02 janmotl

Another useful piece of metadata could be:

  • [ ] supports inverse transform

This could help, for example, in test_inverse_transform.
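
Until such a tag exists, a test might approximate it by checking for the method itself. This is a sketch with hypothetical toy classes; the presence of a method is only a proxy, since an encoder may define `inverse_transform` and still not support every configuration:

```python
def supports_inverse_transform(encoder):
    """Return True if the encoder exposes a callable inverse_transform."""
    return callable(getattr(encoder, "inverse_transform", None))


class ToyReversibleEncoder:
    def inverse_transform(self, X):
        return X


class ToyOneWayEncoder:
    def transform(self, X):
        return X
```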

janmotl avatar Feb 11 '19 09:02 janmotl

Sorry for the slow reply. In sklearn we generally use the word target, and I don't think we use "binary" to mean 0 and 1; we use it to mean any two-class classification problem. But you can also extend the encoders to multiclass pretty easily, right?

amueller avatar Mar 12 '19 20:03 amueller

Good to know that we are aligned. The binary encoders can be extended to work on multiclass. But I would hesitate to call it easy.

I have also started calling target encoders "supervised" and non-target encoders "unsupervised". The advantage is that this terminology describes the encoders well and people are already familiar with the terms.

janmotl avatar Mar 13 '19 08:03 janmotl