category_encoders icon indicating copy to clipboard operation
category_encoders copied to clipboard

CountEncoder tests in tests/test_encoders.py are skipped

Open pyrrhull opened this issue 4 years ago • 5 comments

Expected Behavior

tests/test_encoders.py should be running for all encoders.

Actual Behavior

CountEncoder is missing from __all__ in category_encoders/__init__.py. This has the effect that tests are skipped. 11 of these tests fail for CountEncoder.

pyrrhull avatar Jul 02 '20 22:07 pyrrhull

If CountEncoder is included, some of their tests are not going to pass: right now, CountEncoder doesn't implement some of the methods that are in the rest of the encoders, such as "get_feature_names".

lbcommer avatar Jul 03 '20 09:07 lbcommer

I see, do you know if this is mentioned in the code or elsewhere? I could not find anything, so i suggest adding some remarks. Also I am just trying to implement the missing features and would be happy about some help.

pyrrhull avatar Jul 03 '20 10:07 pyrrhull

Not helping with the code. But a few notes after looking at the documentation:

  1. Consider renaming the argument value 'count' in handle_missing to 'value'. Reasoning: 'value' string is used in other encoders, making it easier to experimentally test multiple encoders. And it allows you to use normalized=True argument without causing the conflict between the "count" string and what actually happens (calculation of the relative frequency?).

  2. Consider adding 'value' value into handle_unknown, which would impute with 0, and set it as the default behaviour. Reasoning: It would improve compatibility with other encoders.

  3. The doc says Defaults to None which uses NaN behaviour specified at fit time. I am not sure that is correct (if nothing else, fit method does not document the possibility to set handle_unknown).

  4. I am not sure why _check_set_create_dict_attrs has line 'handle_unknown': 'count', when 'count' can be currently set only in handle_missing and not in handle_unknown.

  5. The documentation repeatedly mentions or dict of.. A dict of what to what?

  6. The documentation mentions The default name can be long ae may keep changing. What is ae?

janmotl avatar Jul 03 '20 11:07 janmotl

@janmotl thanks a lot, your input is very much appreciated and helpful to me.

pyrrhull avatar Jul 03 '20 12:07 pyrrhull

I could fix most of the issues see PR #260. 1 + 2.) I renamed and added the option 'value' as you suggested. I could get handle_missing to pass the tests but got stuck on handle_unknown. There is one test still failing test_encoders.py::TestEncoders::test_handle_unknown_value. 3.) I think the Docs refer to the way how NaN is treated in fit with handle_missing. 4.) this was/is confusing for me also 5.) I amended the docs to dict of {column: option}. As far as I understand you can choose a different option per column. 6.) I found this in a comment, it's just meant to be "and".

So the problems that still remain are handle_unknown_value and combine_min_nan_groups.

pyrrhull avatar Jul 10 '20 14:07 pyrrhull

checking in with this after some while: Can we close this issue? As of today all tests work for count encoder

PaulWestenthanner avatar Jan 06 '23 13:01 PaulWestenthanner

Great, yes we can close it!

pyrrhull avatar Jan 09 '23 20:01 pyrrhull