category_encoders
category_encoders copied to clipboard
CountEncoder tests in tests/test_encoders.py are skipped
Expected Behavior
tests/test_encoders.py should be running for all encoders.
Actual Behavior
CountEncoder is missing from __all__ in category_encoders/__init__.py. This has the effect that tests are skipped.
11 of these tests fail for CountEncoder.
If CountEncoder is included, some of their tests are not going to pass: right now, CountEncoder doesn't implement some of the methods that are in the rest of the encoders, such as "get_feature_names".
I see, do you know if this is mentioned in the code or elsewhere? I could not find anything, so i suggest adding some remarks. Also I am just trying to implement the missing features and would be happy about some help.
Not helping with the code. But a few notes after looking at the documentation:
-
Consider renaming the argument value
'count'inhandle_missingto'value'. Reasoning:'value'string is used in other encoders, making it easier to experimentally test multiple encoders. And it allows you to usenormalized=Trueargument without causing the conflict between the "count" string and what actually happens (calculation of the relative frequency?). -
Consider adding
'value'value intohandle_unknown, which would impute with0, and set it as the default behaviour. Reasoning: It would improve compatibility with other encoders. -
The doc says
Defaults to None which uses NaN behaviour specified at fit time. I am not sure that is correct (if nothing else,fitmethod does not document the possibility to sethandle_unknown). -
I am not sure why
_check_set_create_dict_attrshas line'handle_unknown': 'count',when'count'can be currently set only inhandle_missingand not inhandle_unknown. -
The documentation repeatedly mentions
or dict of.. A dict of what to what? -
The documentation mentions
The default name can be long ae may keep changing. What isae?
@janmotl thanks a lot, your input is very much appreciated and helpful to me.
I could fix most of the issues see PR #260.
1 + 2.) I renamed and added the option 'value' as you suggested. I could get handle_missing to pass the tests but got stuck on handle_unknown. There is one test still failing test_encoders.py::TestEncoders::test_handle_unknown_value.
3.) I think the Docs refer to the way how NaN is treated in fit with handle_missing.
4.) this was/is confusing for me also
5.) I amended the docs to dict of {column: option}. As far as I understand you can choose a different option per column.
6.) I found this in a comment, it's just meant to be "and".
So the problems that still remain are handle_unknown_value and combine_min_nan_groups.
checking in with this after some while: Can we close this issue? As of today all tests work for count encoder
Great, yes we can close it!