category_encoders
category_encoders copied to clipboard
CountEncoder tests in tests/test_encoders.py are skipped
Expected Behavior
tests/test_encoders.py
should be running for all encoders.
Actual Behavior
CountEncoder is missing from __all__
in category_encoders/__init__.py
. This has the effect that tests are skipped.
11 of these tests fail for CountEncoder.
If CountEncoder is included, some of their tests are not going to pass: right now, CountEncoder doesn't implement some of the methods that are in the rest of the encoders, such as "get_feature_names".
I see, do you know if this is mentioned in the code or elsewhere? I could not find anything, so i suggest adding some remarks. Also I am just trying to implement the missing features and would be happy about some help.
Not helping with the code. But a few notes after looking at the documentation:
-
Consider renaming the argument value
'count'
inhandle_missing
to'value'
. Reasoning:'value'
string is used in other encoders, making it easier to experimentally test multiple encoders. And it allows you to usenormalized=True
argument without causing the conflict between the "count" string and what actually happens (calculation of the relative frequency?). -
Consider adding
'value'
value intohandle_unknown
, which would impute with0
, and set it as the default behaviour. Reasoning: It would improve compatibility with other encoders. -
The doc says
Defaults to None which uses NaN behaviour specified at fit time
. I am not sure that is correct (if nothing else,fit
method does not document the possibility to sethandle_unknown
). -
I am not sure why
_check_set_create_dict_attrs
has line'handle_unknown': 'count',
when'count'
can be currently set only inhandle_missing
and not inhandle_unknown
. -
The documentation repeatedly mentions
or dict of.
. A dict of what to what? -
The documentation mentions
The default name can be long ae may keep changing
. What isae
?
@janmotl thanks a lot, your input is very much appreciated and helpful to me.
I could fix most of the issues see PR #260.
1 + 2.) I renamed and added the option 'value'
as you suggested. I could get handle_missing to pass the tests but got stuck on handle_unknown. There is one test still failing test_encoders.py::TestEncoders::test_handle_unknown_value
.
3.) I think the Docs refer to the way how NaN is treated in fit
with handle_missing
.
4.) this was/is confusing for me also
5.) I amended the docs to dict of {column: option}
. As far as I understand you can choose a different option per column.
6.) I found this in a comment, it's just meant to be "and".
So the problems that still remain are handle_unknown_value and combine_min_nan_groups.
checking in with this after some while: Can we close this issue? As of today all tests work for count encoder
Great, yes we can close it!