feature_engine
add new StringSimilarityEncoder transformer
connected to #530
Codecov Report
Merging #537 (330ce76) into main (780bdba) will increase coverage by 0.03%. The diff coverage is 98.66%.

@@            Coverage Diff             @@
##             main     #537      +/-   ##
==========================================
+ Coverage   96.84%   96.88%   +0.03%
==========================================
  Files          89       89
  Lines        3332     3400      +68
  Branches      670      671       +1
==========================================
+ Hits         3227     3294      +67
  Misses         45       45
- Partials       60       61       +1

Impacted Files | Coverage Δ | |
---|---|---|
feature_engine/encoding/similarity_encoder.py | 98.64% <98.64%> (ø) | |
feature_engine/encoding/__init__.py | 100.00% <100.00%> (ø) | |
feature_engine/__init__.py | | |
Example notebook https://github.com/feature-engine/feature-engine-examples/pull/11
Should there be a copyright header?
> Should there be a copyright header?

What do you mean by copyright header?
For example this
https://github.com/feature-engine/feature_engine/blob/93cb37b6650f65281c44ec6d8555d3fb0e55a82c/feature_engine/encoding/one_hot.py#L1-L2
Hi @GLevV
Thank you very much for the contributions and for taking the time to go through my questions.
FYI: I made this PR to your repo: https://github.com/GLevV/feature_engine/pull/2
There are a few things we need to look at:
As it is, the encoder is failing one of feature-engine's general tests, because you can't use fillna("") on a categorical variable (encoded as categorical, instead of object): the value "" needs to be added as a category first. See for example how we handle that here: https://github.com/feature-engine/feature_engine/blob/93cb37b6650f65281c44ec6d8555d3fb0e55a82c/feature_engine/imputation/categorical.py#L210-L216
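To illustrate the failure outside the encoder, here is a minimal pandas sketch (the data is made up; the exact exception type varies across pandas versions):

```python
import pandas as pd

s = pd.Series(["dog", "cat", None], dtype="category")

# Filling with "" fails while "" is not one of the registered categories
# (pandas raises ValueError or TypeError depending on the version).
try:
    s.fillna("")
except (TypeError, ValueError):
    print("cannot fill with a value that is not a category")

# Registering "" as a category first makes the fill valid.
s = s.cat.add_categories([""])
filled = s.fillna("")
print(filled.tolist())  # ['dog', 'cat', '']
```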
The test test_nan_behaviour_impute
checks that the NaN are encoded, but it does not check that the NaN are replaced by "".
At first, this may seem trivial, but if, for example, another contributor decides that we should change "" to "_", the output of the transformer may still make sense, yet the change could break backward compatibility. If we have a test, we are making sure that this change is made intentionally, and not accidentally. Could we find a way to test this?
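A sketch of what such a test could look like (the helper is a hypothetical stand-in for the encoder's internal NaN handling, not the actual implementation):

```python
import numpy as np
import pandas as pd

FILL_VALUE = ""  # the convention under test; changing it to "_" should fail here


def fill_like_encoder(s: pd.Series) -> pd.Series:
    # hypothetical stand-in for the encoder's internal NaN handling
    return s.fillna(FILL_VALUE)


def test_nan_is_replaced_by_empty_string():
    s = pd.Series(["dog", np.nan, "cat"])
    out = fill_like_encoder(s)
    # assert the exact placeholder, not merely the absence of NaN
    assert out.isna().sum() == 0
    assert out.iloc[1] == ""


test_nan_is_replaced_by_empty_string()
print("ok")
```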
The test test_nan_behaviour_ignore
, if I understand correctly, checks that we have the same number of NaN in the input and the output. Ideally, we should check that the same observation that was NaN in the input is NaN in the output. If the transformer replaced that NaN with something else and introduced another NaN elsewhere, this test could still pass, I think.
I do agree that this test might be a bit over the top. But would it be too hard to implement?
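The stricter check is probably just a boolean-mask comparison; here is a sketch with made-up data, using the identity transform as a placeholder for the encoder under missing_values="ignore":

```python
import numpy as np
import pandas as pd


def test_nan_positions_match():
    s_in = pd.Series(["dog", np.nan, "cat", np.nan])
    # stand-in for the transformer output; the real test would call
    # the fitted encoder's transform() here
    s_out = s_in.copy()
    # compare the NaN masks element-wise: the same count is not enough,
    # the NaN must sit in the same rows
    assert s_in.isna().equals(s_out.isna())


test_nan_positions_match()
```

Note that a count-only check would not catch a NaN that moved rows, while the mask comparison does.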
- I updated pandas and the local error I was getting is now solved.
- I tried to think how to go about this one, but I could not come up with a good solution, so we can let this one go.
- I think this one is not too important.
I am not sure what your thoughts are, but I think this PR could be good to go after you merge.
Let me know what you think
Cheers
I think for the first one there is an easy solution for StringSimilarityEncoder: cast the column to str before value_counts and fillna. I was doing that in previous iterations but decided to go with the other approach. I can revert it if needed.
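For reference, a minimal sketch of that workaround (made-up data; shown with astype(object) rather than astype(str), since astype(str) would turn NaN into the literal string "nan" before fillna gets a chance to run):

```python
import pandas as pd

s = pd.Series(["a", "b", None, "a"], dtype="category")

# Casting away the categorical dtype lifts the fillna restriction:
# an object column accepts any fill value, and value_counts works the same.
s_obj = s.astype(object).fillna("")

counts = s_obj.value_counts()
print(counts["a"], counts[""])  # 2 1
```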