feature_engine icon indicating copy to clipboard operation
feature_engine copied to clipboard

add new StringSimilarityEncoder transformer

Open glevv opened this issue 2 years ago • 5 comments

connected to #530

glevv avatar Sep 26 '22 15:09 glevv

Codecov Report

Merging #537 (330ce76) into main (780bdba) will increase coverage by 0.03%. The diff coverage is 98.66%.

@@            Coverage Diff             @@
##             main     #537      +/-   ##
==========================================
+ Coverage   96.84%   96.88%   +0.03%     
==========================================
  Files          89       89              
  Lines        3332     3400      +68     
  Branches      670      671       +1     
==========================================
+ Hits         3227     3294      +67     
  Misses         45       45              
- Partials       60       61       +1     
Impacted Files Coverage Δ
feature_engine/encoding/similarity_encoder.py 98.64% <98.64%> (ø)
feature_engine/encoding/__init__.py 100.00% <100.00%> (ø)
feature_engine/__init__.py

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov[bot] avatar Sep 26 '22 17:09 codecov[bot]

Example notebook https://github.com/feature-engine/feature-engine-examples/pull/11

glevv avatar Sep 27 '22 10:09 glevv

Should there be a copyright header?

glevv avatar Oct 04 '22 15:10 glevv

hould there be a copyright head

What do you mean by copyright header?

solegalli avatar Oct 05 '22 08:10 solegalli

hould there be a copyright head

What do you mean by copyright header?

For example this

https://github.com/feature-engine/feature_engine/blob/93cb37b6650f65281c44ec6d8555d3fb0e55a82c/feature_engine/encoding/one_hot.py#L1-L2

glevv avatar Oct 05 '22 09:10 glevv

Hi @GLevV

Thank you very much for the contributions and for taking the time to go through my questions.

FYI: I made this PR to your repo: https://github.com/GLevV/feature_engine/pull/2

There a a few things that we need to look at:

As it is, the encoder is failing one of feature-engine's general tests, because you can't add "" in fillna to a categorical variable (encoded as categorical, instead of object). The value "" needs to be added as a category first. See for example how we handle that here: https://github.com/feature-engine/feature_engine/blob/93cb37b6650f65281c44ec6d8555d3fb0e55a82c/feature_engine/imputation/categorical.py#L210-L216

The test test_nan_behaviour_impute tests that the nan are encoded. But it does not test the fact that nan are replaced by "". At first, this may seem trivial, but if for example, another contributor decides that we should change "" by "_", while the output of the transformer may still make sense, this could break backward compatibility. If we have a test, we are making sure that this change is made intentionally, and not accidentally. Could we find a way to be able to test this?

This test: test_nan_behaviour_ignore, if I understand correctly tests that we have the same number of nan in input and output. Ideally, we should test that the same observation was nan as input and nan as output. If transformer replaces this nan by something else, and introduces another nan, this test could still pass I think.

I do agree that this test might be a bit over the top. But would it be too hard to implement?

solegalli avatar Oct 11 '22 12:10 solegalli

  1. I updated pandas and the local error I was getting is now solved.
  2. I tried to think how to go about this one, but I could not come up with a good solution, so we can let this one go.
  3. I think this one is not too important.

I am not sure what your thoughts are, but I think this PR could be good to go after you merge.

Let me know what you think

Cheers

solegalli avatar Oct 12 '22 09:10 solegalli

  1. I updated pandas and the local error I was getting is now solved.

    1. I tried to think how to go about this one, but I could not come up with a good solution, so we can let this one go.

    2. I think this one is not too important.

I am not sure what your thoughts are, but I think this PR could be good to go after you merge.

Let me know what you think

Cheers

I think for the first one there is an easy solution for SimilarityEncoder - cast column to str before value_counts and fillna. I was doing it in previous iterations but decided to go with the other approach. I can revert it, if it's needed.

glevv avatar Oct 12 '22 11:10 glevv