category_encoders
                        WIP: Multi_hot encoder for ambiguous inputs
Summary
Implement the fit and transform functions of multi-hot encoding for ambiguous/dirty categorical features.
#161
I hope you will check its usefulness.
Nice tests. Just nitpicking:
- In test_multi_hot_fit, shouldn't:
self.assertEqual(enc.transform(X_t).shape[1],
                         enc.transform(X_t[X_t['extra'] != 'A']).shape[1],
                         'We have to get the same count of columns')
become
self.assertEqual(enc.transform(X_t).shape[1],
                         enc.transform(X_t[X_t['extra'] != 'D']).shape[1],  # Without the new value. Alternatively we can compare to: enc.transform(X).shape[1]
                         'We have to get the same count of columns')
?
first_extract_column = out.columns.str.extract("(.*)_[1-9]+").dropna()[0].unique()[0]
should possibly be:
first_extract_column = out.columns.str.extract("(.*)_[0-9]+").dropna()[0].unique()[0]
- The order of the suffixes in the output seems to be pseudorandom. I am OK with that if the order is guaranteed not to change. But I would prefer to put them in order.
- Argument use_cat_names seems to be ignored. I suggest either removing it or testing that it actually works.
- On Python 2.7, some of the tests fail.
Thank you for your reviews.
- The transformation test was based on test_one_hot.py. I inserted your suggestion into my test code.
- Suffixes start with 1, not 0, in the encoder, so it would be no problem to write out.columns.str.extract("(.*)_[1-9]+"). Missing values are encoded as other values.
- Oh, that was my mistake. I fixed the order of suffixes.
- I removed use_cat_names for simplicity.
- I fixed several bugs. All tests have passed in my latest commit.
Good work.
The transformation test was based on test_one_hot.py.
That's actually a mistake of mine in test_one_hot.py. I will fix it.
Suffixes start with 1, not 0, in the encoder, so it would be no problem to write out.columns.str.extract("(.*)_[1-9]+"). Missing values are encoded as other values.
I see. I was concerned about strings like "extra_10", which would not get captured. But it is merely a hypothetical concern.
I am getting a warning in both the Python 2 and Python 3 travis-ci reports:
/home/travis/build/scikit-learn-contrib/categorical-encoding/category_encoders/tests/test_multi_hot.py:78: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)
  extra_mask_columns = out.columns.str.extract("(extra_.*)_[1-9]+").dropna()
Maybe it could be silenced by setting the expand argument.
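For example, a minimal sketch of the silenced call from the test above, assuming out is the transformed DataFrame used in test_multi_hot.py:

# Passing expand=False keeps the current Index behaviour explicitly and
# silences the FutureWarning (pandas has accepted the expand keyword since 0.18).
extra_mask_columns = out.columns.str.extract("(extra_.*)_[1-9]+", expand=False).dropna()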
multiple_split_string and inv_map in get_dummies() look to be unused.
It looks like there are typos in the documentation of the examples: normilize -> normalize, numetic_dataset -> numeric_dataset, numetic_normalized_dataset -> numeric_normalized_dataset
I would consider moving create_boston_RAD and run_example into test_multi_hot.py. Or into examples directory and writing a note in the documentation where to find the example. But I am leaving it up to you - if you want it in the encoder file, it will be in the encoder.
@fullflu Please, check conformance of MultiHotEncoder to the changes in the master. All these changes were about handle_missing and handle_unknown arguments, which should be supported by all encoders. Note that not all the options have to be implemented. But the arguments should be there and the documentation should describe the default behaviour.
That's actually a mistake of mine in test_one_hot.py. I will fix it.
LGTM.
I see. I was concerned about strings like "extra_10", which would not get captured. But it is merely a hypothetical concern.
I tested it in my local environment and in a browser tool ( https://rubular.com/ ). I confirmed that the string 'extra_10' is extracted as 'extra'. It would be no problem.
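For example, a quick standalone check with Python's re module (not the test code itself):

import re

# The greedy "(.*)" backtracks until "_[1-9]+" can match, so "extra_10"
# still yields the captured prefix "extra".
for name in ["extra_1", "extra_9", "extra_10"]:
    print(name, "->", re.search(r"(.*)_[1-9]+", name).group(1))  # all print "extra"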
Maybe it could be silenced by setting the expand argument.
I will check this if necessary. How serious is this warning?
multiple_split_string and inv_map in get_dummies() look to be unused.
I removed them.
It looks like there are typos in the documentation of the examples: normilize -> normalize, numetic_dataset -> numeric_dataset, numetic_normalized_dataset -> numeric_normalized_dataset
They were my typos. I fixed them.
I would consider moving create_boston_RAD and run_example into test_multi_hot.py. Or into examples directory and writing a note in the documentation where to find the example. But I am leaving it up to you - if you want it in the encoder file, it will be in the encoder
It would be nice to put them in the examples directory. I will fix it before writing a blog post.
Please, check conformance of MultiHotEncoder to the changes in the master. All these changes were about handle_missing and handle_unknown arguments, which should be supported by all encoders. Note that not all the options have to be implemented. But the arguments should be there and the documentation should describe the default behaviour.
Oh, I had forked an old version of the code where missing_impute was used instead of handle_missing.
I added the handle_missing argument and wrote documentation about the handle_missing and handle_unknown arguments, but 3 tests (test_handle_missing_return_nan_test, test_handle_missing_return_nan_train and test_handle_unknown_return_nan) failed.
Should I consider the return_nan option of these arguments?
Good.
I will check this if necessary. How serious is this warning?
I just try to keep the test results free of errors and warnings; once I allow one warning, additional warnings tend to creep in.
Should I consider the return_nan option of these arguments?
Without looking at the code: isn't it enough to rename the ignore option to return_nan?
I just try to keep the test results free of errors and warnings; once I allow one warning, additional warnings tend to creep in.
I got it. I added the expand argument and the warning was silenced.
Without looking at the code: isn't it enough to rename the ignore option to return_nan?
In my code, that renaming is not enough.
I renamed ignore to value and reproduced the return_nan behaviour of OneHotEncoder.
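For illustration, a hedged sketch of the behaviour being reproduced, using the released OneHotEncoder (the exact column layout may differ between versions):

import numpy as np
import pandas as pd
from category_encoders import OneHotEncoder

train = pd.DataFrame({'extra': ['A', 'B', 'C']})
test = pd.DataFrame({'extra': ['A', 'D', np.nan]})  # 'D' is unknown, NaN is missing

enc = OneHotEncoder(cols=['extra'], handle_unknown='return_nan', handle_missing='return_nan')
enc.fit(train)
print(enc.transform(test))  # the unknown and missing rows come back as NaN in the dummy columns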
All tests have passed!
Awesome. Move the example. And I will merge it.
Note: Just write somewhere that | assumes a uniform distribution of the feature values. For example, when the data contain 1|2, the encoder assumes that there is a 0.5 probability that the real value is 1 and a 0.5 probability that the real value is 2, even when the distribution of 1 and 2 in the training data is distinctly non-uniform.
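A small plain-pandas illustration of that assumption (this is not the encoder's code, just the idea):

import pandas as pd

X = pd.Series(['1', '2', '1|2'], name='RAD')

# Each cell is split on '|' and every candidate value gets an equal share,
# regardless of how often '1' or '2' appear in the training data.
rows = []
for cell in X:
    values = cell.split('|')
    rows.append({'RAD_' + v: 1.0 / len(values) for v in values})

print(pd.DataFrame(rows).fillna(0.0))
#    RAD_1  RAD_2
# 0    1.0    0.0
# 1    0.0    1.0
# 2    0.5    0.5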
Can you also write somewhere a real-world application of this encoder? When does it happen that we know, e.g.: 1|2? And can you illustrate on the blog that multi-hot-encoder can beat one-hot-encoder by a lot? I just cannot wait to see the post :)
Just write somewhere that | assumes a uniform distribution of the feature values. For example, when the data contain 1|2, the encoder assumes that there is a 0.5 probability that the real value is 1 and a 0.5 probability that the real value is 2, even when the distribution of 1 and 2 in the training data is distinctly non-uniform
Exactly, that is a very important point. I added a prior option to solve the problem:
- train: each probability is estimated from the input data when fitting
- uniform: each probability is assumed to be uniform, as in my original implementation

I also added a default_prior option, which can be used together with the prior option.
- If a column is included in the default_prior dictionary, its prior is fixed to the given default_prior value.
- Otherwise, its prior is calculated according to the prior option.

The probability distribution problem described above should be solved by these options.
(Since the name 'prior' may be confusing, it will be renamed if necessary.)
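For example, a small illustration (plain pandas, not the encoder's internals) of what prior='train' would mean for a '1|2' cell; with default_prior, the shares for the listed columns would instead come from the user-supplied dictionary:

import pandas as pd

train = pd.Series(['1', '1', '1', '2'], name='RAD')

# With prior='train', the split shares follow the empirical frequencies seen
# when fitting, so '1|2' is no longer a fixed 0.5/0.5 split.
counts = train.value_counts()
candidates = ['1', '2']                     # candidate values parsed from '1|2'
weights = counts.reindex(candidates).fillna(0)
print(dict(weights / weights.sum()))        # {'1': 0.75, '2': 0.25}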
Can you also write somewhere a real-world application of this encoder? When does it happen that we know, e.g.: 1|2? And can you illustrate on the blog that multi-hot-encoder can beat one-hot-encoder by a lot? I just cannot wait to see the post :)
I'm trying to compare the encoder performance using the Boston and Titanic datasets, but the comparison is currently based on artificial preprocessing that masks several rows. I'm sorry that I have not yet found a good real-world dataset that is freely available and naturally contains ambiguous categorical features... (I will keep searching for such a dataset.)
[FYI]: In my experience, the situation in which ambiguous features are obtained was caused by a change in the data-acquisition process. After a certain day, the granularity of a feature actually changed. I believe such dirty features are generated in various business fields.
Nice. I like the use of assert_frame_equal() (I didn't know that it existed). And that you wrote the default settings in the documentation.
I propose to rename the optional argument names to something like:
- distribution, which accepts {'uniform', 'prior'}
- prior (if not provided, it is estimated from the training data as the arithmetic mean of the target for each attribute value)
But I am leaving the final pick up to you.
In my experience, the situation in which ambiguous features are obtained was caused by a change in the data-acquisition process. After a certain day, the granularity of a feature actually changed. I believe such dirty features are generated in various business fields.
That's a nice example.
Thank you for your nice suggestion.
This is the next plan:
- Coming soon (within a few weeks at the latest)
  - rename multiple_split_string to or_delimiter
  - rename and fix the prior-related options as you suggested
  - create examples and blog-like posts
- Future enhancement
  - implement and_delimiter
  - integrate missing-value imputation methods (if possible)
  - integrate information theoretic methods (if possible)
Details of future enhancement
The ambiguity problem is inherently a kind of missing-value imputation problem. The encoding method I have implemented is based on the empirical distribution; however, other machine-learning-based imputation methods could be integrated with this delimiter-based multi-hot encoding. I hope someone who is interested in this topic will contribute such new encoding methods in the future!
implement and_delimiter
How is it going to work? Is it similar to TfidfVectorizer or CountVectorizer? A potentially useful dataset for the functionality illustration: data, description. In my opinion, it is a pretty dirty dataset. But if the encoder is going to work well on this, it's likely going to work well on many other datasets.
integrate missing-value imputation methods
It's up to you. The canonical solution is to propagate NaN to the output, and then use some canned solution for missing-value imputation. But I can imagine missing-value treatments that would not work without seeing the raw data.
integrate information theoretic methods
The canonical solution for change of granularity in the data would be to use a hierarchical model (a.k.a mixed model). But there are many alternatives.
rename and fix the prior-related options as you suggested
Although your suggestion is very clean, I found that your proposed prior-related options would sacrifice some flexibility. That is why I want to keep the current prior options. The details are described below. Feel free to correct me if the description is wrong.
The options that I implemented can cover 5 cases:
- all cols are transformed by the uniform distribution (prior is 'uniform' and default_prior is None)
- all cols are transformed by the empirical distribution (prior is 'train' and default_prior is None)
- all cols are transformed by default_prior (all cols are included in default_prior)
- several cols are transformed by default_prior and the others by the empirical distribution (prior is 'train' and default_prior is not None)
- several cols are transformed by default_prior and the others by the uniform prior (prior is 'uniform' and default_prior is not None)

Your suggestion would not be able to cover the 5th case above.
This is a tradeoff between flexibility and simplicity. If there is another way to achieve both flexibility and simplicity, I would adopt it.
How is it going to work? Is it similar to TfidfVectorizer or CountVectorizer?
I have imagined something simpler than those.
For each column, all rows that contain the and_delimiter are transformed by the multi-hot encoder without normalization. This process reflects the meaning 'A and B', and it would be easy to implement, as sketched below.
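A small illustration of the intended difference between the two delimiters (plain pandas; using '&' as the and_delimiter is only an assumption here):

import pandas as pd

def encode_cell(cell):
    # 'A&B' means both values apply: each one keeps a full share of 1.0.
    # 'A|B' means one of them applies: the shares are normalized to sum to 1.
    if '&' in cell:
        values, share = cell.split('&'), 1.0
    else:
        values = cell.split('|')
        share = 1.0 / len(values)
    return {'col_' + v: share for v in values}

print(pd.DataFrame([encode_cell(c) for c in ['A', 'A|B', 'A&B']]).fillna(0.0))
#    col_A  col_B
# 0    1.0    0.0
# 1    0.5    0.5
# 2    1.0    1.0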
integrate missing-value imputation methods; integrate information theoretic methods
The canonical solutions you described sound nice. Since I do not have a good solution for these yet, I will survey related work. These two items may stay out of scope unless a good paper is found.
dictionary used as prior (hyperprior is [1,1,1,...],...
Nice touch with the hyperprior.
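If I read the [1,1,1,...] hyperprior correctly, it amounts to additive (Laplace) smoothing of the empirical counts; a small sketch of that interpretation:

import numpy as np

counts = np.array([3, 1, 0])      # observed counts for the values '1', '2', '3'
hyperprior = np.array([1, 1, 1])  # one pseudo-count per value

# Without the pseudo-counts, the unseen value '3' would get probability 0.
print((counts + hyperprior) / float((counts + hyperprior).sum()))
# [0.5714... 0.2857... 0.1428...]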
Hi, @fullflu. Is there something I can help you with?