
WIP: Multi_hot encoder for ambiguous inputs

Open fullflu opened this issue 6 years ago • 18 comments

Summary

Implement the fit and transform functions of multi-hot encoding for ambiguous|dirty categorical features.

#161

I hope you will check its usefulness.
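
For context, a minimal sketch of the kind of input this targets and the intended output. The delimiter, the numeric column suffixes, and the 0.5/0.5 split follow the discussion below; the exact API is still work in progress.

import pandas as pd

# An ambiguous/dirty categorical column: some rows carry several candidate
# values joined by a delimiter such as "|".
X = pd.DataFrame({"extra": ["A", "B", "A|B", "C"]})

# The intended multi-hot output spreads the weight of an ambiguous row over
# its candidate categories (illustrative layout; suffixes start at 1):
#    extra_1  extra_2  extra_3
# 0      1.0      0.0      0.0   # "A"
# 1      0.0      1.0      0.0   # "B"
# 2      0.5      0.5      0.0   # "A|B"
# 3      0.0      0.0      1.0   # "C"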

fullflu avatar Jan 02 '19 14:01 fullflu

Nice tests. Just nitpicking:

  1. In test_multi_hot_fit shouldn't:
self.assertEqual(enc.transform(X_t).shape[1],
                         enc.transform(X_t[X_t['extra'] != 'A']).shape[1],
                         'We have to get the same count of columns')

become

self.assertEqual(enc.transform(X_t).shape[1],
                         enc.transform(X_t[X_t['extra'] != 'D']).shape[1],  # Without the new value. Alternatively we can compare to: enc.transform(X).shape[1]
                         'We have to get the same count of columns')

?

  2. Similarly,

first_extract_column = out.columns.str.extract("(.*)_[1-9]+").dropna()[0].unique()[0]

should possibly be:

first_extract_column = out.columns.str.extract("(.*)_[0-9]+").dropna()[0].unique()[0]
  3. The order of the suffixes in the output seems to be pseudorandom. I am ok with that if the order is guaranteed not to change. But I would prefer to put them in order.

  4. The use_cat_names argument seems to be ignored. I suggest either removing it or testing that it actually works.

  5. On Python 2.7, some of the tests fail.

janmotl avatar Jan 02 '19 19:01 janmotl

Thank you for your reviews.

  1. The transformation test was based on test_one_hot.py. I inserted your suggestion into my test code.

  2. Suffixes start with 1, not 0 in the encoder, so there is no problem with writing out.columns.str.extract("(.*)_[1-9]+"). Missing values are encoded like other values.

  3. Oh, that was my mistake. I fixed the order of suffixes.

  4. I removed use_cat_names for simplicity.

  5. I fixed several bugs. All tests have passed in my latest commit.

fullflu avatar Jan 04 '19 08:01 fullflu

Good work.

The transformation test was based on test_one_hot.py.

That's actually a mistake of mine in test_one_hot.py. I will fix it.

Suffixes start with 1, not 0 in the encoder, so there is no problem with writing out.columns.str.extract("(.*)_[1-9]+"). Missing values are encoded like other values.

I see. I was concerned about strings like "extra_10", which would not get captured. But it is merely a hypothetical concern.

I am getting a warning in both the Python 2 and Python 3 Travis CI reports:

/home/travis/build/scikit-learn-contrib/categorical-encoding/category_encoders/tests/test_multi_hot.py:78: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)
  extra_mask_columns = out.columns.str.extract("(extra_.*)_[1-9]+").dropna()

Maybe it could be silenced by setting the expand argument.
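
For example, the quoted line could become the following (a sketch against the existing test code; expand=False keeps the current Index-returning behaviour while silencing the warning):

extra_mask_columns = out.columns.str.extract("(extra_.*)_[1-9]+", expand=False).dropna()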

multiple_split_string and inv_map in get_dummies() look to be unused.

It looks like there are typos in the documentation of the examples: normilize -> normalize, numetic_dataset -> numeric_dataset, numetic_normalized_dataset -> numeric_normalized_dataset.

I would consider moving create_boston_RAD and run_example into test_multi_hot.py. Or into the examples directory, with a note in the documentation on where to find the example. But I am leaving it up to you - if you want it in the encoder file, it will be in the encoder.

janmotl avatar Jan 04 '19 10:01 janmotl

@fullflu Please check the conformance of MultiHotEncoder to the changes in master. All these changes were about the handle_missing and handle_unknown arguments, which should be supported by all encoders. Note that not all of the options have to be implemented, but the arguments should be there and the documentation should describe the default behaviour.

janmotl avatar Jan 04 '19 10:01 janmotl

That's actually a mistake of mine in test_one_hot.py. I will fix it.

LGTM.

I see. I was concerned about strings like "extra_10", which would not get captured. But it is merely a hypothetical concern.

I tested it in my local environment and in the browser (https://rubular.com/). I confirmed that the string 'extra_10' is extracted as 'extra', so it should be no problem.
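
A quick way to check that behaviour with plain pandas, independent of the encoder:

import pandas as pd

cols = pd.Index(["extra_1", "extra_2", "extra_10"])
print(cols.str.extract(r"(.*)_[1-9]+", expand=False))
# Index(['extra', 'extra', 'extra'], dtype='object')
# "extra_10" still yields "extra": the greedy "(.*)" backtracks to "extra",
# and "[1-9]+" only needs to match the leading "1" of the suffix.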

Maybe it could be silenced by setting the expand argument.

I will check this if necessary. How serious is this warning?

multiple_split_string and inv_map in get_dummies() look to be unused.

I removed them.

It looks like there are typos in the documentation of the examples: normilize -> normalize, numetic_dataset -> numeric_dataset, numetic_normalized_dataset -> numeric_normalized_dataset.

They were my typos. I fixed them.

I would consider moving create_boston_RAD and run_example into test_multi_hot.py. Or into the examples directory, with a note in the documentation on where to find the example. But I am leaving it up to you - if you want it in the encoder file, it will be in the encoder.

It would be nice to put them in the examples directory. I will do that before writing a blog post.

Please check the conformance of MultiHotEncoder to the changes in master. All these changes were about the handle_missing and handle_unknown arguments, which should be supported by all encoders. Note that not all of the options have to be implemented, but the arguments should be there and the documentation should describe the default behaviour.

Oh, I had forked an old version of the code where missing_impute was used instead of handle_missing. I added the handle_missing argument and wrote documentation about the handle_missing and handle_unknown arguments, but 3 tests (test_handle_missing_return_nan_test, test_handle_missing_return_nan_train and test_handle_unknown_return_nan) failed. Should I consider the return_nan option of these arguments?

fullflu avatar Jan 04 '19 15:01 fullflu

Good.

I will check this if necessary. How serious is this warning?

I just attempt to keep the test results free of errors and warnings - once I allow one warning, additional warnings tend to creep in.

Should I consider the return_nan option of these arguments?

Without looking at the code: isn't it enough to rename the ignore option to return_nan?

janmotl avatar Jan 04 '19 15:01 janmotl

I just attempt to keep the test results free of errors and warnings - once I allow one warning, additional warnings tend to creep in.

Got it. I added the expand argument and the warning was silenced.

Without looking at the code: isn't it enough to rename the ignore option to return_nan?

In my code, that renaming is not enough. I renamed ignore to value and reproduced the return_nan behaviour of OneHotEncoder. All tests have passed!
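
For reference, a minimal sketch of the behaviour being reproduced, using the existing OneHotEncoder (option names follow this discussion; details may vary between category_encoders versions):

import pandas as pd
import category_encoders as ce

train = pd.DataFrame({"extra": ["A", "B", "A"]})
test = pd.DataFrame({"extra": ["A", "D"]})  # "D" was never seen during fit

enc = ce.OneHotEncoder(cols=["extra"], handle_unknown="return_nan").fit(train)
print(enc.transform(test))
# The row with the unknown value "D" is expected to come out as NaN in the
# encoded columns; MultiHotEncoder reproduces this behaviour.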

fullflu avatar Jan 04 '19 16:01 fullflu

Awesome. Move the example. And I will merge it.

Note: Just write somewhere that | assumes uniform distribution of the feature values. For example, when the data contain 1|2, the encoder assumes that there is 0.5 probability that the real value is 1 and 0.5 probability that the real value is 2, even when the distribution of 1 and 2 in the training data is distinctly non-uniform.

Can you also write somewhere about a real-world application of this encoder? When does it happen that we know, e.g., 1|2? And can you illustrate on the blog that the multi-hot encoder can beat the one-hot encoder by a lot? I just cannot wait to see the post :)

janmotl avatar Jan 05 '19 08:01 janmotl

Just write somewhere that | assumes uniform distribution of the feature values. For example, when the data contain 1|2, the encoder assumes that there is 0.5 probability that the real value is 1 and 0.5 probability that the real value is 2, even when the distribution of 1 and 2 in the training data is distinctly non-uniform

Exactly, that is a very important point. I added a prior option to solve the problem:

  • train: each probability is estimated from the input data during fitting
  • uniform: each probability is assumed to be uniform, as in the current implementation

I also added a default_prior option, which can be used with the prior option at the same time.

  • If a column is included in the default_prior dictionary, its prior is fixed to the given default_prior.
  • Otherwise, its prior is calculated according to the prior option.

These options should solve the probability-distribution problem.

(Since the name 'prior' may be confusing, it will be renamed if necessary.)
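
A small worked example of the difference between the two settings (the counts and the value "1|2" are hypothetical):

from fractions import Fraction

value = "1|2"
candidates = value.split("|")

# Hypothetical training counts for this column: "1" is far more common than "2".
counts = {"1": 9, "2": 1}

# prior='uniform': every candidate gets the same weight -> {'1': 1/2, '2': 1/2}
uniform = {v: Fraction(1, len(candidates)) for v in candidates}

# prior='train': weights follow the empirical distribution -> {'1': 9/10, '2': 1/10}
total = sum(counts[v] for v in candidates)
train = {v: Fraction(counts[v], total) for v in candidates}

print(uniform, train)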

Can you also write somewhere about a real-world application of this encoder? When does it happen that we know, e.g., 1|2? And can you illustrate on the blog that the multi-hot encoder can beat the one-hot encoder by a lot? I just cannot wait to see the post :)

I'm trying to compare the encoder performance using the Boston and Titanic datasets, but the comparison is currently based on artificial preprocessing that masks several rows. I'm sorry that I have not found a good real-world dataset that is freely available and naturally contains ambiguous categorical features... (I will keep searching for such a dataset.)

[FYI]: In my experience, ambiguous features arose from a change in the data-acquisition process. After a certain day, the granularity of a feature actually changed. I believe that such dirty features are generated in various business fields.

fullflu avatar Jan 06 '19 11:01 fullflu

Nice. I like the use of assert_frame_equal() (I didn't know that it existed). And that you wrote the default settings in the documentation.

I propose renaming the optional arguments to something like:

  1. distribution, which accepts {'uniform', 'prior'}
  2. prior (if not provided, it is estimated from the training data as arithmetic mean of the target for each attribute value)

But I am leaving the final pick up to you.

In my experience, ambiguous features arose from a change in the data-acquisition process. After a certain day, the granularity of a feature actually changed. I believe that such dirty features are generated in various business fields.

That's a nice example.

janmotl avatar Jan 06 '19 14:01 janmotl

Thank you for your nice suggestion.

This is the next plan:

  • Coming soon (within a few weeks at the latest)
    • rename multiple_split_string to or_delimiter
    • rename and fix the prior-related options as you suggested
    • create examples and blog-like posts
  • Future enhancement
    • implement and_delimiter
    • integrate missing-value imputation methods (if possible)
    • integrate information theoretic methods (if possible)

Details of the future enhancements

The ambiguity problem is inherently related to the missing-value imputation problem. The encoding method that I have implemented is based on the empirical distribution; however, other machine-learning-based imputation methods could be integrated with this delimiter-based multi-hot encoding. I hope that someone interested in this topic will contribute such new encoding methods in the future!

fullflu avatar Jan 07 '19 07:01 fullflu

implement and_delimiter

How is it going to work? Is it similar to TfidfVectorizer or CountVectorizer? A potentially useful dataset for illustrating the functionality: data, description. In my opinion, it is a pretty dirty dataset. But if the encoder is going to work well on this, it's likely going to work well on many other datasets.

integrate missing-value imputation methods

It's up to you. The canonical solution is to propagate NaN to the output and then use some canned solution for missing-value imputation. But I can imagine missing-value treatments that would not work without seeing the raw data.
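
A minimal sketch of that canonical pipeline, assuming the encoder propagates NaN and using scikit-learn's SimpleImputer as the canned imputation step:

import numpy as np
from sklearn.impute import SimpleImputer

# Suppose the encoded output contains a NaN row for a missing/unknown input.
X_encoded = np.array([
    [1.0, 0.0],
    [0.5, 0.5],
    [np.nan, np.nan],  # propagated missing value
])

# A canned imputation step fills it in afterwards, e.g. with column means.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_encoded)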

integrate information theoretic methods

The canonical solution for a change of granularity in the data would be to use a hierarchical model (a.k.a. a mixed model). But there are many alternatives.

janmotl avatar Jan 07 '19 11:01 janmotl

rename and fix the prior-related options as you suggested

Although your suggestion is very cool, I found that your prior-related options would reduce the flexibility. That is why I want to keep the current prior options. The details are described below; feel free to correct me if the description is wrong.

The options that I implemented cover 5 cases:

  1. all cols are transformed by the uniform distribution (prior is 'uniform' and default_prior is None)
  2. all cols are transformed by the empirical distribution (prior is 'train' and default_prior is None)
  3. all cols are transformed by default_prior (all cols are included in default_prior)
  4. several cols are transformed by default_prior and the others by the empirical distribution (prior is 'train' and default_prior is not None)
  5. several cols are transformed by default_prior and the others by the uniform prior (prior is 'uniform' and default_prior is not None)

Your suggestion would not be able to cover the 5th case above.

This is a tradeoff between flexibility and simplicity. If there is another way that achieves both flexibility and simplicity, I would adopt it.
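
A minimal sketch of how two of the cases above might look in code (MultiHotEncoder's signature and the list format of default_prior are assumptions based on this discussion, not a released API):

from category_encoders import MultiHotEncoder  # hypothetical import from this PR's branch

# Case 2: every column uses the empirical (training) distribution.
enc = MultiHotEncoder(cols=["extra"], prior="train")

# Case 5: "extra" uses a fixed prior, the remaining columns fall back to the
# uniform distribution.
enc = MultiHotEncoder(
    cols=["extra", "RAD"],
    prior="uniform",
    default_prior={"extra": [0.7, 0.2, 0.1]},  # assumed format: one weight per category
)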

fullflu avatar Jan 07 '19 14:01 fullflu

How is it going to work? Is it similar to TfidfVectorizer or CountVectorizer?

I have imagined something simpler than those. For each column, all rows that contain the and_delimiter are transformed by the multi-hot encoder without normalization. This would reflect the meaning 'A and B', and it would be easy to implement.
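
A toy sketch of the intended difference between the two delimiters (the "&" delimiter, the normalization rule, and the helper below are assumptions based on the description above):

def encode(value, categories, normalize):
    """Toy illustration: spread a delimited value over indicator columns."""
    parts = value.split("|") if "|" in value else value.split("&")
    weight = 1.0 / len(parts) if normalize else 1.0
    return [weight if c in parts else 0.0 for c in categories]

print(encode("A|B", ["A", "B", "C"], normalize=True))   # [0.5, 0.5, 0.0]  (or_delimiter)
print(encode("A&B", ["A", "B", "C"], normalize=False))  # [1.0, 1.0, 0.0]  (and_delimiter)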

integrate missing-value imputation methods
integrate information theoretic methods

The canonical solutions you described sound nice. Since I do not have a good solution for these problems myself, I will survey related work. These two methods may be out of scope unless a good paper is found.

fullflu avatar Jan 07 '19 14:01 fullflu

dictionary used as prior (hyperprior is [1,1,1,...],...

Nice touch with the hyperprior.

janmotl avatar Jan 07 '19 18:01 janmotl

Hi, @fullflu. Is there something I can help you with?

janmotl avatar Feb 25 '19 13:02 janmotl