category_encoders
Multi-hot encoding for ambiguous input
I propose implementing a simple multi-hot encoder that accepts ambiguous input and outputs non-negative values.
Let x_j be the department of a student. Usually we assume that x_j is known without ambiguity, such as mathematics or physics. In real-world datasets, however, we sometimes know only an ambiguous value, such as sciences. I want to encode such ambiguous categorical features.
I have implemented the fit and transform functions and tested them on several cases (WIP). I focused on the case in which an ambiguous value is written with a delimiter, as in 'mathematics|physics', which means that x_j is either mathematics or physics.
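To make the idea concrete, here is a minimal pure-Python sketch of the 'or'-delimiter case described above. It is not the code from the pull request: the function names `fit_categories` and `transform` and the default `delimiter="|"` are assumptions chosen for illustration, and each candidate category simply receives a 1.

```python
def fit_categories(values, delimiter="|"):
    """Collect the unambiguous categories seen in the training values."""
    seen = set()
    for value in values:
        # 'mathematics|physics' contributes both 'mathematics' and 'physics'.
        seen.update(value.split(delimiter))
    return sorted(seen)


def transform(values, categories, delimiter="|"):
    """Return one multi-hot row per value, with one column per category."""
    rows = []
    for value in values:
        candidates = set(value.split(delimiter))
        # An ambiguous value sets every candidate column to 1.
        rows.append([1 if c in candidates else 0 for c in categories])
    return rows


departments = ["mathematics", "physics", "mathematics|physics", "chemistry"]
categories = fit_categories(departments)
encoded = transform(departments, categories)

print(categories)  # ['chemistry', 'mathematics', 'physics']
for row in encoded:
    print(row)
```

An unambiguous value reduces to an ordinary one-hot row, so the ambiguous rows are the only place where the output differs from one-hot encoding.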
I would appreciate a discussion of the potential and usefulness of this type of implementation. I wonder whether at least the following updates are necessary:
- implement inverse_transform function
- handle impute parameter
- implement not only 'or' type delimiter, but also 'and' type delimiter
- reflect latest research in the field of data mining.
The differences between the multi-hot encoder and the related issues are roughly as follows:
- #77 : I simply encode an ambiguous or dirty feature in the same way as one-hot encoding. I do not consider similarity between categories.
- #136 : Target encoding may be useful for ambiguous or dirty features, but I focus on simple multi-hot encoding.
The merits of multi-hot encoding are its simplicity and efficiency.
- Simplicity: We only need to insert a delimiter string into the dirty categorical feature.
- Efficiency: If we instead prepared a mapping from every ambiguous or dirty category to the unambiguous categories it covers, the j-th feature would need a lot of memory (O(2^C_j), where C_j is the cardinality of the j-th feature). Multi-hot encoding needs only O(C_j) memory.
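The gap between the two memory bounds can be checked with a little arithmetic. The sketch below assumes the explicit mapping would hold one entry per non-empty subset of the C_j unambiguous categories (my reading of the O(2^C_j) bound), while multi-hot encoding needs one column per category. Both helper names are hypothetical.

```python
def explicit_mapping_entries(cardinality):
    # Assumption: every non-empty subset of categories is a possible
    # ambiguous value and gets its own mapping entry.
    return 2 ** cardinality - 1


def multi_hot_columns(cardinality):
    # Multi-hot encoding keeps one column per unambiguous category.
    return cardinality


for c in (3, 10, 20):
    print(c, explicit_mapping_entries(c), multi_hot_columns(c))
# 3 7 3
# 10 1023 10
# 20 1048575 20
```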
Soon I will send a pull request. I look forward to hearing from you.
> I wonder if, at least, following updates are necessary: implement inverse_transform function

Inverse transform is not necessary.

> handle impute parameter

The functionality of this argument is being overhauled, so it can be ignored for now.

> implement not only 'or' type delimiter, but also 'and' type delimiter

This is up to you.

> reflect latest research in the field of data mining.

It would be nice if the code included a reference to some article or a blog post that would illustrate on a trivial example how the encoder works.
Thank you for your response.
> It would be nice if the code included a reference to some article or a blog post that would illustrate on a trivial example how the encoder works.

OK, I will write a blog post that uses this encoder and reflects the discussion in the pull request.