
[Feature request] Benchmark of encoding strategies for different tasks

Open Nathan-Furnal opened this issue 6 years ago • 1 comment

I know I'm asking for a lot here, but it would be great to have some idea of which encoding strategies are useful in which cases: classification vs. regression, or when an encoding is not useful at all.

Basically, some benchmarks and heuristics, plus a possible explanation of why some encoders work well in certain cases while others do not.

I don't have the expertise myself, but I think it would be awesome to have something along those lines.

Nathan-Furnal avatar Oct 16 '19 22:10 Nathan-Furnal

The benchmarks are discussed in https://github.com/scikit-learn-contrib/categorical-encoding/issues/46. The results for classification are in examples/benchmarking_large/output.

We don't currently have a benchmark for regression; if you would be willing to write one, that would be awesome.
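
If it helps anyone get started, here is a minimal sketch of what a regression benchmark run could look like. The synthetic data, the Ridge model, and the encoder shortlist are placeholders of my own, not part of the existing benchmark suite:

```python
import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: a real benchmark would loop over a suite of
# OpenML regression datasets with categorical features instead.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "city": rng.choice([f"city_{i}" for i in range(50)], size=n),   # high cardinality
    "color": rng.choice(["red", "green", "blue"], size=n),          # low cardinality
})
city_effect = {f"city_{i}": i * 0.1 for i in range(50)}
y = X["city"].map(city_effect) + rng.normal(scale=1.0, size=n)

encoders = {
    "onehot": ce.OneHotEncoder(),
    "target": ce.TargetEncoder(),
    "ordinal": ce.OrdinalEncoder(),
}
for name, encoder in encoders.items():
    # Encoder and regressor are fitted together inside each CV fold.
    pipe = make_pipeline(encoder, Ridge())
    rmse = -cross_val_score(pipe, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: RMSE = {rmse:.2f}")
```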

Also, we do not have a meta-learning study that would tell us when to use which encoder. Help in this regard would be greatly appreciated. How to do it? I think it would be best to use Python to connect to OpenML, download the datasets and their metadata (row count, column count, ratio of the count of unique values in a column to the row count, ...), encode the datasets with different encoders, train and evaluate classifiers on the encoded data, and finally train a metamodel to predict the models' accuracy from the dataset metadata and the encoder used.
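
A rough sketch of that workflow, assuming the `openml` package is installed; the dataset IDs, the 3-fold logistic-regression evaluation, and the random-forest metamodel are all illustrative choices, not a prescription:

```python
import openml
import pandas as pd
import category_encoders as ce
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

dataset_ids = [31, 50]  # two small OpenML datasets with categorical features (illustrative)
encoders = {
    "onehot": ce.OneHotEncoder(),
    "target": ce.TargetEncoder(),
    "hashing": ce.HashingEncoder(),
}

records = []
for did in dataset_ids:
    dataset = openml.datasets.get_dataset(did)
    X, y, categorical, _ = dataset.get_data(target=dataset.default_target_attribute)
    cat_cols = [c for c, is_cat in zip(X.columns, categorical) if is_cat]
    # Dataset metadata (meta-features) the metamodel will learn from.
    meta = {
        "n_rows": len(X),
        "n_cols": X.shape[1],
        "max_cardinality_ratio": max(
            (X[c].nunique() / len(X) for c in cat_cols), default=0.0
        ),
    }
    for name, encoder in encoders.items():
        # Encode, train and evaluate a classifier on the encoded data.
        pipe = make_pipeline(encoder, LogisticRegression(max_iter=1000))
        accuracy = cross_val_score(pipe, X, y, cv=3).mean()
        records.append({**meta, "encoder": name, "accuracy": accuracy})

meta_df = pd.DataFrame(records)
# Metamodel: predict accuracy from the dataset metadata and the encoder used.
metamodel = RandomForestRegressor()
metamodel.fit(pd.get_dummies(meta_df.drop(columns="accuracy")), meta_df["accuracy"])
```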

Personally, I use one-hot encoding and its relatives for interoperability in regression models, supervised encoders when I have to deal with high-cardinality categorical attributes, and the hashing encoder when there is not enough memory to store a trained encoder. A rough codification of these rules of thumb is sketched below.
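
For what it's worth, those heuristics could be written down roughly like this; the cardinality threshold of 50 is an arbitrary placeholder, not a recommendation from the library:

```python
import category_encoders as ce

def pick_encoder(series, memory_constrained=False, high_cardinality_threshold=50):
    """Return an encoder instance for a single categorical column."""
    if memory_constrained:
        return ce.HashingEncoder()    # fixed memory footprint, no fitted mapping to store
    if series.nunique() > high_cardinality_threshold:
        return ce.TargetEncoder()     # supervised, copes with high cardinality
    return ce.OneHotEncoder()         # simple default otherwise
```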

janmotl avatar Oct 17 '19 09:10 janmotl