Feel free to vectorize or parallelize it, as I don't have it on my to-do list. Btw., the target encoder already has some parallelization.
@rhiever: PMLB is awesome! However, do you/can you provide datasets with unprocessed categorical attributes? When I looked at the repository, all categorical attributes were already encoded with one-hot or ordinal...
I wrote a draft of the benchmark; it is at: ~~https://github.com/janmotl/categorical-encoding/tree/binary/examples/benchmarking_large~~ **Edit**: it is now in the master branch under `examples/benchmarking_large`. What it does: it takes 65 datasets and applies different encoders...
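For illustration, a minimal sketch of what such a benchmark loop can look like (the dataset names and the `load_dataset` helper are hypothetical placeholders, not part of the repository; the encoders and scoring come from `category_encoders` and scikit-learn):

```python
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical helper: returns features X (with raw categorical columns) and a binary target y.
from my_datasets import load_dataset  # assumption, not part of the repository

encoders = {
    'OneHot': ce.OneHotEncoder(),
    'Ordinal': ce.OrdinalEncoder(),
    'Target': ce.TargetEncoder(),
    'LeaveOneOut': ce.LeaveOneOutEncoder(),
}

results = []
for dataset_name in ['audiology', 'car', 'mushroom']:  # illustrative subset of the 65 datasets
    X, y = load_dataset(dataset_name)
    for encoder_name, encoder in encoders.items():
        # Encode inside the pipeline so the encoder is fit only on the training folds.
        model = Pipeline([
            ('encoder', encoder),
            ('classifier', LogisticRegression(max_iter=1000)),
        ])
        auc = cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()
        results.append((dataset_name, encoder_name, auc))
```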
@wdm0006 I added measurement of the encoders' memory consumption. The code uses `memory_profiler`. However, I am not overly happy with the deployment of `memory_profiler` because it heavily impacts the runtime and...
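As a sketch of the measurement approach (not necessarily the exact code in the benchmark), `memory_usage` from `memory_profiler` samples the process memory at a fixed interval while the callable runs, which is also where the runtime overhead comes from:

```python
from memory_profiler import memory_usage

def measure_peak_memory(encoder, X, y):
    # Run fit_transform under the profiler and record the peak resident memory (in MiB).
    # A smaller `interval` gives better resolution at the cost of slower wall-clock time.
    peak = memory_usage((encoder.fit_transform, (X, y)), interval=0.01, max_usage=True)
    # Older versions of memory_profiler return a one-element list instead of a float.
    return peak[0] if isinstance(peak, list) else peak
```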
@rhiever I am concerned about the parameter tuning as well. However, I am more concerned about the parameters of the encoders than of the classifiers (simply because of the orientation...
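One way to tune encoder parameters alongside classifier parameters is to put the encoder into a scikit-learn pipeline and grid-search over both. A minimal sketch, assuming `TargetEncoder`'s `smoothing` parameter and purely illustrative grid values:

```python
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('encoder', ce.TargetEncoder()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Tune encoder and classifier hyperparameters together via cross-validated grid search.
param_grid = {
    'encoder__smoothing': [0.1, 1.0, 10.0],  # illustrative values
    'classifier__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
# search.fit(X, y)  # X with raw categorical columns, y a binary target
```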
I have uploaded a csv with the [results](https://github.com/janmotl/categorical-encoding/tree/binary/examples/benchmarking_large/output). Brief observations:

1. OneHotEncoding is, on average, the best encoder (at least based on testing AUC).
2. Each of the remaining encoders...
Updated results are now in PR #110 ([link](https://github.com/scikit-learn-contrib/categorical-encoding/blob/9e2385f00975bcba7926396c6563eb8488d778f6/examples/benchmarking_large/output/result_2018-09-02.csv)). Notable changes:

1. Added Weight of Evidence encoder.
2. Impact encoders (Target encoder, Leave One Out and Weight of Evidence) should now...
Yes, LOO and WOE overfit, particularly with decision tree, gradient boosting and random forest. Unfortunately, the graphs are not directly comparable because they are based on different subsets of datasets....
I reran the benchmark on older versions of the code, and by applying the bisection method it turned out that the following code in LOO:

```python
def fit_transform(self, X, y=None, **fit_params):
    """ ...
```
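For context (this is an illustration of the technique, not the repository's actual code), a leave-one-out target encoding of the training data is usually computed by excluding the current row's own target, so that the label does not leak into the feature. A minimal sketch with pandas:

```python
import pandas as pd

def leave_one_out_encode(categories: pd.Series, target: pd.Series) -> pd.Series:
    """Encode each training row by the mean target of its category, excluding the row itself."""
    sums = target.groupby(categories).transform('sum')
    counts = categories.groupby(categories).transform('count')
    # Subtract the current row's target; singleton categories yield NaN and fall back
    # to the global target mean.
    encoded = (sums - target) / (counts - 1)
    return encoded.fillna(target.mean())
```

If `fit_transform` instead encodes training rows with the plain per-category mean (including the current row), the target leaks into the features, which would explain the overfitting observed with tree-based models above.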
@eddiepyang The benchmark is now in this repository under `examples/benchmarking_large`.