Feel free to vectorize or parallelize it, as I don't have it on my to-do list. Btw., the target encoder already has some parallelization.
@rhiever: PMLB is awesome! However, do you/can you provide datasets with unprocessed categorical attributes? When I looked at the repository, all categorical attributes were already encoded with one-hot or ordinal...
I wrote a draft of the benchmark; it is at: ~~https://github.com/janmotl/categorical-encoding/tree/binary/examples/benchmarking_large~~ **Edit**: it is now in the master branch under `examples/benchmarking_large`. What it does: it takes 65 datasets and applies different encoders...
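For illustration, a minimal sketch of what such a benchmark loop can look like (the dataset names and the `load_dataset` helper are hypothetical placeholders, not part of the repository; the encoders and scoring come from `category_encoders` and scikit-learn):

```python
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical helper: returns features X (with raw categorical columns) and a binary target y.
from my_datasets import load_dataset  # assumption, not part of the repository

encoders = {
    'OneHot': ce.OneHotEncoder(),
    'Ordinal': ce.OrdinalEncoder(),
    'Target': ce.TargetEncoder(),
    'LeaveOneOut': ce.LeaveOneOutEncoder(),
}

results = []
for dataset_name in ['audiology', 'car', 'mushroom']:  # illustrative subset of the 65 datasets
    X, y = load_dataset(dataset_name)
    for encoder_name, encoder in encoders.items():
        # Encode inside the pipeline so the encoder is fit only on the training folds.
        model = Pipeline([
            ('encoder', encoder),
            ('classifier', LogisticRegression(max_iter=1000)),
        ])
        auc = cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()
        results.append((dataset_name, encoder_name, auc))
```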
@wdm0006 I added measurement of the encoders' memory consumption. The code uses `memory_profiler`. However, I am not overly happy with the deployment of `memory_profiler` because it heavily impacts the runtime and...
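As a sketch of the measurement approach (not necessarily the exact code in the benchmark), `memory_usage` from `memory_profiler` samples the process memory at a fixed interval while the callable runs, which is also where the runtime overhead comes from:

```python
from memory_profiler import memory_usage

def measure_peak_memory(encoder, X, y):
    # Run fit_transform under the profiler and record the peak resident memory (in MiB).
    # A smaller `interval` gives better resolution at the cost of slower wall-clock time.
    peak = memory_usage((encoder.fit_transform, (X, y)), interval=0.01, max_usage=True)
    # Older versions of memory_profiler return a one-element list instead of a float.
    return peak[0] if isinstance(peak, list) else peak
```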
@rhiever I am concerned about the parameter tuning as well. However, I am more concerned about the parameters of the encoders than of the classifiers (simply because of the orientation...
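One way to tune encoder parameters alongside classifier parameters is to put the encoder into a scikit-learn pipeline and grid-search over both. A minimal sketch, assuming `TargetEncoder`'s `smoothing` parameter and purely illustrative grid values:

```python
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('encoder', ce.TargetEncoder()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Tune encoder and classifier hyperparameters together via cross-validated grid search.
param_grid = {
    'encoder__smoothing': [0.1, 1.0, 10.0],  # illustrative values
    'classifier__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
# search.fit(X, y)  # X with raw categorical columns, y a binary target
```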
I have uploaded a csv with the [results](https://github.com/janmotl/categorical-encoding/tree/binary/examples/benchmarking_large/output). Brief observations:

1. OneHotEncoding is, on average, the best encoder (at least based on testing AUC).
2. Each of the remaining encoders...
Updated results are now in PR #110 ([link](https://github.com/scikit-learn-contrib/categorical-encoding/blob/9e2385f00975bcba7926396c6563eb8488d778f6/examples/benchmarking_large/output/result_2018-09-02.csv)). Notable changes:

1. Added Weight of Evidence encoder.
2. Impact encoders (Target encoder, Leave One Out and Weight of Evidence) should now...
Yes, LOO and WOE overfit, particularly with decision tree, gradient boosting and random forest. Unfortunately, the graphs are not directly comparable because they are based on different subsets of datasets....
I reran the benchmark on older versions of the code, and by applying the bisection method it turned out that the following code in LOO:

```python
def fit_transform(self, X, y=None, **fit_params):
    """ ...
```
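For context (this is an illustration of the technique, not the repository's actual code), a leave-one-out target encoding of the training data is usually computed by excluding the current row's own target, so that the label does not leak into the feature. A minimal sketch with pandas:

```python
import pandas as pd

def leave_one_out_encode(categories: pd.Series, target: pd.Series) -> pd.Series:
    """Encode each training row by the mean target of its category, excluding the row itself."""
    sums = target.groupby(categories).transform('sum')
    counts = categories.groupby(categories).transform('count')
    # Subtract the current row's target; singleton categories yield NaN and fall back
    # to the global target mean.
    encoded = (sums - target) / (counts - 1)
    return encoded.fillna(target.mean())
```

If `fit_transform` instead encodes training rows with the plain per-category mean (including the current row), the target leaks into the features, which would explain the overfitting observed with tree-based models above.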
@eddiepyang The benchmark is now in this repository under `examples/benchmarking_large`.