
SMILE

Open haifengl opened this issue 8 years ago • 9 comments

Thanks for the great work! We have an open source machine learning library called SMILE (https://github.com/haifengl/smile). We have incorporated your benchmark (https://github.com/haifengl/smile/blob/master/benchmark/src/main/scala/smile/benchmark/Airline.scala). We found that our system is much faster on this data set. For 100K training rows on a 4-core machine, we can train a random forest with 500 trees in 100 seconds, and gradient boosted trees with 300 trees in 180 seconds. Projected to 32 cores, I think we will be much faster than all the tools you tested. You can try it out by cloning our project, then:

sbt benchmark/run

This also includes a benchmark on the USPS data, which you may ignore. Thanks!

haifengl avatar Dec 23 '15 16:12 haifengl

A couple of questions about your benchmark. First, about your data encoding: do you use the original 8 variables directly, or convert them to another representation?

Besides, the data is highly unbalanced (positive : negative is about 1 : 4). Do you rebalance the data before training?

Can you also report other metrics besides AUC, such as accuracy, sensitivity, and specificity? None of them is perfect, but it would be better to report more than AUC alone. Thanks!

haifengl avatar Dec 23 '15 16:12 haifengl

BTW, our random forest AUC is low because the prediction probabilities are derived from votes instead of from leaf weights. We will update the calculation ASAP.

The AUC of our gradient boosted trees matches other systems.

haifengl avatar Dec 23 '15 16:12 haifengl

Thanks, I'll try it out.

Re: questions. I use the original (categorical) encoding for the algos/implementations that can deal with it, and 1-hot encoding for those that cannot.
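A minimal sketch of what 1-hot encoding means here, for readers unfamiliar with it: each categorical level becomes its own 0/1 indicator column. The column names and levels below are hypothetical, not taken from the actual airline data pipeline.

```java
import java.util.Arrays;
import java.util.List;

public class OneHot {
    // Encode one categorical value as a 0/1 indicator vector over the known levels.
    public static double[] encode(String value, List<String> levels) {
        double[] v = new double[levels.size()];
        int i = levels.indexOf(value);
        if (i >= 0) v[i] = 1.0;   // unseen levels map to the all-zero vector
        return v;
    }

    public static void main(String[] args) {
        List<String> carriers = List.of("AA", "DL", "UA"); // hypothetical carrier levels
        System.out.println(Arrays.toString(encode("DL", carriers))); // [0.0, 1.0, 0.0]
    }
}
```

Tree-based learners that handle categoricals natively can split on the original column directly, which is why the benchmark keeps both encodings.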

1:4 is not really "highly" unbalanced (1:100 would be), so I do not rebalance.

Surely, AUC is not "complete", but it captures a lot of what I'm interested in.

Yes, for RF averaging probabilities gives better AUC than averaging votes.
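To illustrate the vote-averaging vs probability-averaging distinction with a toy sketch (the per-tree probabilities below are made up): hard votes threshold each tree at 0.5 before averaging, which collapses the score to a few discrete values, while averaging the leaf probabilities keeps a finer-grained ranking and so tends to give better AUC.

```java
public class ForestScore {
    // treeProbs: each tree's predicted probability of class 1 for one sample.

    // Vote averaging: threshold each tree first, then average the 0/1 votes.
    public static double averageVotes(double[] treeProbs) {
        double s = 0;
        for (double p : treeProbs) s += (p > 0.5) ? 1.0 : 0.0;
        return s / treeProbs.length;
    }

    // Probability averaging: average the soft scores directly.
    public static double averageProbs(double[] treeProbs) {
        double s = 0;
        for (double p : treeProbs) s += p;
        return s / treeProbs.length;
    }

    public static void main(String[] args) {
        double[] probs = {0.55, 0.60, 0.40, 0.65}; // hypothetical per-tree outputs
        System.out.println(averageVotes(probs));   // 0.75 (only K+1 distinct values possible)
        System.out.println(averageProbs(probs));   // 0.55 (continuous, fewer ranking ties)
    }
}
```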

szilard avatar Dec 26 '15 21:12 szilard

Thanks! There are two real-valued variables (departure time and distance). Do you also treat them as categorical?

This data is unbalanced. Even though AUC is at about 70%, the sensitivity is only about 10% (at 99% specificity), which is pretty much useless for this particular problem. Our implementation can assign different weights to the classes. By adjusting the weights, we can achieve much higher sensitivity (at the cost of lower specificity) and a lower AUC. I feel that this is more meaningful in practice. Since your benchmark is mostly about speed and memory usage, it may not be important.
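For reference, sensitivity and specificity are just per-class recall computed from confusion-matrix counts. The counts below are reconstructed approximately from the 100K random-forest run reported later in this thread (21617 test positives, 78383 test negatives); they are illustrative, not exact output.

```java
public class Metrics {
    // Sensitivity = recall on positives; specificity = recall on negatives.
    public static double sensitivity(int tp, int fn) { return (double) tp / (tp + fn); }
    public static double specificity(int tn, int fp) { return (double) tn / (tn + fp); }

    public static void main(String[] args) {
        // Approximate counts matching the reported 2.17% sensitivity / 99.62% specificity.
        int tp = 469, fn = 21148, tn = 78085, fp = 298;
        System.out.printf("sensitivity = %.2f%%%n", 100 * sensitivity(tp, fn));
        System.out.printf("specificity = %.2f%%%n", 100 * specificity(tn, fp));
        // Raising the positive-class weight (or lowering the decision threshold)
        // trades false negatives for false positives: sensitivity up, specificity down.
    }
}
```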

haifengl avatar Dec 27 '15 00:12 haifengl

Have you tried it? Any help I can do? Thanks!

haifengl avatar Jan 07 '16 19:01 haifengl

No, sorry. And I'll have very limited time over the next 3-4 weeks for sure. How about you take a look at https://github.com/szilard/benchm-ml/tree/master/z-other-tools, run random forests with 100 trees on 32 cores on the 1M dataset, and tell me the run time and AUC?

szilard avatar Jan 07 '16 20:01 szilard

No problem. I did run the 1M dataset on my 4-core Mac (while using it for other things). Here is the print-out:

--------------- 100K samples ---------------------
class: "N", "Y"
train data size: 100000, test data size: 100000
train data positive : negative = 19044 : 80956
test data positive : negative = 21617 : 78383
Training Random Forest of 500 trees...
runtime: 40691.435646 ms
Accuracy = 78.56%
Sensitivity = 2.17%
Specificity = 99.62%
AUC = 69.05%
OOB error rate = 18.93%
runtime: 6321.360014 ms

Training Gradient Boosted Trees of 300 trees...
Accuracy = 79.66%
Sensitivity = 8.84%
Specificity = 99.19%
AUC = 72.50%

Training AdaBoost of 300 trees...
runtime: 6180.334174 ms
Accuracy = 79.06%
Sensitivity = 7.85%
Specificity = 98.70%
AUC = 71.76%

--------------- 1M samples ---------------------
class: "N", "Y"
train data size: 1000000, test data size: 100000
train data positive : negative = 192982 : 807018
test data positive : negative = 21617 : 78383
Training Random Forest of 500 trees...
runtime: 1436028.498601 ms
Accuracy = 78.41%
Sensitivity = 0.15%
Specificity = 99.99%
AUC = 69.91%
OOB error rate = 19.26%

Training Gradient Boosted Trees of 300 trees...
runtime: 83840.278901 ms
Accuracy = 79.63%
Sensitivity = 8.13%
Specificity = 99.35%
AUC = 72.79%

Training AdaBoost of 300 trees...
runtime: 96979.686961 ms
Accuracy = 79.15%
Sensitivity = 8.32%
Specificity = 98.68%
AUC = 71.65%

Note that I report other metrics besides AUC and also run AdaBoost. For gradient boosting, I use your second setting (300 trees). Thanks!

haifengl avatar Jan 07 '16 20:01 haifengl

My running times are in milliseconds, so it is about 1436 seconds for random forest, 84 seconds for gradient boosting, and 97 seconds for AdaBoost on the 1M dataset. Since random forest training scales linearly with cores, I expect we would use about 1/8 of the time on a 32-core box. We also parallelize tree training in gradient boosting and AdaBoost; I expect less time there too, though not as little as 1/8.
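Spelling out the projection arithmetic (a naive linear-scaling assumption, as the comment itself notes, ignoring memory bandwidth and the sequential parts of boosting):

```java
public class Projection {
    public static void main(String[] args) {
        double rfSeconds = 1436.0;       // measured random forest time on 4 cores
        int coresNow = 4, coresTarget = 32;
        // Perfect linear scaling: time shrinks by the core ratio (8x here).
        double projected = rfSeconds * coresNow / coresTarget;
        System.out.println(projected);   // 179.5 seconds, an optimistic lower bound
    }
}
```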

haifengl avatar Jan 07 '16 20:01 haifengl

BTW, we calculate AUC with our own implementation (https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/validation/AUC.java), which is based on the Mann-Whitney U statistic. I am not sure if it is the same as yours. If you want, I can ship you the prediction results and you can calculate it with your AUC method. Thanks!
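For readers comparing implementations, here is an independent sketch of the rank-based (Mann-Whitney) formulation: rank all scores (tied scores get their average rank), sum the ranks of the positives R+, then AUC = (R+ - n+(n+ + 1)/2) / (n+ * n-). This is not SMILE's code, just a self-contained illustration of the same statistic.

```java
import java.util.Arrays;

public class RankAuc {
    public static double auc(double[] scores, int[] labels) {
        int n = scores.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(scores[a], scores[b]));

        // Assign ranks 1..n, averaging ranks over groups of tied scores.
        double[] rank = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && scores[order[j + 1]] == scores[order[i]]) j++;
            double avg = (i + j) / 2.0 + 1.0;
            for (int k = i; k <= j; k++) rank[order[k]] = avg;
            i = j + 1;
        }

        double rPos = 0;
        int nPos = 0;
        for (int k = 0; k < n; k++) {
            if (labels[k] == 1) { rPos += rank[k]; nPos++; }
        }
        int nNeg = n - nPos;
        return (rPos - nPos * (nPos + 1) / 2.0) / ((double) nPos * nNeg);
    }

    public static void main(String[] args) {
        double[] s = {0.1, 0.4, 0.35, 0.8};
        int[] y =    {0,   0,   1,    1};
        System.out.println(auc(s, y)); // 0.75: 3 of the 4 positive/negative pairs are ranked correctly
    }
}
```

Equivalently, AUC is the probability that a random positive scores higher than a random negative, so two implementations should agree up to how they break ties.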

haifengl avatar Jan 07 '16 20:01 haifengl