DecisionTree.jl
Speed comparison with R
Hello, I tried the following two snippets, in Julia and R respectively:
Ytrain = rand(2000, 1)
Xtrain = rand(2000, 60)

addprocs(3)
using DecisionTree

@time model = build_forest(Ytrain[:,1], Xtrain, 20, 200, 5, 1)
library(randomForest)

X = replicate(60, runif(2000))
Y = runif(2000)

ptm <- proc.time()
rf = randomForest(X, Y, importance = FALSE, ntree = 200, do.trace = 0, nodesize = 5)
proc.time() - ptm
In Julia it takes around 30 seconds and in R it takes around 10. I think the configuration of build_forest and randomForest is the same (randomForest takes one third of the variables at each node, which is exactly 20). And as far as I know, randomForest in R can't use parallelization (at least this library can't), and Julia should be way faster than R.
So, what might be causing the difference in speed?
Yeah, that's a significant speed difference. I think the R package uses streaming mean-squared-error for the cost function, which is much more efficient. I've been meaning to implement it, just haven't found the time.
Not sure if it will make a difference here, but running in global scope often prevents the compiler from optimizing fully. You might try wrapping the code in a function, along the lines of the sketch below. Take a look at the performance tips section in the Julia manual.
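Something like this (just a sketch; it reuses the build_forest call from your snippet, and train_forest is only an illustrative name):

using DecisionTree

# Wrapping the work in a function gives the compiler concrete argument types
# to specialize on, instead of untyped globals.
function train_forest(labels, features)
    return build_forest(labels, features, 20, 200, 5, 1)
end

Ytrain = rand(2000, 1)
Xtrain = rand(2000, 60)

train_forest(Ytrain[:, 1], Xtrain)         # first call includes compilation
@time train_forest(Ytrain[:, 1], Xtrain)   # time the second call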
Cheers, Kevin
I tried the function approach. It did not help, but it is certainly good to know, thank you for pointing it out. Where can I learn more about that streaming MSE? I tried to google it, but failed. Is it a technique connected directly to random forests?
Thank you for your time, you're doing a great job
The scikit-learn decision tree docs refer to it as a running count: http://scikit-learn.org/stable/modules/tree.html#complexity
Ben Hamner had a go at it, but I'm not sure if it works as intended: https://github.com/benhamner/MachineLearning.jl/blob/master/src/decision_tree.jl#L163
I haven't seen much more on it.
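For context, the gist of the running/streaming approach is that when candidate thresholds are scanned along a sorted feature, the squared error of the two children can be updated from running sums in O(1) per candidate, rather than recomputed from scratch each time. A rough sketch of the idea (not what DecisionTree.jl does today; best_split_streaming is just an illustrative name):

# Streaming (running-sum) MSE split search along one feature.
# Each candidate split moves one sample from the right child to the left,
# so the sums are updated in O(1) instead of re-scanning both children.
function best_split_streaming(x::Vector{Float64}, y::Vector{Float64})
    order = sortperm(x)
    n = length(y)
    left_sum = 0.0; left_sq = 0.0; left_n = 0
    right_sum = sum(y); right_sq = sum(abs2, y); right_n = n
    best_err = Inf; best_thresh = NaN
    for i in 1:n-1
        yi = y[order[i]]
        left_sum  += yi; left_sq  += yi^2; left_n  += 1
        right_sum -= yi; right_sq -= yi^2; right_n -= 1
        x[order[i]] == x[order[i+1]] && continue   # can't split between equal values
        # sum of squared errors of a child = sum(y^2) - (sum(y))^2 / n
        err = (left_sq - left_sum^2 / left_n) + (right_sq - right_sum^2 / right_n)
        if err < best_err
            best_err = err
            best_thresh = (x[order[i]] + x[order[i+1]]) / 2
        end
    end
    return best_thresh, best_err
end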
Any plans to optimize it any time soon?
Thanks for the great work you have done!
Some optimization along these lines was done in #31. I don't entirely understand what the running count is supposed to be, but I could have a look eventually. I would be curious to see a proper new comparison with R or scikit-learn, using @benchmark. The comparison with R is from 2014, and Julia itself has been heavily optimized since then.
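For anyone who wants to re-run it, a sketch of such a benchmark on the Julia side (it just reuses the call from the original post; samples and seconds are capped because each fit of a 200-tree forest takes seconds):

using BenchmarkTools, DecisionTree

Ytrain = rand(2000)
Xtrain = rand(2000, 60)

# Interpolate the data with $ so the benchmark doesn't time global lookups.
@benchmark build_forest($Ytrain, $Xtrain, 20, 200, 5, 1) samples=5 seconds=120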
I've run the above tests using BenchmarkTools and using the same data across R and Julia.
For Julia, the single core result is 14.5s
For R, the single core result is 8.9s
However, with multiple cores (8 threads) I get 3.6s in Julia, so the multiple cores definitely help. This is using a Core i7-6820 with 4 cores/8 threads on Windows 10, with Julia 0.6 and R 3.4.1.
There has been some progress on this front with the DecisionTree v0.8.1 release (which requires Julia v0.7-v1.0).
Note that even though features and labels of type Array{Any} are supported, it is highly recommended that the data be cast to explicit types (i.e., with float.(), string.(), etc.). This significantly improves model training and prediction execution times.
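For example, if the data had come in as Any-typed arrays, casting before training would look roughly like this (a sketch only; raw_features and raw_labels are placeholder names, and build_forest is called with its default hyperparameters):

using DecisionTree

# Hypothetical Any-typed inputs, as might come from a mixed-type file read.
raw_features = convert(Matrix{Any}, rand(2000, 60))
raw_labels   = convert(Vector{Any}, rand(2000))

features = float.(raw_features)    # Matrix{Any} -> Matrix{Float64}
labels   = float.(raw_labels)      # Vector{Any} -> Vector{Float64} (regression)
# For classification labels, use string.(raw_labels) instead.

model = build_forest(labels, features)   # default hyperparameters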