DecisionTree.jl
Speed comparison with R
Hello, I tried the following two snippets, in Julia and R respectively:
Ytrain = rand(2000, 1)
Xtrain = rand(2000, 60)

addprocs(3)
using DecisionTree

@time model = build_forest(Ytrain[:,1], Xtrain, 20, 200, 5, 1)
library(randomForest)

X = replicate(60, runif(2000))
Y = runif(2000)

ptm <- proc.time()
rf = randomForest(X, Y, importance = FALSE, ntree = 200, do.trace = 0, nodesize = 5)
proc.time() - ptm
In Julia it takes around 30 seconds and in R it takes around 10. I think the configuration of build_forest and randomForest is the same (randomForest takes one third of the variables at each node, which is exactly 20). And as far as I know, randomForest in R can't use parallelization (at least this library can't), and Julia should be way faster than R.
So, what might be causing the difference in speed?
Yeah, that's a significant speed difference. I think the R package uses streaming mean-squared-error for the cost function, which is much more efficient. I've been meaning to implement it, just haven't found the time.
Not sure if it will make a difference here, but running in global scope often prevents the compiler from optimizing fully. You might try wrapping the code in a function, along the lines of the sketch below. Take a look at the performance tips section in the Julia manual.
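Something like this (just a sketch; it reuses the build_forest call from your snippet, and train_forest is only an illustrative name):

using DecisionTree

# Wrapping the work in a function gives the compiler concrete argument types
# to specialize on, instead of untyped globals.
function train_forest(labels, features)
    return build_forest(labels, features, 20, 200, 5, 1)
end

Ytrain = rand(2000, 1)
Xtrain = rand(2000, 60)

train_forest(Ytrain[:, 1], Xtrain)         # first call includes compilation
@time train_forest(Ytrain[:, 1], Xtrain)   # time the second call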
Cheers, Kevin
I tried the function approach. It did not help, but it is certainly good to know, thank you for pointing it out. Where can I learn more about that streaming MSE? I tried to google it, but failed. Is it a technique connected directly to random forests?
Thank you for your time, you're doing a great job
The scikit-learn decision tree docs refer to it as a running count: http://scikit-learn.org/stable/modules/tree.html#complexity
Ben Hamner had a go at it, but I'm not sure if it works as intended: https://github.com/benhamner/MachineLearning.jl/blob/master/src/decision_tree.jl#L163
I haven't seen much more on it.
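For context, the gist of the running/streaming approach is that when candidate thresholds are scanned along a sorted feature, the squared error of the two children can be updated from running sums in O(1) per candidate, rather than recomputed from scratch each time. A rough sketch of the idea (not what DecisionTree.jl does today; best_split_streaming is just an illustrative name):

# Streaming (running-sum) MSE split search along one feature.
# Each candidate split moves one sample from the right child to the left,
# so the sums are updated in O(1) instead of re-scanning both children.
function best_split_streaming(x::Vector{Float64}, y::Vector{Float64})
    order = sortperm(x)
    n = length(y)
    left_sum = 0.0; left_sq = 0.0; left_n = 0
    right_sum = sum(y); right_sq = sum(abs2, y); right_n = n
    best_err = Inf; best_thresh = NaN
    for i in 1:n-1
        yi = y[order[i]]
        left_sum  += yi; left_sq  += yi^2; left_n  += 1
        right_sum -= yi; right_sq -= yi^2; right_n -= 1
        x[order[i]] == x[order[i+1]] && continue   # can't split between equal values
        # sum of squared errors of a child = sum(y^2) - (sum(y))^2 / n
        err = (left_sq - left_sum^2 / left_n) + (right_sq - right_sum^2 / right_n)
        if err < best_err
            best_err = err
            best_thresh = (x[order[i]] + x[order[i+1]]) / 2
        end
    end
    return best_thresh, best_err
end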
Any plans to optimize it any time soon?
Thanks for the great work you have done!
Some optimization along these lines was done in #31. I don't entirely understand what the running count is supposed to be, but I could have a look eventually. I would be curious to see a proper new comparison with R or scikit-learn, using @benchmark. The comparison with R is from 2014, and Julia itself has been heavily optimized since then.
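For anyone who wants to re-run it, a sketch of such a benchmark on the Julia side (it just reuses the call from the original post; samples and seconds are capped because each fit of a 200-tree forest takes seconds):

using BenchmarkTools, DecisionTree

Ytrain = rand(2000)
Xtrain = rand(2000, 60)

# Interpolate the data with $ so the benchmark doesn't time global lookups.
@benchmark build_forest($Ytrain, $Xtrain, 20, 200, 5, 1) samples=5 seconds=120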
I've run the above tests using BenchmarkTools and using the same data across R and Julia.
For Julia, the single core result is 14.5s
For R, the single core result is 8.9s
However, with multiple cores (8 threads) I get 3.6s in Julia, so the multiple cores definitely help. This is using a Core i7-6820 with 4 cores/8 threads on Windows 10, with Julia 0.6 and R 3.4.1.
There has been some progress on this front with the DecisionTree v0.8.1 release (which requires Julia v0.7-v1.0).
Note that even though features and labels of type Array{Any} are supported, it is highly recommended that the data be cast to explicit types (i.e., with float.(), string.(), etc.). This significantly improves model training and prediction execution times.
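For example, if the data had come in as Any-typed arrays, casting before training would look roughly like this (a sketch only; raw_features and raw_labels are placeholder names, and build_forest is called with its default hyperparameters):

using DecisionTree

# Hypothetical Any-typed inputs, as might come from a mixed-type file read.
raw_features = convert(Matrix{Any}, rand(2000, 60))
raw_labels   = convert(Vector{Any}, rand(2000))

features = float.(raw_features)    # Matrix{Any} -> Matrix{Float64}
labels   = float.(raw_labels)      # Vector{Any} -> Vector{Float64} (regression)
# For classification labels, use string.(raw_labels) instead.

model = build_forest(labels, features)   # default hyperparameters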