DecisionTree.jl

Speed comparison with R

Open RobertToth opened this issue 11 years ago • 8 comments

Hello, I tried the following code in Julia and R respectively:

```julia
Ytrain = rand(2000, 1)
Xtrain = rand(2000, 60)

addprocs(3)
using DecisionTree

@time model = build_forest(Ytrain[:, 1], Xtrain, 20, 200, 5, 1)
```

```r
library(randomForest)

X = replicate(60, runif(2000))
Y = runif(2000)

ptm <- proc.time()
rf = randomForest(X, Y, importance = FALSE, ntree = 200, do.trace = 0, nodesize = 5)
proc.time() - ptm
```

In Julia it takes around 30 seconds and in R it takes around 10. I think the configurations of `build_forest` and `randomForest` are the same (randomForest takes one third of the variables at each node, which is exactly 20). And as far as I know, randomForest in R can't use parallelization (at least this library can't), and Julia should be way faster than R.

So, what might be causing the difference in speed?

RobertToth avatar Nov 04 '14 12:11 RobertToth

Yeah, that's a significant speed difference. I think the R package uses streaming mean-squared-error for the cost function, which is much more efficient. I've been meaning to implement it, just haven't found the time.

bensadeghi avatar Nov 05 '14 03:11 bensadeghi

Not sure if it will make a difference here, but running in global scope often prevents the compiler from optimizing fully. You might try wrapping the code in a function. Take a look at the performance tips section in the Julia manual.
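To illustrate the tip with a generic sketch (`sum_squares` is a stand-in example, not DecisionTree code): a loop over a global variable is type-unstable, while the same loop inside a function lets the compiler specialize on concrete argument types.

```julia
# Generic sketch of the performance tip: put hot loops inside a function
# so the compiler can specialize on the argument types, instead of
# working with type-unstable globals.
function sum_squares(v)
    s = 0.0
    for x in v
        s += x * x
    end
    return s
end

v = rand(100_000)
@time sum_squares(v)   # type-stable; compiled once per argument type
```

The same pattern applies here: put the `build_forest` call inside a function that takes `Ytrain` and `Xtrain` as arguments, then time that function (the first call includes compilation).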

Cheers, Kevin


kmsquire avatar Nov 05 '14 06:11 kmsquire

I tried the function approach. It did not help, but it's certainly good to know, thank you for pointing it out. Where can I learn more about that streaming MSE? I tried to google it without luck. Is it a technique tied directly to random forests?

Thank you for your time, you're doing a great job

RobertToth avatar Nov 05 '14 10:11 RobertToth

The scikit-learn decision tree documentation refers to it as a running count: http://scikit-learn.org/stable/modules/tree.html#complexity

Ben Hamner had a go at it, but I'm not sure if it works as intended: https://github.com/benhamner/MachineLearning.jl/blob/master/src/decision_tree.jl#L163

I haven't seen much more on it.
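The idea behind the running count, as I understand it (this is a hedged sketch, not DecisionTree.jl's actual implementation): for a regression split, a node's sum of squared errors can be written as `Σy² − (Σy)²/n`, so if you sort a feature once and then move samples one at a time from the right child to the left child, each candidate threshold costs O(1) running-sum updates instead of an O(n) recomputation, making split evaluation O(n log n) per feature instead of O(n²).

```julia
# Sketch of one-pass split evaluation via running sums.
# Function name and structure are illustrative, not the package's API.
function best_split_sse(x::Vector{Float64}, y::Vector{Float64})
    p  = sortperm(x)
    ys = y[p]
    n  = length(ys)
    # start with every sample in the right child
    sr, sr2, nr = sum(ys), sum(abs2, ys), n
    sl, sl2, nl = 0.0, 0.0, 0
    best_sse, best_i = Inf, 0
    for i in 1:n-1
        # move sample i from the right child to the left child: O(1) update
        yi = ys[i]
        sl += yi;  sl2 += yi^2;  nl += 1
        sr -= yi;  sr2 -= yi^2;  nr -= 1
        x[p[i]] == x[p[i+1]] && continue   # can't split between equal values
        sse = (sl2 - sl^2 / nl) + (sr2 - sr^2 / nr)
        if sse < best_sse
            best_sse, best_i = sse, i
        end
    end
    best_i == 0 && return (NaN, Inf)       # feature is constant
    thresh = (x[p[best_i]] + x[p[best_i+1]]) / 2
    return thresh, best_sse
end
```

The naive alternative recomputes the left and right means for every threshold, which is what makes split search quadratic in the node size.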

bensadeghi avatar Nov 05 '14 12:11 bensadeghi

Any plans to optimize it any time soon?

Thanks for the great work you have done!

ValdarT avatar May 29 '17 13:05 ValdarT

Some optimization along these lines was done in #31. I don't entirely understand what the running count is supposed to be, but I could have a look eventually. I would be curious to see a new proper comparison with R or scikit-learn, using @benchmark. The comparison with R is from 2014, and Julia itself has been heavily optimized since then.
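A sketch of what such a rerun could look like with BenchmarkTools (the `build_forest` arguments follow the original post; having DecisionTree and BenchmarkTools installed is assumed):

```julia
# Hedged sketch: rerun the original comparison with @benchmark, which
# runs the expression many times and excludes compilation from the stats.
using BenchmarkTools, DecisionTree

Ytrain = rand(2000)
Xtrain = rand(2000, 60)

# $-interpolation avoids benchmarking global-variable access
@benchmark build_forest($Ytrain, $Xtrain, 20, 200, 5, 1)
```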

cstjean avatar May 29 '17 13:05 cstjean

I've run the above tests using BenchmarkTools and the same data across R and Julia. For Julia, the single-core result is 14.5s; for R, the single-core result is 8.9s.

However, with multiple cores (8 threads) I get 3.6s in Julia, so the multiple cores definitely help. This is on a Core i7-6820 with 4 cores/8 threads on Windows 10, with Julia 0.6 and R 3.4.1.

niczky12 avatar Aug 14 '17 10:08 niczky12

There has been some progress on this front with the DecisionTree v0.8.1 release (requires Julia v0.7-v1.0).

Note that even though features and labels of type Array{Any} are supported, it is highly recommended that data be cast to explicit types (i.e. with float.(), string.(), etc.). This significantly improves model training and prediction execution times.
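For example (a minimal sketch of the casting step only; the variable names are illustrative), broadcasting `float.` over `Any`-typed containers produces concretely typed arrays that can then be passed to `build_forest`/`build_tree`:

```julia
# Data loaded without explicit types often ends up as Array{Any}.
raw_labels   = Any[rand() for _ in 1:100]
raw_features = convert(Matrix{Any}, rand(100, 5))

# Broadcasting float.() materializes concretely typed arrays,
# which DecisionTree can train on much faster.
labels   = float.(raw_labels)      # Vector{Float64}
features = float.(raw_features)    # Matrix{Float64}
```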

bensadeghi avatar Aug 20 '18 03:08 bensadeghi