
Performance comparison between Dolphin and other frameworks

yunseong opened this issue on Nov 11, 2015 · 5 comments

We've run experiments using the same LR (logistic regression) algorithm with the URL Reputation dataset on multiple frameworks: Dolphin and Vortex. As @beomyeol mentioned at the meeting, we've seen some performance issues in vector computation, data loading, etc. We can also take a look at Spark, because it can run an LR algorithm and its performance turned out to be much faster than Vortex (not sure how it compares to Dolphin yet).

This issue aims to investigate the performance of these frameworks, since we can run the same algorithm on the same dataset. It would be great if we could find some points where performance can be improved.
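For context, the algorithm being compared is presumably plain batch gradient descent for logistic regression. Below is a minimal sketch of one iteration in Scala with Breeze vectors; all names (lrIteration, model, examples, stepSize) are hypothetical, not taken from any of the frameworks:

```scala
import breeze.linalg.{DenseVector, SparseVector}

// One batch-gradient-descent iteration of logistic regression:
//   w <- w - (eta / n) * sum_i (sigmoid(w . x_i) - y_i) * x_i
// All names are illustrative; labels are assumed to be 0 or 1.
def lrIteration(model: DenseVector[Double],
                examples: Seq[(Double, SparseVector[Double])],
                stepSize: Double): DenseVector[Double] = {
  var cumGradient = DenseVector.zeros[Double](model.length)
  for ((label, features) <- examples) {
    val prediction = 1.0 / (1.0 + math.exp(-(model dot features))) // sigmoid
    // Naive form: scaling `features` allocates a fresh vector per example.
    cumGradient = cumGradient + (features * (prediction - label))
  }
  model - (cumGradient * (stepSize / examples.size))
}
```

Note that each addition in this naive form allocates a new vector; the in-place variant discussed later in this thread avoids exactly that.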

As a first step, I'll run the experiment on the Microsoft YARN cluster, which consists of 20 machines (8-core CPU, 8 GB RAM, YARN 2.7.1).

yunseong · Nov 11 '15 12:11

@jsjason, @johnyangk, @beomyeol, @gyeongin and I had a discussion, and it'd be good to check the following:

  • How scalable are the systems? In @gyeongin's experiment, he didn't use the full dataset (URL Reputation, 2.2 GB), but rather a partial one (around 80 MB). When I tried to run the full 2.2 GB, the job failed at some point. I haven't investigated the logs yet, but the exception was not an OOM, since I used SparseVector. Vortex also had some scalability issues, but after we changed it to use HDFS, we could run LR smoothly with at least the 2.2 GB dataset. This might NOT weaken our argument ("elastic memory management is crucial for performance"), but I think it'd be good to check what made the LR implementation not scalable. How does that sound?
  • The performance of vector computation. All systems use sparse vectors, but we used different implementations: Dolphin uses Mahout's SparseVector, and Vortex built a custom vector based on an Array and a Map (@jsjason volunteered to check the one in Spark). As @beomyeol and @gyeongin mentioned that the vector computation appeared quite slow, it'd be good to check 1) how the implementations differ and 2) how long the computation takes in each. For 2), I think we can compare the ComputeTasks' computation time per iteration.
  • The accuracy. The algorithm computes accuracy in each iteration, but the values seem a bit confusing. Here is what I found in the log of Dolphin LR:
0-th iteration accuracy: 0%
1-th iteration accuracy: 65.96%
2-th iteration accuracy: 65.96%
3-th iteration accuracy: 65.96%

The results were almost the same in Vortex (similar accuracy values, fixed after the "1-th iteration", i.e., the 2nd), as I based most of the algorithm on Dolphin's. Even when using the full dataset, the result was still similar. Spark, on the other hand, behaves differently: 1) the accuracy grows as iterations progress, and 2) the final accuracy is higher for the same number of iterations. It'd be worth taking a look at the algorithm for correctness.
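For reference, the accuracy being logged is presumably the fraction of examples whose thresholded prediction matches the label. A minimal sketch of such an evaluation, assuming a 0.5 threshold and Breeze vectors (names are illustrative, not Dolphin's actual code):

```scala
import breeze.linalg.{DenseVector, SparseVector}

// Percentage of examples whose thresholded prediction matches the label.
// The 0.5 threshold and all names here are assumptions for illustration.
def accuracy(model: DenseVector[Double],
             examples: Seq[(Double, SparseVector[Double])]): Double = {
  val correct = examples.count { case (label, features) =>
    val prediction = 1.0 / (1.0 + math.exp(-(model dot features))) // sigmoid
    val predictedLabel = if (prediction >= 0.5) 1.0 else 0.0
    predictedLabel == label
  }
  100.0 * correct / examples.size
}
```

One hypothesis worth checking: an accuracy that jumps once and then stays constant is what you would see if the model collapses to always predicting a single class. If roughly two thirds of the examples belong to one class, that alone yields about 66%, which would point at the update step (step size or gradient sign) rather than at the evaluation code.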

If you have more things you want to check or ask, please feel free to add them. After that, I think this issue can be split into multiple items.

yunseong · Nov 12 '15 02:11

I pushed a branch named gy-lr-test which uses the Scala library Breeze instead of Mahout. I made some changes to save memory and improve performance. With the entire URL Reputation dataset on the R730-02 machine (48-core CPU, 128 GB memory), the job took 263 seconds. This is the command I used for the test:

./run_logistic.sh -dim 3231961 -maxIter 20 -stepSize 1.0 -lambda 0.01 -local false -split 8 -input /total -output output_logistic -maxNumEvalLocal 5 -isDense false -evalSize 1000 -timeout 1200000

The final model accuracy was 93.421%.

gyeongin · Dec 15 '15 06:12

@gyeongin Thanks for sharing the result. This looks awesome! Could you give us a short summary of the changes you made for the better performance and memory usage?

yunseong · Dec 15 '15 07:12

@gyeongin Great!

johnyangk · Dec 15 '15 07:12

Changes I made:

  • Use Breeze instead of Mahout (Mahout does not support in-place updates)
  • Change vector computations to in-place updates, so that redundant new vectors are not allocated (see the sketch below)
  • Use DenseVector for the model and cumGradient (dv + sv is much faster than sv + sv)
  • Each ComputeTask updates the model only once per iteration
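A minimal sketch of what those changes amount to, assuming Breeze's axpy (y += a * x, in place); the names here are illustrative, not taken from the gy-lr-test branch:

```scala
import breeze.linalg.{DenseVector, SparseVector, axpy}

// Dense model and cumulative gradient, sparse per-example features.
// axpy updates the DenseVector in place, touching only the non-zero
// entries of the SparseVector and allocating no intermediate vectors.
def accumulateGradient(model: DenseVector[Double],
                       examples: Seq[(Double, SparseVector[Double])],
                       cumGradient: DenseVector[Double]): Unit = {
  for ((label, features) <- examples) {
    val prediction = 1.0 / (1.0 + math.exp(-(model dot features)))
    axpy(prediction - label, features, cumGradient) // dv += a * sv, in place
  }
}

// Apply the accumulated gradient to the model once per iteration.
def applyUpdate(model: DenseVector[Double],
                cumGradient: DenseVector[Double],
                stepSize: Double, numExamples: Int): Unit =
  axpy(-stepSize / numExamples, cumGradient, model)
```

Keeping the model and cumGradient dense avoids the index-merging cost of sparse-plus-sparse addition, which is presumably why dv + sv is so much faster than sv + sv here.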

gyeongin · Dec 15 '15 08:12