
Sparse matrix-vector product benchmark (in comparison with Spark and Dask)

Open • pcmoritz opened this issue on May 24, 2016 • 3 comments

The numbers on the 3Mx3M test matrix from https://snap.stanford.edu/data/com-Orkut.html look like this:

scipy.sparse single threaded: 1.2s
halo (1 node, 4 workers): 600ms
halo (2 nodes, 4 workers each): 430ms
dask (4 workers): 1.0s
dask.distributed (1 node, 4 workers): 11s
dask.distributed (2 nodes, 4 workers each): 8.1s

Distributed Dask presumably does not perform well because it does not have an object store in which the sparse matrix blocks can be kept. The single-node version of Dask does not need to perform serialization, but it is limited by the Python GIL.
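Roughly, the computation being benchmarked is a row-blocked matvec; a minimal single-process sketch of that pattern (illustrative only, not the exact benchmark code) looks like this:

```python
# Minimal sketch of the row-blocked sparse matvec pattern (illustrative
# only, not the exact benchmark code). Each row block computes a partial
# product; in halo/dask each per-block call runs as a task on a worker,
# which is why shipping/serializing the sparse blocks matters.
import numpy as np
import scipy.sparse as sp

def blocked_matvec(A, x, num_blocks=4):
    n = A.shape[0]
    bounds = np.linspace(0, n, num_blocks + 1, dtype=int)
    blocks = [A[bounds[i]:bounds[i + 1], :] for i in range(num_blocks)]
    partials = [block.dot(x) for block in blocks]
    return np.concatenate(partials)

# Small random stand-in for the com-Orkut adjacency matrix:
A = sp.random(10000, 10000, density=1e-3, format="csr")
x = np.random.rand(10000)
assert np.allclose(blocked_matvec(A, x), A.dot(x))
```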

For PySpark, the full matrix gave a serialization error; using a 2Mx2M matrix gives:

scipy.sparse single threaded: 0.76s
spark (1 node, 4 workers): 1.41s
spark (2 nodes, 4 workers each): 1.56s
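For context, one way the Spark variant can be expressed (an illustrative sketch with made-up sizes and names, not the exact benchmark script): the row blocks are distributed as an RDD, the dense vector is broadcast, and each worker multiplies its local block.

```python
# Illustrative PySpark sketch of the blocked matvec (made-up sizes and
# app name; not the exact benchmark script).
import numpy as np
import scipy.sparse as sp
from pyspark import SparkContext

sc = SparkContext(appName="spmv-benchmark")  # assumed app name

n, num_blocks = 100000, 8
A = sp.random(n, n, density=1e-4, format="csr")  # stand-in for the real matrix
x = np.random.rand(n)

bounds = np.linspace(0, n, num_blocks + 1, dtype=int)
blocks = list(enumerate(A[bounds[i]:bounds[i + 1], :] for i in range(num_blocks)))

# Each block gets pickled and shipped to a worker; for the full 3Mx3M
# matrix, this serialization step is where things broke.
x_b = sc.broadcast(x)
partials = (sc.parallelize(blocks, num_blocks)
              .map(lambda ib: (ib[0], ib[1].dot(x_b.value)))
              .collect())
y = np.concatenate([p for _, p in sorted(partials, key=lambda ip: ip[0])])
assert np.allclose(y, A.dot(x))
```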

Before this is merged, we should check with the author of Dask that there is not a more efficient way to implement these operations.

pcmoritz avatar May 24 '16 01:05 pcmoritz

Nice numbers! I assume each node has at least four hardware threads?

Is it clear why the speed-up is roughly 3x in total instead of 8x (total number of cores)?

And I guess halo is the new name for Orchestra? :-)

ludwigschmidt avatar May 25 '16 02:05 ludwigschmidt

Also curious. :)

cathywu avatar May 25 '16 02:05 cathywu

Hey Cathy + Ludwig,

Glad to hear from you! Scaling up sparse linear algebra on non-MPI systems is challenging because each task is typically very small (in this case, on the order of a few milliseconds).

This was the first experiment where we got a speedup for sparse linear algebra on multiple nodes using Halo. Now that we understand better where the bottlenecks are (mainly the synchronous gRPC calls to the scheduler), we are going to address them in the next development iteration.
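A crude back-of-envelope model of that effect (purely illustrative numbers, not measurements): when each block matvec only takes a few milliseconds, a synchronous scheduler round trip per task quickly eats into the parallel speedup.

```python
# Crude cost model (illustrative numbers only): each task pays a serial
# scheduling round trip before its compute can run in parallel.
def modeled_time_ms(num_tasks, task_ms, sched_ms, num_workers):
    return num_tasks * sched_ms + (num_tasks * task_ms) / num_workers

# e.g. 64 block matvecs of ~5 ms each on 8 workers:
for sched_ms in [0.1, 1.0, 5.0]:
    print(sched_ms, "->", modeled_time_ms(64, 5.0, sched_ms, 8), "ms")
```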

Best, Philipp.

pcmoritz avatar May 25 '16 21:05 pcmoritz