spark icon indicating copy to clipboard operation
spark copied to clipboard

Shuffle consolidation

Open jason-dai opened this issue 11 years ago • 20 comments

In Spark, it is common practice (and usually preferred) to launch a large number of small tasks, which unfortunately can create an even larger number of very small shuffle files – one of the major shuffle performance issues. To address this issue, this change will combine multiple such files (for the same reduce partition or bucket) into one single large shuffle file. Please see SPARK-751 for the detailed design.

jason-dai avatar Jul 03 '13 12:07 jason-dai

Thank you for your pull request. An admin will review this request soon.

AmplabJenkins avatar Jul 03 '13 12:07 AmplabJenkins

Hey Jason, I read the design document and this looks very good! Thanks for writing that up. I'm going to go over the code in detail in the next few days to see if I have comments.

mateiz avatar Jul 06 '13 19:07 mateiz

Jenkins, this is ok to test

mateiz avatar Jul 13 '13 23:07 mateiz

Thank you for submitting this pull request. Unfortunately, the automated tests for this request have failed. Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/199/

AmplabJenkins avatar Jul 13 '13 23:07 AmplabJenkins

Jenkins, retest this please

mateiz avatar Jul 14 '13 01:07 mateiz

Thank you for submitting this pull request. Unfortunately, the automated tests for this request have failed. Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/203/

AmplabJenkins avatar Jul 14 '13 01:07 AmplabJenkins

Matei - you can review this now but don't merge it. I talked to @jason-dai the other day and he said there were a couple more changes he wanted to add.

rxin avatar Jul 14 '13 05:07 rxin

I have actually made all the changes I want :-)

jason-dai avatar Jul 14 '13 10:07 jason-dai

Hey Jason,

I've looked through this in more detail now and made some comments, mostly about style. Overall it looks very good. Before merging it though I also want to test it on a cluster in various scenarios -- we're working on a performance test suite to validate these kinds of changes. But thanks for putting this together!

mateiz avatar Jul 18 '13 21:07 mateiz

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/244/

AmplabJenkins avatar Jul 22 '13 08:07 AmplabJenkins

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/246/

AmplabJenkins avatar Jul 22 '13 13:07 AmplabJenkins

Hey Jason, so FYI, some unit tests are legitimately failing with this (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/246/) -- it seems to be because the tests compare some arrays of Longs by equality instead of dealing with the individual members.

mateiz avatar Jul 22 '13 17:07 mateiz

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/300/

AmplabJenkins avatar Jul 26 '13 02:07 AmplabJenkins

I just tried to run this on a cluster and jobs failed with a fetch failure issue:

13/07/29 05:12:33 INFO cluster.ClusterTaskSetManager: Lost TID 415 (task 1.0:15)
13/07/29 05:12:33 INFO cluster.ClusterTaskSetManager: Loss was due to fetch failure from null

I can try and dig into this more tomorrow.

pwendell avatar Jul 29 '13 05:07 pwendell

We are setting up a cluster for large scale testing as well.

jason-dai avatar Jul 29 '13 06:07 jason-dai

Thank you for your pull request. An admin will review this request soon.

AmplabJenkins avatar Aug 05 '13 21:08 AmplabJenkins

Hey Jason, just curious, have you had a chance to look into this? We'd like to code-freeze Spark 0.8 soon (ideally at the end of this week), and it would be nice to get this into it. There can be some time for bug fixing after but we should make sure it's working well before we add it. Of course, if it's too early, we'll just push it to a future release.

mateiz avatar Aug 06 '13 01:08 mateiz

Hi Matei - sorry we haven't had enough time to look into this yet. Maybe we should push it to a future release, as we'll be working on the graphx performance in the next couple of weeks.

jason-dai avatar Aug 06 '13 02:08 jason-dai

No worries if it can't be done now, but I'm curious, have you guys been running with this code for a while, or are you still using the old version internally? Basically I'm wondering whether it requires a lot of testing or just a little. We can help with the testing too.

mateiz avatar Aug 06 '13 02:08 mateiz

Small scale testing works fine, but we ran into some wired failures in large scale testing and had not had enough time to look into it.

jason-dai avatar Aug 06 '13 02:08 jason-dai