spark Shuffle consolidation

In Spark, it is common practice (and usually preferred) to launch a large number of small tasks, which unfortunately can create an even larger number of very small shuffle files – one of the major shuffle performance issues. To address this issue, this change will combine multiple such files (for the same reduce partition or bucket) into one single large shuffle file. Please see SPARK-751 for the detailed design.

Jul 03 '13 12:07 jason-dai

Thank you for your pull request. An admin will review this request soon.

Jul 03 '13 12:07 AmplabJenkins

Hey Jason, I read the design document and this looks very good! Thanks for writing that up. I'm going to go over the code in detail in the next few days to see if I have comments.

Jul 06 '13 19:07 mateiz

Jenkins, this is ok to test

Jul 13 '13 23:07 mateiz

Thank you for submitting this pull request. Unfortunately, the automated tests for this request have failed. Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/199/

Jul 13 '13 23:07 AmplabJenkins

Jenkins, retest this please

Jul 14 '13 01:07 mateiz

Thank you for submitting this pull request. Unfortunately, the automated tests for this request have failed. Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/203/

Jul 14 '13 01:07 AmplabJenkins

Matei - you can review this now but don't merge it. I talked to @jason-dai the other day and he said there were a couple more changes he wanted to add.

Jul 14 '13 05:07 rxin

I have actually made all the changes I want :-)

Jul 14 '13 10:07 jason-dai

Hey Jason,

I've looked through this in more detail now and made some comments, mostly about style. Overall it looks very good. Before merging it though I also want to test it on a cluster in various scenarios -- we're working on a performance test suite to validate these kinds of changes. But thanks for putting this together!

Jul 18 '13 21:07 mateiz

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/244/

Jul 22 '13 08:07 AmplabJenkins

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/246/

Jul 22 '13 13:07 AmplabJenkins

Hey Jason, so FYI, some unit tests are legitimately failing with this (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/246/) -- it seems to be because the tests compare some arrays of Longs by equality instead of dealing with the individual members.

Jul 22 '13 17:07 mateiz

Thank you for submitting this pull request.

Unfortunately, the automated tests for this request have failed.

Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/300/

Jul 26 '13 02:07 AmplabJenkins

I just tried to run this on a cluster and jobs failed with a fetch failure issue:

13/07/29 05:12:33 INFO cluster.ClusterTaskSetManager: Lost TID 415 (task 1.0:15)
13/07/29 05:12:33 INFO cluster.ClusterTaskSetManager: Loss was due to fetch failure from null

I can try and dig into this more tomorrow.

Jul 29 '13 05:07 pwendell

We are setting up a cluster for large scale testing as well.

Jul 29 '13 06:07 jason-dai

Thank you for your pull request. An admin will review this request soon.

Aug 05 '13 21:08 AmplabJenkins

Hey Jason, just curious, have you had a chance to look into this? We'd like to code-freeze Spark 0.8 soon (ideally at the end of this week), and it would be nice to get this into it. There can be some time for bug fixing after but we should make sure it's working well before we add it. Of course, if it's too early, we'll just push it to a future release.

Aug 06 '13 01:08 mateiz

Hi Matei - sorry we haven't had enough time to look into this yet. Maybe we should push it to a future release, as we'll be working on the graphx performance in the next couple of weeks.

Aug 06 '13 02:08 jason-dai

No worries if it can't be done now, but I'm curious, have you guys been running with this code for a while, or are you still using the old version internally? Basically I'm wondering whether it requires a lot of testing or just a little. We can help with the testing too.

Aug 06 '13 02:08 mateiz

Small scale testing works fine, but we ran into some wired failures in large scale testing and had not had enough time to look into it.

Aug 06 '13 02:08 jason-dai

spark spark copied to clipboard

Shuffle consolidation

spark
spark copied to clipboard