spark
spark copied to clipboard
Shuffle consolidation
In Spark, it is common practice (and usually preferred) to launch a large number of small tasks, which unfortunately can create an even larger number of very small shuffle files – one of the major shuffle performance issues. To address this issue, this change will combine multiple such files (for the same reduce partition or bucket) into one single large shuffle file. Please see SPARK-751 for the detailed design.
Thank you for your pull request. An admin will review this request soon.
Hey Jason, I read the design document and this looks very good! Thanks for writing that up. I'm going to go over the code in detail in the next few days to see if I have comments.
Jenkins, this is ok to test
Thank you for submitting this pull request. Unfortunately, the automated tests for this request have failed. Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/199/
Jenkins, retest this please
Thank you for submitting this pull request. Unfortunately, the automated tests for this request have failed. Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/203/
Matei - you can review this now but don't merge it. I talked to @jason-dai the other day and he said there were a couple more changes he wanted to add.
I have actually made all the changes I want :-)
Hey Jason,
I've looked through this in more detail now and made some comments, mostly about style. Overall it looks very good. Before merging it though I also want to test it on a cluster in various scenarios -- we're working on a performance test suite to validate these kinds of changes. But thanks for putting this together!
Thank you for submitting this pull request.
Unfortunately, the automated tests for this request have failed.
Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/244/
Thank you for submitting this pull request.
Unfortunately, the automated tests for this request have failed.
Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/246/
Hey Jason, so FYI, some unit tests are legitimately failing with this (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/246/) -- it seems to be because the tests compare some arrays of Longs by equality instead of dealing with the individual members.
Thank you for submitting this pull request.
Unfortunately, the automated tests for this request have failed.
Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/300/
I just tried to run this on a cluster and jobs failed with a fetch failure issue:
13/07/29 05:12:33 INFO cluster.ClusterTaskSetManager: Lost TID 415 (task 1.0:15)
13/07/29 05:12:33 INFO cluster.ClusterTaskSetManager: Loss was due to fetch failure from null
I can try and dig into this more tomorrow.
We are setting up a cluster for large scale testing as well.
Thank you for your pull request. An admin will review this request soon.
Hey Jason, just curious, have you had a chance to look into this? We'd like to code-freeze Spark 0.8 soon (ideally at the end of this week), and it would be nice to get this into it. There can be some time for bug fixing after but we should make sure it's working well before we add it. Of course, if it's too early, we'll just push it to a future release.
Hi Matei - sorry we haven't had enough time to look into this yet. Maybe we should push it to a future release, as we'll be working on the graphx performance in the next couple of weeks.
No worries if it can't be done now, but I'm curious, have you guys been running with this code for a while, or are you still using the old version internally? Basically I'm wondering whether it requires a lot of testing or just a little. We can help with the testing too.
Small scale testing works fine, but we ran into some wired failures in large scale testing and had not had enough time to look into it.