spark Clean up DAGScheduler datastructures after job completes

trafficstars

I added some fixes to reduce heap/prevent memory leak we found after running a load test (thousands of jobs). There may be more leaks, but this fixes a big one.

Jan 25 '13 05:01 jhartlaub

Good catch; the only thing I'm not 100% sure of is whether this will work when two active jobs depend on the same stage. (This can only happen if two threads submit a job at the same time, but that may occur in a multi-user system or in Spark Streaming.) I'll look into it more carefully, but if you've thought about this, let me know.

Jan 25 '13 06:01 mateiz

This will remove all stages submitted under a single job - if another job comes along and depends on the same stages, it will not work. In our cases, we have not resubmitted the same RDD's multiple types (using shark we create new RDD's every time).

I can see now where getShuffleMapStage in wich a lookup matches a stage by an ID from the RDD. I think a per-job reference-count for the stage-related maps might work- seem plausible?

Thank you for looking at this, btw. You seem very busy if the mailing list is any indication.

-Jon

Jan 25 '13 07:01 jhartlaub

Yeah, I think a reference count would be better. We should also make a test where multiple jobs depend on the same stage. For example:

val pairs = sc.parallelize(...)
val grouped = pairs.groupByKey()
grouped.filter(f1).count()
grouped.filter(f2).count()

Here both the actions should depend on that first shuffle map stage before the group-by.

The other thing I've been thinking about is whether we should just use WeakHashMap for most of the data structures. This way stages will only be kept for RDDs that the user program references somehow. I think this will be a fair amount of work but it might be the right thing in the long term.

Jan 26 '13 21:01 mateiz

Can one of the admins verify this patch?

Apr 04 '13 21:04 AmplabJenkins

I'm the Jenkins test bot for the UC, Berkeley AMPLab. I've noticed your pull request and will test it once an admin authorizes me to. Thanks for your submission!

Apr 10 '13 20:04 AmplabJenkins

I'm the Jenkins test bot for the UC, Berkeley AMPLab. I've noticed your pull request and will test it once an admin authorizes me to. Thanks for your submission!

Apr 18 '13 22:04 AmplabJenkins

Bump -- any progress on this one?

Jul 25 '13 16:07 velvia

I think this will have to be done after Spark 0.8, because of the reference-counting issue above. However, @markhamstra has also been looking at the reference counting for some of his own work, I believe.

Jul 25 '13 17:07 mateiz

Thank you for your pull request. An admin will review this request soon.

Aug 05 '13 21:08 AmplabJenkins

spark spark copied to clipboard

Clean up DAGScheduler datastructures after job completes

spark
spark copied to clipboard