spark icon indicating copy to clipboard operation
spark copied to clipboard

Clean up DAGScheduler datastructures after job completes

Open jhartlaub opened this issue 12 years ago • 9 comments
trafficstars

I added some fixes to reduce heap/prevent memory leak we found after running a load test (thousands of jobs). There may be more leaks, but this fixes a big one.

jhartlaub avatar Jan 25 '13 05:01 jhartlaub

Good catch; the only thing I'm not 100% sure of is whether this will work when two active jobs depend on the same stage. (This can only happen if two threads submit a job at the same time, but that may occur in a multi-user system or in Spark Streaming.) I'll look into it more carefully, but if you've thought about this, let me know.

mateiz avatar Jan 25 '13 06:01 mateiz

This will remove all stages submitted under a single job - if another job comes along and depends on the same stages, it will not work. In our cases, we have not resubmitted the same RDD's multiple types (using shark we create new RDD's every time).

I can see now where getShuffleMapStage in wich a lookup matches a stage by an ID from the RDD. I think a per-job reference-count for the stage-related maps might work- seem plausible?

Thank you for looking at this, btw. You seem very busy if the mailing list is any indication.

-Jon

jhartlaub avatar Jan 25 '13 07:01 jhartlaub

Yeah, I think a reference count would be better. We should also make a test where multiple jobs depend on the same stage. For example:

val pairs = sc.parallelize(...)
val grouped = pairs.groupByKey()
grouped.filter(f1).count()
grouped.filter(f2).count()

Here both the actions should depend on that first shuffle map stage before the group-by.

The other thing I've been thinking about is whether we should just use WeakHashMap for most of the data structures. This way stages will only be kept for RDDs that the user program references somehow. I think this will be a fair amount of work but it might be the right thing in the long term.

mateiz avatar Jan 26 '13 21:01 mateiz

Can one of the admins verify this patch?

AmplabJenkins avatar Apr 04 '13 21:04 AmplabJenkins

I'm the Jenkins test bot for the UC, Berkeley AMPLab. I've noticed your pull request and will test it once an admin authorizes me to. Thanks for your submission!

AmplabJenkins avatar Apr 10 '13 20:04 AmplabJenkins

I'm the Jenkins test bot for the UC, Berkeley AMPLab. I've noticed your pull request and will test it once an admin authorizes me to. Thanks for your submission!

AmplabJenkins avatar Apr 18 '13 22:04 AmplabJenkins

Bump -- any progress on this one?

velvia avatar Jul 25 '13 16:07 velvia

I think this will have to be done after Spark 0.8, because of the reference-counting issue above. However, @markhamstra has also been looking at the reference counting for some of his own work, I believe.

mateiz avatar Jul 25 '13 17:07 mateiz

Thank you for your pull request. An admin will review this request soon.

AmplabJenkins avatar Aug 05 '13 21:08 AmplabJenkins