limit entries per reducer key in Spark runner
If an MRJob has a very large number of entries associated with the same reducer key, it can be difficult to run through the Spark runner, because all the entries for that key end up in the same partition, which Spark attempts to hold in memory.
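For concreteness, a toy PySpark illustration of the problem (not mrjob code; `sc` is an assumed live SparkContext, and the key names are made up):

```python
# toy illustration of the hot-key problem; assumes a live SparkContext `sc`
pairs = (sc.range(10 ** 8).map(lambda i: ('hot_key', i))
         .union(sc.parallelize([('ok_key', 1)])))

# groupByKey() gathers every value for a key into a single group, so one
# task ends up responsible for all 10**8 values of 'hot_key'
counts = pairs.groupByKey().mapValues(len)
print(counts.collect())
```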
One strategy is to set a hard limit on the number of values per reducer key, and simply discard the values for any key that has too many. We could keep all the values from being loaded into memory at once by sharding the partition, and replacing the values in any shard with a sentinel value if the shard has more values than the limit.
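Here's a rough PySpark sketch of that idea under one reading of "the limit" (each shard gets an even share of the overall per-key cap). This is not mrjob's actual implementation; `NUM_SHARDS`, `MAX_VALUES`, `cap_shard`, and the sentinel string are all names made up here:

```python
from random import randrange

NUM_SHARDS = 100                         # shards per reducer key
MAX_VALUES = 100000                      # hard limit on values per key
SHARD_LIMIT = MAX_VALUES // NUM_SHARDS   # each shard's share of the limit
TOO_MANY_VALUES = '__TOO_MANY_VALUES__'  # sentinel for an oversized shard


def cap_shard(values):
    """Materialize a shard's values, or the sentinel if there are too many."""
    capped = []
    for value in values:
        if len(capped) >= SHARD_LIMIT:
            return [TOO_MANY_VALUES]
        capped.append(value)
    return capped


def limit_values_per_key(rdd):
    """Group an RDD of (key, value) pairs by key, discarding any key
    that exceeds the limit."""
    return (
        rdd
        # scatter each key's values across NUM_SHARDS shards, so no single
        # group has to hold all of a hot key's values at once
        .map(lambda kv: ((kv[0], randrange(NUM_SHARDS)), kv[1]))
        .groupByKey()
        # cap each shard; oversized shards collapse to the sentinel
        .map(lambda kv: (kv[0][0], cap_shard(kv[1])))
        # re-group by the original key; each key now carries at most
        # NUM_SHARDS capped shards, so memory stays bounded
        .groupByKey()
        # drop keys where any shard hit the limit, flatten the rest
        .flatMap(lambda kv: [] if any(
            TOO_MANY_VALUES in shard for shard in kv[1]
        ) else [(kv[0], [v for shard in kv[1] for v in shard])])
    )
```

Since every shard is capped before the second `groupByKey()`, no step ever has to hold more than about `MAX_VALUES` values for one key (with the toy RDD above, `limit_values_per_key(pairs)` would drop `'hot_key'` and keep `'ok_key'`). The trade-off of the even split is that a key just under the limit can occasionally lose a shard to chance and be discarded anyway.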