limit entries per reducer key in Spark runner
If an MRJob has a very large number of entries associated with the same reducer key, it can be difficult to run through the Spark runner, because all the entries for that key end up in the same partition, which Spark attempts to hold in memory.
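For concreteness, a toy PySpark illustration of the problem (not mrjob code; `sc` is an assumed live SparkContext, and the key names are made up):

```python
# toy illustration of the hot-key problem; assumes a live SparkContext `sc`
pairs = (sc.range(10 ** 8).map(lambda i: ('hot_key', i))
         .union(sc.parallelize([('ok_key', 1)])))

# groupByKey() gathers every value for a key into a single group, so one
# task ends up responsible for all 10**8 values of 'hot_key'
counts = pairs.groupByKey().mapValues(len)
print(counts.collect())
```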
One strategy is to set a hard limit on the number of values per reducer key, and simply discard the values for any key that has too many. We could keep all the values from being loaded into memory at once by sharding the partition, and replacing the values in any shard with a sentinel value if the shard has more values than the limit.
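Here's a rough PySpark sketch of that idea under one reading of "the limit" (each shard gets an even share of the overall per-key cap). This is not mrjob's actual implementation; `NUM_SHARDS`, `MAX_VALUES`, `cap_shard`, and the sentinel string are all names made up here:

```python
from random import randrange

NUM_SHARDS = 100                         # shards per reducer key
MAX_VALUES = 100000                      # hard limit on values per key
SHARD_LIMIT = MAX_VALUES // NUM_SHARDS   # each shard's share of the limit
TOO_MANY_VALUES = '__TOO_MANY_VALUES__'  # sentinel for an oversized shard


def cap_shard(values):
    """Materialize a shard's values, or the sentinel if there are too many."""
    capped = []
    for value in values:
        if len(capped) >= SHARD_LIMIT:
            return [TOO_MANY_VALUES]
        capped.append(value)
    return capped


def limit_values_per_key(rdd):
    """Group an RDD of (key, value) pairs by key, discarding any key
    that exceeds the limit."""
    return (
        rdd
        # scatter each key's values across NUM_SHARDS shards, so no single
        # group has to hold all of a hot key's values at once
        .map(lambda kv: ((kv[0], randrange(NUM_SHARDS)), kv[1]))
        .groupByKey()
        # cap each shard; oversized shards collapse to the sentinel
        .map(lambda kv: (kv[0][0], cap_shard(kv[1])))
        # re-group by the original key; each key now carries at most
        # NUM_SHARDS capped shards, so memory stays bounded
        .groupByKey()
        # drop keys where any shard hit the limit, flatten the rest
        .flatMap(lambda kv: [] if any(
            TOO_MANY_VALUES in shard for shard in kv[1]
        ) else [(kv[0], [v for shard in kv[1] for v in shard])])
    )
```

Since every shard is capped before the second `groupByKey()`, no step ever has to hold more than about `MAX_VALUES` values for one key (with the toy RDD above, `limit_values_per_key(pairs)` would drop `'hot_key'` and keep `'ok_key'`). The trade-off of the even split is that a key just under the limit can occasionally lose a shard to chance and be discarded anyway.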