data-juicer
data-juicer copied to clipboard
Optimize dedup to avoid oom
distinct()will storage the data, and the downstream will read from the shuffle. We do not need the cache any more.- Use the
count()to instead thecollect()to avoid the drive OOM.