
Optimize dedup to avoid OOM

Open coolderli opened this issue 1 year ago • 0 comments

  • distinct() already materializes the data through a shuffle, and downstream stages read from the shuffle output, so the explicit cache is no longer needed.
  • Use count() instead of collect() to avoid OOM on the driver.
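
A minimal sketch of the two points above, assuming the dedup step runs on a PySpark RDD. The helper name `dedup_edge_count`, the `demo` function, and the sample edge tuples are hypothetical and are not taken from the data-juicer code base:

```python
def dedup_edge_count(edges_rdd):
    """Return the number of distinct rows without pulling them to the driver.

    `edges_rdd` is assumed to be a pyspark RDD; the helper only relies on
    its distinct() and count() methods. Hypothetical, for illustration.
    """
    # distinct() shuffles the data, and later stages read the shuffle
    # output, so no explicit .cache() is added after it.
    distinct_edges = edges_rdd.distinct()
    # count() is computed on the executors and returns a single integer.
    # len(distinct_edges.collect()) would instead ship every row to the
    # driver, which is where the OOM risk comes from.
    return distinct_edges.count()


def demo():
    # Imported lazily so the helper above does not require Spark to load.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("dedup-sketch")
        .getOrCreate()
    )
    try:
        edges = spark.sparkContext.parallelize([(1, 2), (1, 2), (2, 3)])
        return dedup_edge_count(edges)
    finally:
        spark.stop()
```

Only the row count crosses from the executors to the driver here, so driver memory use stays constant regardless of how many rows survive deduplication.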

coolderli, Feb 07 '25 03:02