
Use `overwrite=True` when writing out intermediate files

Open praateekmahajan opened this issue 1 year ago • 0 comments

Is your feature request related to a problem? Please describe. Modules such as FuzzyDedup write intermediate files to a cache_dir. If the user provides a cache_dir that a previous run also used and we don't write with overwrite enabled, the new run might accidentally read leftover files from the previous write.

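Describe the solution you'd like Specify `overwrite=True` when writing intermediate files to the cache_dir, so that files left over from a previous run are replaced rather than kept alongside the new output.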
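To illustrate the failure mode and the proposed fix, here is a minimal standard-library sketch. It is not NeMo-Curator code: `write_partitions`, `read_partitions`, and the `part.<i>.txt` file layout are hypothetical names made up for this example, and the `overwrite` flag is assumed to mean "clear the target directory before writing".

```python
import shutil
import tempfile
from pathlib import Path


def write_partitions(cache_dir, partitions, overwrite=False):
    """Write each partition to cache_dir/part.<i>.txt (hypothetical helper)."""
    out = Path(cache_dir)
    if overwrite and out.exists():
        # Assumed semantics: drop everything a previous run left behind.
        shutil.rmtree(out)
    out.mkdir(parents=True, exist_ok=True)
    for i, rows in enumerate(partitions):
        (out / f"part.{i}.txt").write_text("\n".join(rows))


def read_partitions(cache_dir):
    """Read every part file in cache_dir, as a downstream step would."""
    files = sorted(Path(cache_dir).glob("part.*.txt"))
    return [f.read_text().split("\n") for f in files]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        cache_dir = Path(tmp) / "cache"

        # First run writes three partitions.
        write_partitions(cache_dir, [["a"], ["b"], ["c"]])

        # Second run writes only two; part.2.txt from the first run survives.
        write_partitions(cache_dir, [["x"], ["y"]])
        stale = read_partitions(cache_dir)

        # Same second run with overwrite=True: the leftover file is gone.
        write_partitions(cache_dir, [["x"], ["y"]], overwrite=True)
        clean = read_partitions(cache_dir)

    return stale, clean
```

Calling `demo()` returns two lists: without overwrite the reader still picks up the third partition from the earlier run, and with `overwrite=True` it sees only what the latest run wrote.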

Describe alternatives you've considered The current approach requires the user to either clean up the cache before each run or specify a new cache_dir.

Additional context Image

praateekmahajan Oct 18 '24 23:10