NeMo-Curator
Use `overwrite=True` when writing out intermediate files
Is your feature request related to a problem? Please describe.
When a module such as FuzzyDedup is given a cache_dir that was already used by a previous run, and intermediate files are written without overwrite, the user might accidentally read leftover files from the previous write.
Describe the solution you'd like
Specify overwrite=True when writing intermediate files to the cache_dir.
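To illustrate the failure mode and the proposed fix, here is a minimal standard-library sketch. `write_partitions` and `read_partitions` are hypothetical helpers made up for this example, not NeMo-Curator functions, and the text files stand in for whatever intermediate format the modules actually write. Whether the real writes can simply pass an `overwrite` flag through to the underlying writer (Dask's `to_parquet` accepts one, for example) is an assumption to check against the code.

```python
import shutil
import tempfile
from pathlib import Path


def write_partitions(cache_dir, partitions, overwrite=False):
    """Hypothetical helper: write one file per partition into cache_dir."""
    out = Path(cache_dir)
    if overwrite and out.exists():
        shutil.rmtree(out)  # drop everything left behind by a previous run
    out.mkdir(parents=True, exist_ok=True)
    for i, rows in enumerate(partitions):
        (out / f"part.{i}.txt").write_text("\n".join(rows))


def read_partitions(cache_dir):
    """Read back every partition file found in cache_dir, in name order."""
    files = sorted(Path(cache_dir).glob("part.*.txt"))
    return [p.read_text().split("\n") for p in files]


cache_dir = Path(tempfile.mkdtemp()) / "cache"

# The first run writes three partitions; the second run writes only one.
write_partitions(cache_dir, [["a"], ["b"], ["c"]])
write_partitions(cache_dir, [["x"]])
# Without overwrite, part.1 and part.2 from the first run are still present.
stale_read = read_partitions(cache_dir)

# With overwrite=True the directory is cleared before writing.
write_partitions(cache_dir, [["x"]], overwrite=True)
clean_read = read_partitions(cache_dir)

print(stale_read)
print(clean_read)
```

In this sketch the second write replaces only the first partition, so a reader that globs the directory mixes data from both runs. Clearing the directory as part of the write removes that dependency on the user cleaning up manually.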
Describe alternatives you've considered
The current approach requires the user to either clean up the cache before each run or specify a new cache_dir.
Additional context