
[Dataset] Dataset Generation Always Returns Cached Version

Open • benjaminye opened this issue 1 year ago • 1 comment

Describe the bug

When a dataset is created, the toolkit always loads the cached version, even after the source file has changed.

To Reproduce

  1. Run toolkit.py
  2. Ctrl-C
  3. Add a line to the dataset file
  4. Run toolkit.py again; it does not create a new dataset with the desired changes

Expected behavior

  1. The dataset is regenerated with the new data

Environment:

  • OS: Ubuntu

benjaminye avatar Mar 27 '24 20:03 benjaminye

This is caused by the Hugging Face Dataset.from_generator() method, which checks whether the dataset is already cached. See code.

The easiest solution is to pass a cache_dir parameter (such as ./dataset_cache) in each Ingestor class, for example here.

That way, whenever the local file changes, the user can delete the cache directory ./dataset_cache.
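A minimal sketch of the cache_dir idea, not the toolkit's actual Ingestor code. The names read_rows and build_dataset are hypothetical, and the CSV input is an assumption made for illustration; the only library call relied on is Dataset.from_generator() with its gen_kwargs and cache_dir parameters.

```python
import csv
from pathlib import Path

# Cache location suggested in this thread.
CACHE_DIR = Path("./dataset_cache")


def read_rows(path):
    """Hypothetical generator: yield one dict per row of a CSV file."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)


def build_dataset(path, cache_dir=CACHE_DIR):
    """Hypothetical helper: build a dataset whose cache lives in cache_dir."""
    # Imported lazily so read_rows stays usable without `datasets` installed.
    from datasets import Dataset

    return Dataset.from_generator(
        read_rows,
        gen_kwargs={"path": str(path)},
        cache_dir=str(cache_dir),  # keeps cached files in a known, deletable place
    )
```

With the cache in a known directory, deleting ./dataset_cache after editing the file forces the next build_dataset() call to regenerate the dataset.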


Future Enhancement

  • Perhaps we can add a no_cache option under config.data; when it is set, the toolkit would delete the ./dataset_cache directory before generating the dataset
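One way the no_cache idea could look, as a rough sketch only. The function name clear_dataset_cache and the dict-shaped config are assumptions; the toolkit's real config object may differ.

```python
import shutil
from pathlib import Path


def clear_dataset_cache(config, cache_dir="./dataset_cache"):
    """Delete cache_dir when config["data"]["no_cache"] is truthy.

    Returns True if a directory was removed, False otherwise.
    `config` is assumed to be a plain nested dict (hypothetical shape).
    """
    no_cache = config.get("data", {}).get("no_cache", False)
    path = Path(cache_dir)
    if no_cache and path.is_dir():
        shutil.rmtree(path)  # removes the directory and everything inside it
        return True
    return False
```

Calling this before dataset creation would mean a run with no_cache set always starts from an empty cache, at the cost of regenerating the dataset every time.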

benjaminye avatar Mar 27 '24 20:03 benjaminye