LLM-Finetuning-Toolkit
[Dataset] Dataset Generation Always Returns Cached Version
Describe the bug
At dataset creation, the toolkit always loads the cached version of the dataset, even after the source file has changed.
To Reproduce
- Run toolkit.py
- Press Ctrl-C
- Add a line to the dataset file
- Run toolkit.py again; it will not create a new dataset with the desired changes
Expected behavior
- The dataset is regenerated with the new data
Environment:
- OS: Ubuntu
This is caused by the Hugging Face Dataset.from_generator() method, which checks whether the dataset is already cached and reuses that copy if it is. See code.
The easiest solution is to pass a cache_dir parameter (like ./dataset_cache) in each Ingestor class, for example here.
That way, whenever the local file changes, the user can delete the cache directory ./dataset_cache to force regeneration.
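A minimal sketch of that workaround, not the toolkit's actual Ingestor code. The names read_rows, build_dataset, and clear_dataset_cache are hypothetical, and the JSON Lines input format is an assumption made for illustration. Only Dataset.from_generator() and its cache_dir and gen_kwargs parameters come from the datasets library.

```python
import json
import shutil
from pathlib import Path

CACHE_DIR = "./dataset_cache"  # location suggested in this issue


def read_rows(path):
    """Yield one dict per non-empty line of a JSON Lines file (assumed format)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def build_dataset(path, cache_dir=CACHE_DIR):
    """Build a Dataset whose cache files are written under cache_dir."""
    # Imported lazily so clear_dataset_cache() works without `datasets` installed.
    from datasets import Dataset

    return Dataset.from_generator(
        read_rows,
        gen_kwargs={"path": str(path)},
        cache_dir=cache_dir,
    )


def clear_dataset_cache(cache_dir=CACHE_DIR):
    """Delete the cache directory; return True if a directory was removed."""
    cache_path = Path(cache_dir)
    if cache_path.is_dir():
        shutil.rmtree(cache_path)
        return True
    return False
```

With this layout, calling clear_dataset_cache() after editing the data file and before build_dataset() should force the dataset to be generated again, since the cached files no longer exist.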
Future Enhancement
- Perhaps we can have a config no_cache under config.data, and the toolkit will go ahead and delete the ./dataset_cache directory
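One way the proposed flag could behave, as a sketch only. The no_cache key does not exist in the toolkit yet, prepare_cache_dir is a hypothetical helper, and config.data is treated here as a plain dict, which is an assumption about how the config is loaded.

```python
import shutil
from pathlib import Path


def prepare_cache_dir(data_config, cache_dir="./dataset_cache"):
    """Return the cache directory, wiping it first when no_cache is set."""
    cache_path = Path(cache_dir)
    # Hypothetical flag: config.data.no_cache, defaulting to False.
    if data_config.get("no_cache", False) and cache_path.is_dir():
        shutil.rmtree(cache_path)
    # Recreate the directory so later code can always write to it.
    cache_path.mkdir(parents=True, exist_ok=True)
    return cache_path
```

The toolkit would call this once before building the dataset and pass the returned path on as cache_dir. Defaulting the flag to False keeps the current caching behavior for users who do not set it.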