axolotl
Support loading a local hf dataset with `load_dataset`
⚠️ Please check that this feature request hasn't been suggested before.
- [X] I searched previous Ideas in Discussions and didn't find any similar feature requests.
- [X] I searched previous Issues and didn't find any similar feature requests.
🔖 Feature description
Loading a local folder requires specifying the data files.
In some cases, users need to upload their dataset to the hf hub so that the files and pre-processing steps can be completed properly.
In practice, the load_dataset(...) function can be used on a local folder (instead of load_from_disk(...)) to avoid the upload step. This is especially useful if there is a custom dataset config file inside the folder.
✔️ Solution
Check this line and add some logic for the load_dataset(...) function to be used.
Note:
load_from_disk is only intended to be used on directories created with Dataset.save_to_disk or DatasetDict.save_to_disk. load_dataset(...) has no such restriction.
Edit:
Calling load_from_disk on a directory that was not created with Dataset.save_to_disk or DatasetDict.save_to_disk raises this error:
FileNotFoundError: Directory is neither a `Dataset` directory nor a `DatasetDict` directory.
This is not the case with load_dataset(...).
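To illustrate the kind of logic this request has in mind, here is a minimal sketch, not axolotl's actual code. The helper name `load_local_dataset` and its `from_disk` / `from_files` parameters are hypothetical; they default to `datasets.load_from_disk` and `datasets.load_dataset`. It tries `load_from_disk` first and falls back to `load_dataset` when the `FileNotFoundError` quoted above is raised.

```python
def load_local_dataset(path, from_disk=None, from_files=None):
    """Load a local dataset folder, whatever way it was created.

    Hypothetical helper for illustration only. The two loaders are
    injectable so the fallback logic can be exercised without the
    `datasets` package; by default they are the real library calls.
    """
    if from_disk is None or from_files is None:
        import datasets  # imported lazily, only when defaults are needed

        from_disk = from_disk or datasets.load_from_disk
        from_files = from_files or datasets.load_dataset

    try:
        # Works only for folders written by Dataset.save_to_disk
        # or DatasetDict.save_to_disk.
        return from_disk(path)
    except FileNotFoundError:
        # Any other folder (e.g. a cloned hub dataset repo with
        # parquet/json files or a custom config) goes through load_dataset.
        return from_files(path)
```

With the `datasets` package installed, `load_local_dataset("alpaca_2k_test")` would be expected to take the fallback branch for a cloned hub repository. Note that a genuinely missing path also raises `FileNotFoundError` in the first call, so a real implementation would probably want to check that the directory exists beforehand.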
❓ Alternatives
No response
📝 Additional Context
No response
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this feature has not been requested yet.
- [X] I have provided enough information for the maintainers to understand and evaluate this request.
Can you recommend a way to reproduce this? What files should I be putting into the directory? Thanks!
Here is a way to reproduce, @winglian:
git clone https://huggingface.co/datasets/mhenrichsen/alpaca_2k_test
Modify the script to load the folder locally:
datasets:
  - path: alpaca_2k_test
    type: alpaca
Returns:
FileNotFoundError: Directory alpaca_2k_test is neither a `Dataset` directory nor a `DatasetDict` directory.
But works with:
ds = load_dataset(
    ds_type,
    name=config_dataset.name,
    data_files=config_dataset.path,
    streaming=False,
    split=None,
)
Hey, this should be closed I guess.
I've tried loading a local hf dataset (both parquet and json) and it worked smoothly! <3