Support loading a local hf dataset with `load_dataset`

Open ccdv-ai opened this issue 1 year ago • 2 comments

⚠️ Please check that this feature request hasn't been suggested before.

  • [X] I searched previous Ideas in Discussions and didn't find any similar feature requests.
  • [X] I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

Loading a local folder requires specifying the data files.

In some cases, users need to upload their dataset to the hf hub so that the files and pre-processing steps can be completed properly.

In practice, the load_dataset(...) function (instead of load_from_disk(...)) can be used on a local folder to avoid the upload step. This is especially useful if there is a custom dataset config file inside the folder.

✔️ Solution

Check this line and add some logic for the load_dataset(...) function to be used.

Note: load_from_disk is only intended to be used on directories created with Dataset.save_to_disk or DatasetDict.save_to_disk; load_dataset(...) has no such restriction.

Edit: Using load_from_disk on a directory that was not created with Dataset.save_to_disk or DatasetDict.save_to_disk leads to this error: FileNotFoundError: Directory is neither a `Dataset` directory nor a `DatasetDict` directory.

This is not the case with load_dataset(...).
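
A minimal sketch of the kind of logic this could be, assuming the check is based on the marker files that save_to_disk writes (dataset_info.json plus state.json for a Dataset, dataset_dict.json for a DatasetDict). The names is_save_to_disk_dir and load_local are hypothetical and are not part of axolotl or datasets:

```python
import os


def is_save_to_disk_dir(path: str) -> bool:
    """Return True if `path` looks like the output of save_to_disk."""
    # Dataset.save_to_disk writes both dataset_info.json and state.json;
    # DatasetDict.save_to_disk writes dataset_dict.json.
    is_dataset = all(
        os.path.isfile(os.path.join(path, name))
        for name in ("dataset_info.json", "state.json")
    )
    is_dataset_dict = os.path.isfile(os.path.join(path, "dataset_dict.json"))
    return is_dataset or is_dataset_dict


def load_local(path: str):
    """Hypothetical helper: choose the loader based on the directory layout."""
    # Imported lazily so the layout check above works without `datasets` installed.
    from datasets import load_dataset, load_from_disk

    if is_save_to_disk_dir(path):
        return load_from_disk(path)
    # A plain folder of data files (e.g. a cloned hub repo) goes through load_dataset.
    return load_dataset(path)
```

With something like this, a folder cloned from the hub would fall through to load_dataset(...) instead of hitting the FileNotFoundError above.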

❓ Alternatives

No response

📝 Additional Context

No response

Acknowledgements

  • [X] My issue title is concise, descriptive, and in title casing.
  • [X] I have searched the existing issues to make sure this feature has not been requested yet.
  • [X] I have provided enough information for the maintainers to understand and evaluate this request.

ccdv-ai avatar Jun 22 '24 09:06 ccdv-ai

Can you recommend a way to reproduce this? What files should I be putting into the directory? Thanks!

winglian avatar Jun 24 '24 04:06 winglian

Here is a way to reproduce, @winglian:

git clone https://huggingface.co/datasets/mhenrichsen/alpaca_2k_test

Modify the script to load the folder locally:

datasets:
  - path: alpaca_2k_test
    type: alpaca

This raises: FileNotFoundError: Directory alpaca_2k_test is neither a `Dataset` directory nor a `DatasetDict` directory.

But works with:

ds = load_dataset(
    ds_type,
    name=config_dataset.name,
    data_files=config_dataset.path,
    streaming=False,
    split=None,
)
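
For anyone trying this outside axolotl, here is a rough standalone sketch of the same call. It assumes ds_type is a packaged datasets builder name ("json", "csv", "parquet", ...) picked from the file extension. The helper names infer_ds_type and load_local_file and the extension table are illustrative assumptions, not axolotl's actual code:

```python
from pathlib import Path

# Packaged `datasets` builders keyed by file extension (illustrative subset).
_EXT_TO_BUILDER = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
    ".txt": "text",
}


def infer_ds_type(data_file: str, default: str = "json") -> str:
    """Hypothetical helper: guess the builder name from a file extension."""
    return _EXT_TO_BUILDER.get(Path(data_file).suffix.lower(), default)


def load_local_file(data_file: str):
    """Load one local data file with the inferred builder."""
    from datasets import load_dataset  # lazy import; requires `datasets`

    # split=None returns every split the builder produces.
    return load_dataset(infer_ds_type(data_file), data_files=data_file, split=None)
```

The "json" fallback for unknown extensions is just a guess at a sensible default.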

ccdv-ai avatar Jun 24 '24 08:06 ccdv-ai

Hey, this should be closed I guess.

I've tried loading a local hf dataset (both parquet and json) and it worked smoothly! <3

mertbozkir avatar Aug 14 '25 12:08 mertbozkir