`load_dataset` fails to load dataset saved by `save_to_disk`
### Describe the bug
This code fails to load the dataset it just saved:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL = "google-bert/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

dataset = load_dataset("yelp_review_full")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.save_to_disk("dataset")
tokenized_datasets = load_dataset("dataset/")  # raises
```

It raises:

```
ValueError: Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('arrow', {}), NamedSplit('test'): ('json', {})}.
```
I believe this bug is caused by the logic that infers the dataset format: it picks the most common file extension. However, a small split can fit in a single `.arrow` file alongside two JSON metadata files, so the format is inferred as JSON:
```
$ ls -l dataset/test
-rw-r--r-- 1 sliedes sliedes 191498784 Jul  1 13:55 data-00000-of-00001.arrow
-rw-r--r-- 1 sliedes sliedes      1730 Jul  1 13:55 dataset_info.json
-rw-r--r-- 1 sliedes sliedes       249 Jul  1 13:55 state.json
```
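A minimal sketch of why majority voting on extensions misfires here (`infer_format_by_extension` is a hypothetical helper; the actual inference logic in `datasets` is more involved):

```python
from collections import Counter
from pathlib import Path

def infer_format_by_extension(files):
    """Hypothetical sketch: infer the split's format as the most
    common file extension in its directory."""
    counts = Counter(Path(f).suffix.lstrip(".") for f in files)
    return counts.most_common(1)[0][0]

# One .arrow data file plus two .json metadata files:
# the majority vote lands on "json", not "arrow".
files = ["data-00000-of-00001.arrow", "dataset_info.json", "state.json"]
print(infer_format_by_extension(files))  # -> json
```

With two or more `.arrow` shards the vote flips back to `arrow`, which matches the behavior reported below.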
### Steps to reproduce the bug
Execute the code above.
### Expected behavior
The dataset is loaded successfully.
### Environment info
- `datasets` version: 2.20.0
- Platform: Linux-6.9.3-arch1-1-x86_64-with-glibc2.39
- Python version: 3.12.4
- `huggingface_hub` version: 0.23.4
- PyArrow version: 16.1.0
- Pandas version: 2.2.2
- `fsspec` version: 2024.5.0
In my case the error was:

```
ValueError: You are trying to load a dataset that was saved using `save_to_disk`. Please use `load_from_disk` instead.
```
Did you try `load_from_disk`?
More generally, is there any reason there is no API consistency between `save_to_disk` and `push_to_hub`?
It would be nice to be able to `save_to_disk`, then upload manually to the Hub and use `load_dataset` (which works in some situations but not all)...
I have the exact same problem!
`load_from_disk` managed to load the dataset, but the bug with `load_dataset` still needs to be fixed.
Any update? I need some `load_dataset` features such as `num_proc` or `streaming` to reduce RAM usage, but I get this error.
I found that out the hard way. When a split has more than one Arrow file, the format is detected correctly; when there is only one `.arrow` file, the JSON metadata files win the count. So adding the `.arrow` format to the sorter does the trick.