
Fix data file module inference

HennerM opened this issue 1 year ago · 3 comments

I saved a dataset with two splits to disk with DatasetDict.save_to_disk. The train split is bigger and ended up in 10 shards, whereas the test split resulted in only a single shard.

Now when trying to load the dataset, an error is raised that not all splits have the same data format:

```
ValueError: Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('arrow', {}), NamedSplit('test'): ('json', {})}
```

This is not expected because both splits are saved as arrow files.

I did some debugging and found that this happens because the list of data_files includes a state.json file.
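For context, a DatasetDict saved with save_to_disk produces a per-split layout roughly like the following (shard counts taken from this report; exact file names may differ, and other metadata files are omitted):

```
my_dataset/
├── train/
│   ├── data-00000-of-00010.arrow
│   ├── ...
│   ├── data-00009-of-00010.arrow
│   └── state.json
└── test/
    ├── data-00000-of-00001.arrow
    └── state.json
```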

For the train split this means I get 10 ".arrow" files and 1 ".json" file. Since datasets picks the module based on the most common extension, this is correctly inferred as "arrow". In the test split, however, there is 1 ".arrow" and 1 ".json" file. The function description says:

It picks the module based on the most common file extension. In case of a draw ".parquet" is the favorite, and then alphabetical order.

This is not quite true though, because in a tie the extensions are actually based on reverse-alphabetical order:

```python
for (ext, _), _ in sorted(extensions_counter.items(), key=sort_key, reverse=True):
```

This leads to the module for the test split being wrongly inferred as "json", whereas it should be "arrow", matching the train split.
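To illustrate the tie-break, here is a simplified, self-contained sketch of the extension-counting logic (the function name and sort key are approximations, not the exact code in src/datasets/load.py):

```python
from collections import Counter

def infer_module_from_extensions(filenames):
    # Count file extensions; the real code sorts by roughly
    # (count, is_parquet, ext) and iterates with reverse=True,
    # so ties break in *reverse*-alphabetical order.
    extensions_counter = Counter(
        name.rsplit(".", 1)[-1] for name in filenames if "." in name
    )

    def sort_key(item):
        ext, count = item
        return (count, ext == "parquet", ext)

    for ext, _ in sorted(extensions_counter.items(), key=sort_key, reverse=True):
        return ext

# Train split: 10 arrow shards win on count.
train_files = [f"data-{i:05d}-of-00010.arrow" for i in range(10)] + ["state.json"]
print(infer_module_from_extensions(train_files))  # arrow

# Test split: 1 arrow vs 1 json is a tie; "json" > "arrow" reverse-alphabetically.
print(infer_module_from_extensions(["data-00000-of-00001.arrow", "state.json"]))  # json
```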

I first thought about adding "state.json" to the list of excluded files for the inference: https://github.com/huggingface/datasets/blob/main/src/datasets/load.py#L513. However, after digging into the code, I think the right thing to do is to exclude it from the list of data_files in the first place, because state.json is metadata rather than a data file.
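A minimal sketch of that proposed exclusion (the METADATA_FILENAMES set and helper name are hypothetical illustrations, not the actual datasets API):

```python
# Hypothetical exclusion set; in a real fix this would live alongside
# the existing excluded-files handling in src/datasets/load.py.
METADATA_FILENAMES = {"state.json"}

def exclude_metadata_files(data_files):
    # Drop save_to_disk metadata so it never reaches module inference.
    return [f for f in data_files if f.rsplit("/", 1)[-1] not in METADATA_FILENAMES]

print(exclude_metadata_files(["test/data-00000-of-00001.arrow", "test/state.json"]))
# ['test/data-00000-of-00001.arrow']
```

With state.json filtered out, the test split contains only ".arrow" files and the module is inferred as "arrow" without relying on the tie-break.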

HennerM avatar Aug 29 '24 13:08 HennerM

Hi ! datasets saved using save_to_disk should be loaded with load_from_disk ;)

lhoestq avatar Sep 02 '24 14:09 lhoestq

It is convenient to just pass a path to a local dataset, or one from the Hub, and use the same function to load it. Would it be possible to get this fix merged to allow that?

HennerM avatar Sep 02 '24 16:09 HennerM

We can modify save_to_disk to write the dataset in a structure supported by the Hub in this case; it's kind of a legacy function anyway.

lhoestq avatar Sep 02 '24 19:09 lhoestq