Problems after upgrading to 2.6.1
Describe the bug
Loading a dataset_dict from disk with `load_from_disk` now raises a KeyError "length" that did not occur in v2.5.2.
Context:
- Each individual dataset in the dict is created with `Dataset.from_pandas`
- The dataset_dict is created from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
- The pandas dataframe, besides text columns, has a column containing a dictionary with potentially different keys in each row. Correctly, the `Dataset.from_pandas` function adds `key: None` to the dictionaries in every row so that the schema can be inferred (see the sketch below).
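For illustration, a minimal sketch of the kind of dataframe described above (column names and values are made up):

```python
import pandas as pd
from datasets import Dataset

# Hypothetical dataframe: a text column plus a dict column whose keys differ per row.
df = pd.DataFrame({
    "text": ["first example", "second example"],
    "meta": [{"author": "alice"}, {"source": "web", "year": 2021}],
})

train_ds = Dataset.from_pandas(df)
# The inferred schema is a struct over the union of all dict keys;
# keys missing in a given row come back as None.
print(train_ds.features)
print(train_ds[0]["meta"])
```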
Steps to reproduce the bug
Steps to reproduce:
- Upgrade to datasets==2.6.1
- Create a dataset from a pandas dataframe with `Dataset.from_pandas`
- Create a dataset_dict from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
- Save to disk with `save_to_disk` (a sketch of the full flow follows this list)
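A sketch of that flow (dataframes and paths are placeholders):

```python
from datasets import Dataset, DatasetDict, load_from_disk

# train_df and val_df are pandas dataframes like the one described in the context above.
train_ds = Dataset.from_pandas(train_df)
val_ds = Dataset.from_pandas(val_df)

ds_dict = DatasetDict({"train": train_ds, "validation": val_ds})
ds_dict.save_to_disk("path/to/dataset_dict")

# With datasets==2.6.1 this raises KeyError: 'length'; with v2.5.2 it loads fine.
reloaded = load_from_disk("path/to/dataset_dict")
```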
Expected behavior
Same as in v2.5.2, that is, loading from disk without errors.
Environment info
- `datasets` version: 2.6.1
- Platform: Linux-5.4.209-129.367.amzn2int.x86_64-x86_64-with-glibc2.26
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.5.1
Hi! I can't reproduce the error following these steps. Can you please provide a reproducible example?
I faced the same issue:
Repro
!pip install datasets==2.6.1
from datasets import Dataset
dataset = Dataset.from_pandas(dataframe)
dataset.save_to_disk(local)

!pip install datasets==2.5.2
from datasets import load_from_disk
dataset = load_from_disk(local)
@Lokiiiiii And what are the contents of the "dataframe" in your example?
I bumped into the issue too. @Lokiiiiii thanks for the steps. I "solved" it for now by running `pip install datasets>=2.6.1` everywhere.
Hi all, I experienced the same issue. Please note that the pull request only relates to the IMDB example provided in the docs and fixes it in that context, so that people can follow the doc example and have a working system. It does not provide a fix for Datasets itself.
I'm getting the same error:
- using the base AWS HF container, which uses datasets < 2
- after updating the AWS HF container to use datasets 2.4
Same here, running on our SageMaker pipelines. It's only happening for some but not all of our saved Datasets.
I am also receiving this error on SageMaker but not locally. I have noticed that this occurs when the .dataset/ folder does not contain a single file like:
dataset.arrow
but instead contains multiple files like:
data-00000-of-00002.arrow
data-00001-of-00002.arrow
I think it may have something to do with this recent PR, which updated the behaviour of `dataset.save_to_disk` by introducing sharding: https://github.com/huggingface/datasets/pull/5268
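As a quick illustration of the sharded layout (a sketch assuming datasets >= 2.8.0, where `save_to_disk` accepts `num_shards`):

```python
import os
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "c", "d"]})

# Newer versions can split the Arrow data into several shards on save.
ds.save_to_disk("sharded_dataset", num_shards=2)
print(sorted(os.listdir("sharded_dataset")))
# e.g. ['data-00000-of-00002.arrow', 'data-00001-of-00002.arrow',
#       'dataset_info.json', 'state.json']
# Older versions wrote a single dataset.arrow file and cannot read this layout.
```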
For now I can get around this by forcing datasets==2.8.0 on the machine that creates the dataset and in the Hugging Face instance used for training (by running `os.system("pip install datasets==2.8.0")` at the start of the training script).
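A minimal sketch of that pin at the very top of the training script:

```python
import os

# Pin datasets inside the training container before it is imported anywhere.
os.system("pip install datasets==2.8.0")

import datasets  # imported after the pin so the upgraded version is picked up
```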
To ensure the dataset is a single shard when saving it locally:
dataset.flatten_indices().save_to_disk('path/to/dataset', num_shards=1)
and then manually rename path/to/dataset/data-00000-of-00001.arrow to path/to/dataset/dataset.arrow and update path/to/dataset/state.json to reflect the name change, i.e. by changing state.json to this (a small script automating these steps is sketched after the JSON below):
{
"_data_files": [
{
"filename": "dataset.arrow"
}
],
"_fingerprint": "420086f0636f8727",
"_format_columns": null,
"_format_kwargs": {},
"_format_type": null,
"_output_all_columns": false,
"_split": null
}
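For convenience, an untested sketch that automates the single-shard save, the rename, and the state.json patch described above (the helper name is made up):

```python
import json
import os


def save_single_shard(dataset, path):
    """Save `dataset` as a single shard named dataset.arrow, as older loaders expect."""
    dataset.flatten_indices().save_to_disk(path, num_shards=1)

    # Rename the single shard produced by newer versions of save_to_disk.
    os.rename(
        os.path.join(path, "data-00000-of-00001.arrow"),
        os.path.join(path, "dataset.arrow"),
    )

    # Point state.json at the renamed file.
    state_file = os.path.join(path, "state.json")
    with open(state_file) as f:
        state = json.load(f)
    state["_data_files"] = [{"filename": "dataset.arrow"}]
    with open(state_file, "w") as f:
        json.dump(state, f, indent=2)
```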
Does anyone know if this has been resolved?
I have the same issue in datasets version 2.3.2