Problems after upgrading to 2.6.1

Open pietrolesci opened this issue 3 years ago • 10 comments

Describe the bug

Loading a dataset_dict from disk with load_from_disk now raises KeyError: 'length'; this did not occur in v2.5.2.

Context:

  • Each individual dataset in the dict is created with Dataset.from_pandas
  • The dataset_dict is created from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
  • The pandas dataframe, besides text columns, has a column containing a dictionary whose keys may differ from row to row. Dataset.from_pandas correctly adds the missing keys with None values to each row's dictionary so that the schema can be inferred.
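Conceptually, the key-filling behaviour described in the last bullet looks like the following plain-Python sketch (this is an illustration of the idea, not the library's actual code; the column and key names are made up):

```python
# Rows whose "meta" dicts have different keys, as in the bug report
rows = [
    {"text": "a", "meta": {"lang": "en"}},
    {"text": "b", "meta": {"score": 0.5}},
]

# Union of all keys seen across the "meta" dicts
all_keys = sorted(set().union(*(r["meta"].keys() for r in rows)))

# Fill missing keys with None so every row shares one schema
for r in rows:
    r["meta"] = {k: r["meta"].get(k) for k in all_keys}

# rows[0]["meta"] is now {"lang": "en", "score": None}
# rows[1]["meta"] is now {"lang": None, "score": 0.5}
```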

Steps to reproduce the bug

Steps to reproduce:

  • Upgrade to datasets==2.6.1
  • Create a dataset from pandas dataframe with Dataset.from_pandas
  • Create a dataset_dict from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
  • Save to disk with save_to_disk

Expected behavior

Same as in v2.5.2, i.e., the dataset loads from disk without errors.

Environment info

  • datasets version: 2.6.1
  • Platform: Linux-5.4.209-129.367.amzn2int.x86_64-x86_64-with-glibc2.26
  • Python version: 3.9.13
  • PyArrow version: 9.0.0
  • Pandas version: 1.5.1

pietrolesci avatar Oct 24 '22 11:10 pietrolesci

Hi! I can't reproduce the error following these steps. Can you please provide a reproducible example?

mariosasko avatar Oct 24 '22 18:10 mariosasko

I faced the same issue:

Repro

!pip install datasets==2.6.1
from datasets import Dataset
dataset = Dataset.from_pandas(dataframe)
dataset.save_to_disk(local)

!pip install datasets==2.5.2
from datasets import load_from_disk
dataset = load_from_disk(local)

Lokiiiiii avatar Oct 24 '22 19:10 Lokiiiiii

@Lokiiiiii And what are the contents of the "dataframe" in your example?

mariosasko avatar Oct 25 '22 12:10 mariosasko

I bumped into the issue too. @Lokiiiiii thanks for the steps. I "solved" it for now by running pip install datasets>=2.6.1 everywhere.

vvalouch avatar Oct 26 '22 16:10 vvalouch

Hi all, I experienced the same issue. Please note that the pull request is related to the IMDB example in the docs and only fixes that example, so that people can follow the doc and have a working system. It does not fix Datasets itself.

maxpastor avatar Nov 18 '22 10:11 maxpastor

I'm getting the same error:

  • using the base AWS HF container, which uses datasets < 2
  • after updating the AWS HF container to use datasets 2.4

d-v-dlee avatar Dec 08 '22 20:12 d-v-dlee

Same here, running on our SageMaker pipelines. It's only happening for some but not all of our saved Datasets.

cgpeltier avatar Dec 16 '22 15:12 cgpeltier

I am also receiving this error on SageMaker but not locally. I have noticed that it occurs when the dataset folder does not contain a single file like:

dataset.arrow

but instead contains multiple files like:

data-00000-of-00002.arrow, data-00001-of-00002.arrow

I think that it may have something to do with this recent PR that updated the behaviour of dataset.save_to_disk by introducing sharding: https://github.com/huggingface/datasets/pull/5268

For now I can work around this by forcing datasets==2.8.0 both on the machine that creates the dataset and in the Hugging Face training instance (by running os.system("pip install datasets==2.8.0") at the start of the training script).

To ensure the dataset is saved locally as a single shard:

dataset.flatten_indices().save_to_disk('path/to/dataset', num_shards=1)

and then manually renaming path/to/dataset/data-00000-of-00001.arrow to path/to/dataset/dataset.arrow and updating path/to/dataset/state.json to reflect the name change, i.e., by changing state.json to this:

{
  "_data_files": [
    {
      "filename": "dataset.arrow"
    }
  ],
  "_fingerprint": "420086f0636f8727",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}
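The manual rename-and-edit steps above can be scripted with the standard library. A sketch, assuming a single-shard save as produced by the num_shards=1 call above (the helper name rename_single_shard is made up):

```python
import json
from pathlib import Path


def rename_single_shard(dataset_dir):
    """Rename the single shard to dataset.arrow and point state.json at it."""
    d = Path(dataset_dir)
    # Rename the shard file produced by save_to_disk(..., num_shards=1)
    (d / "data-00000-of-00001.arrow").rename(d / "dataset.arrow")
    # Update state.json so the loader looks for the new filename
    state_path = d / "state.json"
    state = json.loads(state_path.read_text())
    state["_data_files"] = [{"filename": "dataset.arrow"}]
    state_path.write_text(json.dumps(state, indent=2))
```

Usage would be `rename_single_shard("path/to/dataset")` after saving.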

mattdeeperinsights avatar Dec 22 '22 19:12 mattdeeperinsights

Does anyone know if this has been resolved?

svanhvitlilja avatar Dec 14 '23 14:12 svanhvitlilja

I have the same issue in datasets version 2.3.2

EdwardChang5467 avatar May 12 '24 07:05 EdwardChang5467