Problems after upgrading to 2.6.1

Open pietrolesci opened this issue 2 years ago • 10 comments

Describe the bug

Loading a dataset_dict from disk with load_from_disk now raises a KeyError "length" that did not occur in v2.5.2.

Context:

  • Each individual dataset in the dict is created with Dataset.from_pandas
  • The dataset_dict is created from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
  • Besides text columns, the pandas dataframe has a column that holds a dictionary in each row, potentially with different keys per row. Dataset.from_pandas correctly adds key: None to the dictionaries in every row so that the schema can be inferred.
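
To make the last point concrete, here is a minimal pure-Python sketch of the padding effect. The function name `pad_missing_keys` and the sample rows are hypothetical, made up for illustration; the sketch only mimics the observable result and is not how Dataset.from_pandas is implemented.

```python
from typing import Any


def pad_missing_keys(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Give every row the same keys, filling absent ones with None."""
    # Collect keys in first-seen order so the result is deterministic.
    all_keys: list[str] = []
    for row in rows:
        for key in row:
            if key not in all_keys:
                all_keys.append(key)
    # row.get(key) returns None for keys a row does not have.
    return [{key: row.get(key) for key in all_keys} for row in rows]


# Hypothetical rows: each dictionary has a different set of keys.
rows = [{"a": 1}, {"b": "x"}, {"a": 2, "c": 0.5}]
padded = pad_missing_keys(rows)
```

After padding, every row exposes the same three keys, which is what allows a single struct schema to be inferred for the column.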

Steps to reproduce the bug

  • Upgrade to datasets==2.6.1
  • Create a dataset from pandas dataframe with Dataset.from_pandas
  • Create a dataset_dict from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
  • Save to disk with save_to_disk, then load it back with load_from_disk
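
The steps above can be sketched as a short script. This is an assumed reconstruction, not the original code: the column names `text` and `meta`, the sample values, and the helper `round_trip` are hypothetical, and whether this exact data triggers the error on a given version has not been verified.

```python
# Hypothetical data: a text column plus a dict column whose keys vary by row.
TRAIN_ROWS = {
    "text": ["first example", "second example"],
    "meta": [{"a": 1}, {"b": 2}],
}
VAL_ROWS = {
    "text": ["third example"],
    "meta": [{"a": 3, "b": 4}],
}


def round_trip(directory: str):
    """Build a DatasetDict from pandas, save it, and load it back."""
    # Imported here so the data above can be inspected without the libraries.
    import pandas as pd
    from datasets import Dataset, DatasetDict, load_from_disk

    train_ds = Dataset.from_pandas(pd.DataFrame(TRAIN_ROWS))
    val_ds = Dataset.from_pandas(pd.DataFrame(VAL_ROWS))
    dataset_dict = DatasetDict({"train": train_ds, "validation": val_ds})

    dataset_dict.save_to_disk(directory)
    # The report above says the KeyError is raised at this call on 2.6.1.
    return load_from_disk(directory)
```

Calling `round_trip` with the path of an empty temporary directory under datasets==2.5.2 and again under datasets==2.6.1 should show whether the load step behaves differently between the two versions.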

Expected behavior

Same as in v2.5.2: the dataset_dict loads from disk without errors.

Environment info

  • datasets version: 2.6.1
  • Platform: Linux-5.4.209-129.367.amzn2int.x86_64-x86_64-with-glibc2.26
  • Python version: 3.9.13
  • PyArrow version: 9.0.0
  • Pandas version: 1.5.1

pietrolesci, Oct 24 '22 11:10