Problems after upgrading to 2.6.1
Describe the bug
Loading a dataset_dict from disk with `load_from_disk` now raises a KeyError "length" that did not occur in v2.5.2.
Context:
- Each individual dataset in the dict is created with `Dataset.from_pandas`
- The dataset_dict is created from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
- The pandas dataframe, besides text columns, has a column containing a dictionary with potentially different keys in each row. Correctly, the `Dataset.from_pandas` function adds `key: None` to the dictionaries in every row so that the schema can be inferred (see the sketch below).
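For illustration, a minimal sketch of the kind of dataframe described above (column names and values are made up):

```python
import pandas as pd
from datasets import Dataset

# Hypothetical dataframe: a text column plus a dict column whose keys differ per row.
df = pd.DataFrame({
    "text": ["first example", "second example"],
    "meta": [{"author": "alice"}, {"source": "web", "year": 2021}],
})

train_ds = Dataset.from_pandas(df)
# The inferred schema is a struct over the union of all dict keys;
# keys missing in a given row come back as None.
print(train_ds.features)
print(train_ds[0]["meta"])
```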
Steps to reproduce the bug
Steps to reproduce:
- Upgrade to datasets==2.6.1
- Create a dataset from a pandas dataframe with `Dataset.from_pandas`
- Create a dataset_dict from a dict of Datasets, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
- Save to disk with `save_to_disk` (a sketch of the full flow follows this list)
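A sketch of that flow (dataframes and paths are placeholders):

```python
from datasets import Dataset, DatasetDict, load_from_disk

# train_df and val_df are pandas dataframes like the one described in the context above.
train_ds = Dataset.from_pandas(train_df)
val_ds = Dataset.from_pandas(val_df)

ds_dict = DatasetDict({"train": train_ds, "validation": val_ds})
ds_dict.save_to_disk("path/to/dataset_dict")

# With datasets==2.6.1 this raises KeyError: 'length'; with v2.5.2 it loads fine.
reloaded = load_from_disk("path/to/dataset_dict")
```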
Expected behavior
Same as in v2.5.2, that is, loading from disk without errors.
Environment info
- `datasets` version: 2.6.1
- Platform: Linux-5.4.209-129.367.amzn2int.x86_64-x86_64-with-glibc2.26
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.5.1
Hi! I can't reproduce the error following these steps. Can you please provide a reproducible example?
I faced the same issue:
Repro
!pip install datasets==2.6.1
from datasets import Dataset
dataset = Dataset.from_pandas(dataframe)
dataset.save_to_disk(local)

!pip install datasets==2.5.2
from datasets import load_from_disk
dataset = load_from_disk(local)
@Lokiiiiii And what are the contents of the "dataframe" in your example?
I bumped into the issue too. @Lokiiiiii thanks for the steps. I "solved" it for now by running `pip install datasets>=2.6.1` everywhere.
Hi all, I experienced the same issue. Please note that the pull request only relates to the IMDB example provided in the docs and fixes it in that context, so that people can follow the doc example and have a working system. It does not provide a fix for Datasets itself.
I'm getting the same error:
- using the base AWS HF container, which uses datasets < 2
- after updating the AWS HF container to use datasets 2.4
Same here, running on our SageMaker pipelines. It's only happening for some but not all of our saved Datasets.
I am also receiving this error on SageMaker but not locally. I have noticed that this occurs when the .dataset/ folder does not contain a single file like:
dataset.arrow
but instead contains multiple files like:
data-00000-of-00002.arrow
data-00001-of-00002.arrow
I think it may have something to do with this recent PR, which updated the behaviour of `dataset.save_to_disk` by introducing sharding: https://github.com/huggingface/datasets/pull/5268
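As a quick illustration of the sharded layout (a sketch assuming datasets >= 2.8.0, where `save_to_disk` accepts `num_shards`):

```python
import os
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "c", "d"]})

# Newer versions can split the Arrow data into several shards on save.
ds.save_to_disk("sharded_dataset", num_shards=2)
print(sorted(os.listdir("sharded_dataset")))
# e.g. ['data-00000-of-00002.arrow', 'data-00001-of-00002.arrow',
#       'dataset_info.json', 'state.json']
# Older versions wrote a single dataset.arrow file and cannot read this layout.
```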
For now I can get around this by forcing datasets==2.8.0 on the machine that creates the dataset and in the Hugging Face instance used for training (by running `os.system("pip install datasets==2.8.0")` at the start of the training script).
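A minimal sketch of that pin at the very top of the training script:

```python
import os

# Pin datasets inside the training container before it is imported anywhere.
os.system("pip install datasets==2.8.0")

import datasets  # imported after the pin so the upgraded version is picked up
```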
To ensure the dataset is a single shard when saving it locally:
dataset.flatten_indices().save_to_disk('path/to/dataset', num_shards=1)
and then manually rename path/to/dataset/data-00000-of-00001.arrow to path/to/dataset/dataset.arrow and update path/to/dataset/state.json to reflect the name change, i.e. by changing state.json to this (a small script automating these steps is sketched after the JSON below):
{
"_data_files": [
{
"filename": "dataset.arrow"
}
],
"_fingerprint": "420086f0636f8727",
"_format_columns": null,
"_format_kwargs": {},
"_format_type": null,
"_output_all_columns": false,
"_split": null
}
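For convenience, an untested sketch that automates the single-shard save, the rename, and the state.json patch described above (the helper name is made up):

```python
import json
import os


def save_single_shard(dataset, path):
    """Save `dataset` as a single shard named dataset.arrow, as older loaders expect."""
    dataset.flatten_indices().save_to_disk(path, num_shards=1)

    # Rename the single shard produced by newer versions of save_to_disk.
    os.rename(
        os.path.join(path, "data-00000-of-00001.arrow"),
        os.path.join(path, "dataset.arrow"),
    )

    # Point state.json at the renamed file.
    state_file = os.path.join(path, "state.json")
    with open(state_file) as f:
        state = json.load(f)
    state["_data_files"] = [{"filename": "dataset.arrow"}]
    with open(state_file, "w") as f:
        json.dump(state, f, indent=2)
```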
Does anyone know if this has been resolved?
I have the same issue in datasets version 2.3.2