Problems after upgrading to 2.6.1
Describe the bug
Loading a dataset_dict from disk with `load_from_disk` now raises a `KeyError` "length" that did not occur in v2.5.2.
Context:
- Each individual dataset in the dict is created with `Dataset.from_pandas`
- The dataset_dict is created from a dict of `Dataset`s, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
- The pandas dataframe, besides text columns, has a column with a dictionary inside and potentially different keys in each row. The `Dataset.from_pandas` function correctly adds `key: None` to all dictionaries in each row so that the schema can be correctly inferred.
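
To make the last point concrete, here is a minimal pure-Python sketch of the effect described above: every row ends up with the union of all keys, and keys a row did not have are set to `None`. The helper name `fill_missing_keys` is hypothetical and this is not how `datasets` or PyArrow implement it internally; it only illustrates the resulting shape of the data.

```python
def fill_missing_keys(rows):
    """Return copies of the row dicts that all share the same keys.

    Hypothetical helper for illustration only: keys absent from a row
    are filled with None, mirroring the behavior described in the issue.
    """
    # Collect the union of keys, preserving first-seen order.
    all_keys = []
    for row in rows:
        for key in row:
            if key not in all_keys:
                all_keys.append(key)
    # Rebuild each row with every key present; missing ones become None.
    return [{key: row.get(key) for key in all_keys} for row in rows]


rows = [{"a": 1}, {"b": "x"}]
filled = fill_missing_keys(rows)
print(filled)  # [{'a': 1, 'b': None}, {'a': None, 'b': 'x'}]
```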
Steps to reproduce the bug
- Upgrade to datasets==2.6.1
- Create a dataset from pandas dataframe with `Dataset.from_pandas`
- Create a dataset_dict from a dict of `Dataset`s, e.g., `DatasetDict({"train": train_ds, "validation": val_ds})`
- Save to disk with the `save` function
Expected behavior
Same as in v2.5.2, that is, loading from disk without errors.
Environment info
- `datasets` version: 2.6.1
- Platform: Linux-5.4.209-129.367.amzn2int.x86_64-x86_64-with-glibc2.26
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.5.1