
Loading CSV exported dataset has unexpected format

Open OrianeN opened this issue 1 year ago • 2 comments

Describe the bug

I wanted to save a HF dataset of translations and load it again in another script, but I'm confused by the documentation and by the result I got, so I'm opening this issue to ask whether this behavior is expected.

Steps to reproduce the bug

The documentation I've mainly consulted is https://huggingface.co/docs/datasets/v2.16.1/en/package_reference/loading_methods#datasets.load_dataset and https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset (where I've found .to_csv())

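from datasets import load_dataset

# Load a dataset of translations
test_dataset = load_dataset("opus100", name="en-fr", split="test")

# Save with .to_csv()
test_csv_path = "try_testset_save.csv"
test_dataset.to_csv(test_csv_path)

# Load dataset from the CSV
loaded_dataset = load_dataset("csv", data_files=test_csv_path)
# Without split=..., load_dataset returns a DatasetDict; the CSV rows land in "train"
test_dataset_fromfile = loaded_dataset["train"]
print(test_dataset_fromfile[0]["translation"])
# Raises TypeError: the "translation" value comes back as a str, not a dict
print(test_dataset_fromfile[0]["translation"]["en"])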
Creating CSV from Arrow format: 100% 2/2 [00:00<00:00, 47.99ba/s]
Downloading data files: 100% 1/1 [00:00<00:00, 65.33it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 42.10it/s]
Generating train split: 2000/0 [00:00<00:00, 47486.09 examples/s]

{'en': "She wasn't going to vaccinate her kid against polio,  no way.", 'fr': 'Elle ne vaccinerait pas son enfant contre la polio. Pas question.'}

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[29], line 11
      9 loaded_dataset = load_dataset("csv", data_files=test_csv_path)
     10 print(test_dataset_fromfile[0]["translation"])
---> 11 print(test_dataset_fromfile[0]["translation"]["en"])

TypeError: string indices must be integers, not 'str'

Expected behavior

Each translation was saved as a stringified dict, like "{'en': ""She wasn't going to vaccinate her kid against polio, no way."", 'fr': 'Elle ne vaccinerait pas son enfant contre la polio. Pas question.'}", whereas I expected two columns (the first with the English segments, the second with the French segments). I also expected load_dataset to infer the feature type automatically, as I haven't seen anything about this in the documentation.
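The stringified-dict behavior can be illustrated without the datasets library. The sketch below is only an analogy built on Python's standard csv module, not the code path datasets uses internally, and the sample sentences are made up. It shows that a CSV cell is plain text, so a nested value comes back as a string, and that ast.literal_eval can turn such a string back into a dict as a workaround.

```python
import ast
import csv
import io

# A nested record shaped like one row of the translation dataset (made-up text).
row = {"translation": {"en": "Hello, world.", "fr": "Bonjour le monde."}}

# Write it to an in-memory CSV; the writer stores the dict as its str() form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["translation"])
writer.writeheader()
writer.writerow(row)

# Read it back: the cell is now plain text, not a dict.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
cell = loaded["translation"]
print(type(cell).__name__)  # str

# Indexing the string with a key raises the same kind of TypeError as in the report.
try:
    cell["en"]
except TypeError as err:
    print("TypeError:", err)

# Workaround sketch: parse the Python-literal string back into a dict.
restored = ast.literal_eval(cell)
print(restored["en"])
```

Parsing each cell this way works only because the stored text is a valid Python literal; it is a stopgap rather than a substitute for a format that keeps the nested type.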

Do you have an example of how to effectively save and load datasets of translations?

Environment info

  • datasets version: 2.15.0
  • Platform: Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.11.5
  • huggingface_hub version: 0.16.4
  • PyArrow version: 14.0.2
  • Pandas version: 2.1.4
  • fsspec version: 2023.10.0

OrianeN • Jan 18 '24 14:01

Hi! Parquet is the only format that supports complex/nested features such as Translation. So, this should work:

from datasets import load_dataset

test_dataset = load_dataset("opus100", name="en-fr", split="test")

# Save with .to_parquet()
test_parquet_path = "try_testset_save.parquet"
test_dataset.to_parquet(test_parquet_path)

# Load dataset from the Parquet
loaded_dataset = load_dataset("parquet", data_files=test_parquet_path)
# Without split=..., load_dataset returns a DatasetDict; the rows land in "train"
test_dataset_fromfile = loaded_dataset["train"]
print(test_dataset_fromfile[0]["translation"])
print(test_dataset_fromfile[0]["translation"]["en"])
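If a CSV file is still required, one option is to flatten the nested column before saving, so that each language gets its own column, which is the layout the report expected. The datasets library provides Dataset.flatten() for this kind of step. The sketch below only mimics the idea with the standard library: flatten_row is a hypothetical helper, the dotted column names are an assumption chosen for illustration, and the sample rows are made up.

```python
import csv
import io

# Two nested records shaped like rows of the translation dataset (made-up text).
rows = [
    {"translation": {"en": "Hello.", "fr": "Bonjour."}},
    {"translation": {"en": "Thank you.", "fr": "Merci."}},
]


def flatten_row(row):
    """Hypothetical helper: expand one level of nested dicts into dotted columns."""
    flat = {}
    for column, value in row.items():
        if isinstance(value, dict):
            for key, inner in value.items():
                flat[f"{column}.{key}"] = inner
        else:
            flat[column] = value
    return flat


flat_rows = [flatten_row(r) for r in rows]

# Write the flat rows: one CSV column per language.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat_rows[0]))
writer.writeheader()
writer.writerows(flat_rows)

# Read them back: every cell is a plain string, addressable by column name.
buffer.seek(0)
loaded_rows = list(csv.DictReader(buffer))
print(loaded_rows[0]["translation.en"])
print(loaded_rows[1]["translation.fr"])
```

The trade-off is that the loaded data no longer has a single nested translation field, so downstream code has to read the per-language columns instead.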

mariosasko • Jan 22 '24 15:01

Indeed, this works great, thank you!

OrianeN • Jan 23 '24 14:01