Loading CSV exported dataset has unexpected format
Describe the bug
I wanted to save a HF dataset of translations and load it again in another script, but I'm a bit confused by the documentation and by the result I got, so I'm opening this issue to ask whether this behavior is expected.
Steps to reproduce the bug
The documentation I've mainly consulted is https://huggingface.co/docs/datasets/v2.16.1/en/package_reference/loading_methods#datasets.load_dataset and https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset (where I found .to_csv()).
from datasets import load_dataset

# Load a dataset of translations
test_dataset = load_dataset("opus100", name="en-fr", split="test")
# Save with .to_csv()
test_csv_path = "try_testset_save.csv"
test_dataset.to_csv(test_csv_path)
# Load dataset from the CSV
loaded_dataset = load_dataset("csv", data_files=test_csv_path)
print(test_dataset_fromfile[0]["translation"])
print(test_dataset_fromfile[0]["translation"]["en"])
Creating CSV from Arrow format: 100% 2/2 [00:00<00:00, 47.99ba/s]
Downloading data files: 100% 1/1 [00:00<00:00, 65.33it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 42.10it/s]
Generating train split: 2000/0 [00:00<00:00, 47486.09 examples/s]
{'en': "She wasn't going to vaccinate her kid against polio, no way.", 'fr': 'Elle ne vaccinerait pas son enfant contre la polio. Pas question.'}
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[29], line 11
9 loaded_dataset = load_dataset("csv", data_files=test_csv_path)
10 print(test_dataset_fromfile[0]["translation"])
---> 11 print(test_dataset_fromfile[0]["translation"]["en"])
TypeError: string indices must be integers, not 'str'
Expected behavior
Each translation was saved as a stringified dict like "{'en': ""She wasn't going to vaccinate her kid against polio, no way."", 'fr': 'Elle ne vaccinerait pas son enfant contre la polio. Pas question.'}", whereas I would have expected 2 columns (the 1st with English segments and the 2nd with French segments). I was also expecting load_dataset to infer the feature type automatically, as I haven't seen anything about this in the documentation.
Do you have an example of how to effectively save and load datasets of translations?
Environment info
- datasets version: 2.15.0
- Platform: Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.11.5
- huggingface_hub version: 0.16.4
- PyArrow version: 14.0.2
- Pandas version: 2.1.4
- fsspec version: 2023.10.0
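For readers who already have a CSV like the one described under "Expected behavior", the stringified dicts can usually be parsed back with the standard library. The following is only a minimal sketch: it rebuilds one such cell in memory with the csv module rather than reading the real export, and it assumes the cell content is a plain Python dict literal, as in the sample above.

```python
import ast
import csv
import io

# The translation pair from the report, as a regular Python dict.
original = {
    "en": "She wasn't going to vaccinate her kid against polio, no way.",
    "fr": "Elle ne vaccinerait pas son enfant contre la polio. Pas question.",
}

# Mimic the export: the nested value ends up in one CSV cell as its string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["translation"])
writer.writerow([str(original)])

# Reading the CSV back yields a plain string, which is why ["en"] fails on it.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
cell = rows[0]["translation"]

# ast.literal_eval safely evaluates a Python literal (no arbitrary code execution).
parsed = ast.literal_eval(cell)
print(type(cell).__name__, type(parsed).__name__)
print(parsed["en"])
```

This only recovers the data after the fact; it does not restore the original feature type of the column.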
Hi! CSV is a flat format, so nested values end up stored as strings. Unlike CSV, Parquet supports complex/nested features such as Translation. So, this should work:
from datasets import load_dataset

test_dataset = load_dataset("opus100", name="en-fr", split="test")
# Save with .to_parquet()
test_parquet_path = "try_testset_save.parquet"
test_dataset.to_parquet(test_parquet_path)
# Load dataset from the Parquet file; split="train" returns a Dataset instead of a DatasetDict
loaded_dataset = load_dataset("parquet", data_files=test_parquet_path, split="train")
print(loaded_dataset[0]["translation"])
print(loaded_dataset[0]["translation"]["en"])
Indeed, this works great, thank you!