Dataset.to_csv() missing commas in columns with lists
### Describe the bug
The to_csv() method does not output commas between list elements. As a result, when the Dataset is loaded back in, the column that held a list no longer has the correct data structure.
Here's an example:
Obviously, it's not as trivial as inserting commas in the list, since it's a comma-separated file. But hopefully there's a way to export the list so that load_dataset() imports it correctly.
### Steps to reproduce the bug
Here's some code to reproduce the bug:
from datasets import Dataset

ds = Dataset.from_dict(
    {
        "pokemon": ["bulbasaur", "squirtle"],
        "type": ["grass", "water"]
    }
)

def ascii_to_hex(text):
    return [ord(c) for c in text]

ds = ds.map(lambda x: {"int": ascii_to_hex(x['pokemon'])})
ds.to_csv('../output/temp.csv')
temp.csv then contains:
### Expected behavior
ACTUAL OUTPUT:
pokemon,type,int
bulbasaur,grass,[ 98 117 108 98 97 115 97 117 114]
squirtle,water,[115 113 117 105 114 116 108 101]
EXPECTED OUTPUT:
pokemon,type,int
bulbasaur,grass,[98, 117, 108, 98, 97, 115, 97, 117, 114]
squirtle,water,[115, 113, 117, 105, 114, 116, 108, 101]
or probably something more like this since it's a CSV file:
pokemon,type,int
bulbasaur,grass,"[98, 117, 108, 98, 97, 115, 97, 117, 114]"
squirtle,water,"[115, 113, 117, 105, 114, 116, 108, 101]"
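
For illustration, the quoted form is what Python's standard csv module produces when a list is first serialized to a string that contains commas. The sketch below does not use datasets or pandas at all; it only assumes the list is encoded with json.dumps before writing and decoded with json.loads after reading, which is one possible convention and not something the library does for you.

```python
import csv
import io
import json

rows = [
    ("bulbasaur", "grass", [98, 117, 108, 98, 97, 115, 97, 117, 114]),
    ("squirtle", "water", [115, 113, 117, 105, 114, 116, 108, 101]),
]

# Write to an in-memory buffer so the example has no file-system side effects.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["pokemon", "type", "int"])
for name, kind, codes in rows:
    # json.dumps yields "[98, 117, ...]"; the writer wraps the field in
    # double quotes because it contains the delimiter.
    writer.writerow([name, kind, json.dumps(codes)])

csv_text = buffer.getvalue()

# Read the text back and decode the list column explicitly.
reader = csv.DictReader(io.StringIO(csv_text))
restored = [json.loads(row["int"]) for row in reader]
```

Here restored holds the original Python lists again. The decoding step is explicit: a CSV reader on its own only ever returns the column as a string.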
### Environment info
### Package Version
Name: datasets
Version: 2.16.1
### Python
version: 3.10.12
### OS Info
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
...
UBUNTU_CODENAME=jammy
Hello!
This is due to how pandas writes NumPy arrays to CSV (Source). To fix this, you can convert them to lists yourself.
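df = ds.to_pandas()
# tolist() yields plain Python ints, so each cell is written as "[98, 117, ...]"
df['int'] = df['int'].apply(lambda arr: arr.tolist())
# the path must come first: positional arguments cannot follow keyword arguments
df.to_csv('../output/temp.csv', index=False)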
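
To see where the missing commas come from, it may help to compare the string forms directly. This is a minimal sketch that assumes NumPy is installed and that each CSV cell ends up as the str() of the object it holds; the variable names are only illustrative.

```python
import numpy as np

codes = [ord(c) for c in "squirtle"]

# A NumPy array prints its elements separated by spaces, with no commas.
as_array = str(np.array(codes))

# After tolist() the value is a plain Python list, which prints with commas.
as_list = str(np.array(codes).tolist())

print(as_array)  # [115 113 117 105 114 116 108 101]
print(as_list)   # [115, 113, 117, 105, 114, 116, 108, 101]
```

The first form matches the actual output reported in the issue, and the second matches the expected output.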
I think it would be good if datasets did the conversion itself, but it's a breaking change, so I would wait for the green light from someone at HF.