datasets
datasets copied to clipboard
Bug: Type Mismatch in Dataset Mapping
Issue: Type Mismatch in Dataset Mapping
Description
There is an issue with the map
function in the datasets
library where the mapped output does not reflect the expected type change. After applying a mapping function to convert an integer label to a string, the resulting type remains an integer instead of a string.
Reproduction Code
Below is a Python script that demonstrates the problem:
from datasets import Dataset
# Original data
data = {
'text': ['Hello', 'world', 'this', 'is', 'a', 'test'],
'label': [0, 1, 0, 1, 1, 0]
}
# Creating a Dataset object
dataset = Dataset.from_dict(data)
# Mapping function to convert label to string
def add_one(example):
example['label'] = str(example['label'])
return example
# Applying the mapping function
dataset = dataset.map(add_one)
# Iterating over the dataset to show results
for item in dataset:
print(item)
print(type(item['label']))
Expected Output
After applying the mapping function, the expected output should have the label
field as strings:
{'text': 'Hello', 'label': '0'}
<class 'str'>
{'text': 'world', 'label': '1'}
<class 'str'>
{'text': 'this', 'label': '0'}
<class 'str'>
{'text': 'is', 'label': '1'}
<class 'str'>
{'text': 'a', 'label': '1'}
<class 'str'>
{'text': 'test', 'label': '0'}
<class 'str'>
Actual Output
The actual output still shows the label
field values as integers:
{'text': 'Hello', 'label': 0}
<class 'int'>
{'text': 'world', 'label': 1}
<class 'int'>
{'text': 'this', 'label': 0}
<class 'int'>
{'text': 'is', 'label': 1}
<class 'int'>
{'text': 'a', 'label': 1}
<class 'int'>
{'text': 'test', 'label': 0}
<class 'int'>
Why necessary
In the case of Image process we often need to convert PIL to tensor with same column name.
Thank for every dev who review this issue. 🤗