Bug: Type Mismatch in Dataset Mapping

Open marko1616 opened this issue 5 months ago • 3 comments

Issue: Type Mismatch in Dataset Mapping

Description

The map function in the datasets library does not apply an expected type change to an existing column. After mapping a function that converts the integer label to a string, the values in the resulting dataset are still integers rather than strings.

Reproduction Code

Below is a Python script that demonstrates the problem:

from datasets import Dataset

# Original data
data = {
    'text': ['Hello', 'world', 'this', 'is', 'a', 'test'],
    'label': [0, 1, 0, 1, 1, 0]
}

# Creating a Dataset object
dataset = Dataset.from_dict(data)

# Mapping function to convert label to string
def label_to_str(example):
    example['label'] = str(example['label'])
    return example

# Applying the mapping function
dataset = dataset.map(label_to_str)

# Iterating over the dataset to show results
for item in dataset:
    print(item)
    print(type(item['label']))

Expected Output

After the mapping function is applied, the label field should contain strings:

{'text': 'Hello', 'label': '0'}
<class 'str'>
{'text': 'world', 'label': '1'}
<class 'str'>
{'text': 'this', 'label': '0'}
<class 'str'>
{'text': 'is', 'label': '1'}
<class 'str'>
{'text': 'a', 'label': '1'}
<class 'str'>
{'text': 'test', 'label': '0'}
<class 'str'>

Actual Output

The actual output still shows the label field values as integers:

{'text': 'Hello', 'label': 0}
<class 'int'>
{'text': 'world', 'label': 1}
<class 'int'>
{'text': 'this', 'label': 0}
<class 'int'>
{'text': 'is', 'label': 1}
<class 'int'>
{'text': 'a', 'label': 1}
<class 'int'>
{'text': 'test', 'label': 0}
<class 'int'>

Why This Is Necessary

In image processing, we often need to convert a PIL image to a tensor while keeping the same column name.

Thanks to every dev who reviews this issue. 🤗

Sep 03 '24 16:09 marko1616
