Batched mapping of existing string column casts boolean to string
Describe the bug
Let the dataset contain a column named 'a', which is of the string type. If 'a' is converted to a boolean using batched mapping, the mapper automatically casts the boolean to a string (e.g., True -> 'true'). It only happens when the original column and the mapped column name are identical.
Thank you!
Steps to reproduce the bug
from datasets import Dataset
dset = Dataset.from_dict({'a': ['11', '22']})
dset = dset.map(lambda x: {'a': [True for _ in x['a']]}, batched=True)
print(dset['a'])
> ['true', 'true']
Expected behavior
[True, True]
Environment info
- datasets version: 2.18.0
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.31
- Python version: 3.10.13
- huggingface_hub version: 0.21.4
- PyArrow version: 15.0.2
- Pandas version: 2.2.1
- fsspec version: 2023.12.2
This seems to be hardcoded behavior in table.py array_cast.
if (
    not allow_number_to_str
    and pa.types.is_string(pa_type)
    and (pa.types.is_floating(array.type) or pa.types.is_integer(array.type))
):
    raise TypeError(
        f"Couldn't cast array of type {array.type} to {pa_type} since allow_number_to_str is set to {allow_number_to_str}"
    )
if pa.types.is_null(pa_type) and not pa.types.is_null(array.type):
    raise TypeError(f"Couldn't cast array of type {array.type} to {pa_type}")
return array.cast(pa_type)
where floats and integers are not cast to string but booleans are. Maybe this should be extended to booleans?
Thanks for reporting! @Modexus Do you want to open a PR with the suggested fix?
I'll gladly create a PR but not sure what the behavior should be.
Should a value returned from map be cast to the current feature?
At the moment this seems very inconsistent since datetime is also cast (this would only fix boolean) but nested structures are not.
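from datetime import date
from datasets import Dataset

# A date returned from map is cast to the existing string feature
dset = Dataset.from_dict({"a": ["Hello world!"]})
dset = dset.map(lambda x: {"a": date(2021, 1, 1)})
# dset[0]["a"] == '2021-01-01'

# A nested value (a list) is not cast
dset = Dataset.from_dict({"a": ["Hello world!"]})
dset = dset.map(lambda x: {"a": [True]})
# dset[0]["a"] == [True]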
Is there a reason to cast the value if the user doesn't specify it explicitly? It seems tricky that some things are cast and some are not.
Indeed, it also makes sense to raise a TypeError for temporal and decimal types.
Is there a reason to cast the value if the user doesn't specify it explicitly?
This is how PyArrow's built-in cast behaves - it allows casting from primitive types to strings. Hence, we need allow_number_to_str to disallow such casts (e.g., in the scenario when we are "trying a type" to preserve the original type if there is a column in the output dataset with the same name as in the input one).
PS: In the PR, we can introduce allow_numeric_to_str (for floats, integers, decimals, booleans) and allow_temporal_to_str (for dates, timestamps, ...) and deprecate allow_number_to_str to make it clear what each parameter does.
Would just allow_primitive_to_str work?
This should include all numeric, boolean, and temporal formats.
Note that at least in the C++ implementation numeric seems to exclude boolean.
Indeed, allow_primitive_to_str sounds better.
PS: PyArrow's pa.types.is_primitive returns False for decimal types, but I think it is okay for us to treat decimals as primitive types (or we can have allow_decimal_to_str to be fully consistent with PyArrow).
Fixed by:
- #6811