datasets
Can't create a dataset with `float16` features
Describe the bug
I can't create a dataset with `float16` features.
I understand from the traceback that this is a `pyarrow` error, but I don't see anything in the `datasets` documentation about how to do this successfully. Is it actually supported? I've tried older versions of `pyarrow` as well, with the exact same error.
The bug seems to arise from `datasets` casting the values to `double`, after which `pyarrow` doesn't know how to convert them back to `float16`... does that sound right? Is there a way to bypass this, since it's not necessary in the `numpy` and `torch` cases?
Thanks!
Steps to reproduce the bug
All of the snippets below raise this error, with (as far as I can tell) the exact same traceback:
ArrowNotImplementedError: Unsupported cast from double to halffloat using function cast_half_float
from datasets import Dataset, Features, Value
Dataset.from_dict({"x": [0.0, 1.0, 2.0]}, features=Features(x=Value("float16")))
import numpy as np
Dataset.from_dict({"x": np.arange(3, dtype=np.float16)}, features=Features(x=Value("float16")))
import torch
Dataset.from_dict({"x": torch.arange(3).to(torch.float16)}, features=Features(x=Value("float16")))
Expected results
A dataset with `float16` features is successfully created.
Actual results
---------------------------------------------------------------------------
ArrowNotImplementedError Traceback (most recent call last)
Cell In [14], line 1
----> 1 Dataset.from_dict({"x": [1.0, 2.0, 3.0]}, features=Features(x=Value("float16")))
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/arrow_dataset.py:870, in Dataset.from_dict(cls, mapping, features, info, split)
865 mapping = features.encode_batch(mapping)
866 mapping = {
867 col: OptimizedTypedSequence(data, type=features[col] if features is not None else None, col=col)
868 for col, data in mapping.items()
869 }
--> 870 pa_table = InMemoryTable.from_pydict(mapping=mapping)
871 if info.features is None:
872 info.features = Features({col: ts.get_inferred_type() for col, ts in mapping.items()})
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/table.py:750, in InMemoryTable.from_pydict(cls, *args, **kwargs)
734 @classmethod
735 def from_pydict(cls, *args, **kwargs):
736 """
737 Construct a Table from Arrow arrays or columns
738
(...)
748 :class:`datasets.table.Table`:
749 """
--> 750 return cls(pa.Table.from_pydict(*args, **kwargs))
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/table.pxi:3648, in pyarrow.lib.Table.from_pydict()
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/table.pxi:5174, in pyarrow.lib._from_pydict()
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/array.pxi:343, in pyarrow.lib.asarray()
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/array.pxi:231, in pyarrow.lib.array()
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/array.pxi:110, in pyarrow.lib._handle_arrow_array_protocol()
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/arrow_writer.py:197, in TypedSequence.__arrow_array__(self, type)
192 # otherwise we can finally use the user's type
193 elif type is not None:
194 # We use cast_array_to_feature to support casting to custom types like Audio and Image
195 # Also, when trying type "string", we don't want to convert integers or floats to "string".
196 # We only do it if trying_type is False - since this is what the user asks for.
--> 197 out = cast_array_to_feature(out, type, allow_number_to_str=not self.trying_type)
198 return out
199 except (TypeError, pa.lib.ArrowInvalid) as e: # handle type errors and overflows
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/table.py:1683, in _wrap_for_chunked_arrays.<locals>.wrapper(array, *args, **kwargs)
1681 return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
1682 else:
-> 1683 return func(array, *args, **kwargs)
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/table.py:1853, in cast_array_to_feature(array, feature, allow_number_to_str)
1851 return array_cast(array, get_nested_type(feature), allow_number_to_str=allow_number_to_str)
1852 elif not isinstance(feature, (Sequence, dict, list, tuple)):
-> 1853 return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
1854 raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/table.py:1683, in _wrap_for_chunked_arrays.<locals>.wrapper(array, *args, **kwargs)
1681 return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
1682 else:
-> 1683 return func(array, *args, **kwargs)
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/table.py:1762, in array_cast(array, pa_type, allow_number_to_str)
1760 if pa.types.is_null(pa_type) and not pa.types.is_null(array.type):
1761 raise TypeError(f"Couldn't cast array of type {array.type} to {pa_type}")
-> 1762 return array.cast(pa_type)
1763 raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{pa_type}")
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/array.pxi:919, in pyarrow.lib.Array.cast()
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/compute.py:389, in cast(arr, target_type, safe, options)
387 else:
388 options = CastOptions.safe(target_type)
--> 389 return call_function("cast", [arr], options)
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/_compute.pyx:560, in pyarrow._compute.call_function()
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/_compute.pyx:355, in pyarrow._compute.Function.call()
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()
ArrowNotImplementedError: Unsupported cast from double to halffloat using function cast_half_float
Environment info
- `datasets` version: 2.4.0
- Platform: macOS-12.5.1-arm64-arm-64bit
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.4.4
Hi @dconathan, thanks for reporting.
We rely on Arrow as a backend, and as far as I know, support for `float16` in Arrow is currently not fully implemented in Python (C++), hence the `ArrowNotImplementedError` you get.
See, e.g.: https://arrow.apache.org/docs/status.html?highlight=float16#data-types
Thanks for the link... I didn't realize Arrow didn't support it yet. Should it be removed from https://huggingface.co/docs/datasets/v2.4.0/en/package_reference/main_classes#datasets.Value until Arrow supports it?
Yes, you are right: maybe we should either remove it from our docs or add a comment explaining the issue.
The thing is that in Arrow it is partially supported: you can create `float16` values, but you can't cast them from/to other types. And the current implementation of `Value` always tries to perform a cast from `float64` to `float16`.
Maybe we can just add a note in the `Value` documentation?
Would you accept a PR to fix this? @lhoestq Do you have an idea of how hard it would be to fix?
I think the issue comes mostly from `pyarrow` not supporting `float16` completely.
For example, you still can't cast from/to `float16`:
import numpy as np
import pyarrow as pa
pa.array(range(5)).cast(pa.float16())
# ArrowNotImplementedError: Unsupported cast from int64 to halffloat using function cast_half_float
pa.array(range(5), pa.float32()).cast(pa.float16())
# ArrowNotImplementedError: Unsupported cast from float to halffloat using function cast_half_float
pa.array(range(5), pa.float16())
# ArrowTypeError: Expected np.float16 instance
pa.array(np.arange(5, dtype=np.float16)).cast(pa.float32())
# ArrowNotImplementedError: Unsupported cast from halffloat to float using function cast_float
Hmm it seems like we can either:
- try to fix pyarrow upstream
- half-support float16 with some workaround to make sure we don't ever do casting internally