
Can't create a dataset with `float16` features

dconathan opened this issue

Describe the bug

I can't create a dataset with float16 features.

I understand from the traceback that this is a pyarrow error, but I don't see anything in the datasets documentation about how to do this successfully. Is it actually supported? I've tried older versions of pyarrow as well and get the exact same error.

The bug seems to arise from datasets casting the values to double, after which pyarrow doesn't know how to convert them back to float16... does that sound right? Is there a way to bypass the cast, since it isn't necessary in the numpy and torch cases?
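
For reference, constructing a halffloat array directly from float16 data does seem to work at the pyarrow level, which is why I suspect only the intermediate cast is the problem (just a quick sanity check on my end, not using any datasets internals):

import numpy as np
import pyarrow as pa

# Building a halffloat array from an existing float16 buffer works fine,
# so the failure appears to come only from the double -> halffloat cast.
arr = pa.array(np.arange(3, dtype=np.float16))
print(arr.type)  # halffloat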

Thanks!

Steps to reproduce the bug

All of the following raise the same error with (as far as I can tell) an identical traceback:

ArrowNotImplementedError: Unsupported cast from double to halffloat using function cast_half_float
from datasets import Dataset, Features, Value
Dataset.from_dict({"x": [0.0, 1.0, 2.0]}, features=Features(x=Value("float16")))

import numpy as np
Dataset.from_dict({"x": np.arange(3, dtype=np.float16)}, features=Features(x=Value("float16")))

import torch
Dataset.from_dict({"x": torch.arange(3).to(torch.float16)}, features=Features(x=Value("float16")))

Expected results

A dataset with float16 features is successfully created.

Actual results

---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
Cell In [14], line 1
----> 1 Dataset.from_dict({"x": [1.0, 2.0, 3.0]}, features=Features(x=Value("float16")))

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/arrow_dataset.py:870, in Dataset.from_dict(cls, mapping, features, info, split)
    865     mapping = features.encode_batch(mapping)
    866 mapping = {
    867     col: OptimizedTypedSequence(data, type=features[col] if features is not None else None, col=col)
    868     for col, data in mapping.items()
    869 }
--> 870 pa_table = InMemoryTable.from_pydict(mapping=mapping)
    871 if info.features is None:
    872     info.features = Features({col: ts.get_inferred_type() for col, ts in mapping.items()})

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/table.py:750, in InMemoryTable.from_pydict(cls, *args, **kwargs)
    734 @classmethod
    735 def from_pydict(cls, *args, **kwargs):
    736     """
    737     Construct a Table from Arrow arrays or columns
    738 
   (...)
    748         :class:`datasets.table.Table`:
    749     """
--> 750     return cls(pa.Table.from_pydict(*args, **kwargs))

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/table.pxi:3648, in pyarrow.lib.Table.from_pydict()

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/table.pxi:5174, in pyarrow.lib._from_pydict()

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/array.pxi:343, in pyarrow.lib.asarray()

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/array.pxi:231, in pyarrow.lib.array()

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/array.pxi:110, in pyarrow.lib._handle_arrow_array_protocol()

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/arrow_writer.py:197, in TypedSequence.__arrow_array__(self, type)
    192     # otherwise we can finally use the user's type
    193     elif type is not None:
    194         # We use cast_array_to_feature to support casting to custom types like Audio and Image
    195         # Also, when trying type "string", we don't want to convert integers or floats to "string".
    196         # We only do it if trying_type is False - since this is what the user asks for.
--> 197         out = cast_array_to_feature(out, type, allow_number_to_str=not self.trying_type)
    198     return out
    199 except (TypeError, pa.lib.ArrowInvalid) as e:  # handle type errors and overflows

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/table.py:1683, in _wrap_for_chunked_arrays.<locals>.wrapper(array, *args, **kwargs)
   1681     return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
   1682 else:
-> 1683     return func(array, *args, **kwargs)

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/table.py:1853, in cast_array_to_feature(array, feature, allow_number_to_str)
   1851     return array_cast(array, get_nested_type(feature), allow_number_to_str=allow_number_to_str)
   1852 elif not isinstance(feature, (Sequence, dict, list, tuple)):
-> 1853     return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
   1854 raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/table.py:1683, in _wrap_for_chunked_arrays.<locals>.wrapper(array, *args, **kwargs)
   1681     return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
   1682 else:
-> 1683     return func(array, *args, **kwargs)

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/datasets/table.py:1762, in array_cast(array, pa_type, allow_number_to_str)
   1760     if pa.types.is_null(pa_type) and not pa.types.is_null(array.type):
   1761         raise TypeError(f"Couldn't cast array of type {array.type} to {pa_type}")
-> 1762     return array.cast(pa_type)
   1763 raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{pa_type}")

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/array.pxi:919, in pyarrow.lib.Array.cast()

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/compute.py:389, in cast(arr, target_type, safe, options)
    387     else:
    388         options = CastOptions.safe(target_type)
--> 389 return call_function("cast", [arr], options)

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/_compute.pyx:560, in pyarrow._compute.call_function()

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/_compute.pyx:355, in pyarrow._compute.Function.call()

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/scratch/scratch-env-39/.venv/lib/python3.9/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()

ArrowNotImplementedError: Unsupported cast from double to halffloat using function cast_half_float

Environment info

  • datasets version: 2.4.0
  • Platform: macOS-12.5.1-arm64-arm-64bit
  • Python version: 3.9.13
  • PyArrow version: 9.0.0
  • Pandas version: 1.4.4

dconathan · Sep 15 '22

Hi @dconathan, thanks for reporting.

We rely on Arrow as a backend, and as far as I know, float16 support in Arrow is not yet fully implemented (neither in the C++ library nor in its Python bindings), hence the ArrowNotImplementedError you get.

See, e.g.: https://arrow.apache.org/docs/status.html?highlight=float16#data-types

albertvillanova · Sep 16 '22

Thanks for the link… I didn't realize Arrow didn't support it yet. Should it be removed from https://huggingface.co/docs/datasets/v2.4.0/en/package_reference/main_classes#datasets.Value until Arrow supports it?

dconathan · Sep 16 '22

Yes, you are right: maybe we should either remove it from our docs or add a comment explaining the issue.

The thing is that in Arrow it is partially supported: you can create float16 values, but you can't cast them from/to other types. And the current implementation of Value always tries to perform a cast from float64 to float16.
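
To make that concrete, here is the same failure reproduced at the pyarrow level, outside of datasets (a minimal illustration only):

import pyarrow as pa

# Plain Python floats are inferred as double...
arr = pa.array([0.0, 1.0, 2.0])
print(arr.type)  # double

# ...and double -> halffloat is exactly the cast pyarrow has not implemented:
arr.cast(pa.float16())
# ArrowNotImplementedError: Unsupported cast from double to halffloat using function cast_half_float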

albertvillanova · Sep 16 '22

Maybe we can just add a note in the Value documentation?

lhoestq · Sep 26 '22

Would you accept a PR to fix this? @lhoestq, do you have an idea of how hard it would be to fix?

norabelrose · Feb 24 '23

I think the issue comes mostly from pyarrow not supporting float16 completely.

For example, you still can't cast from/to float16:

import numpy as np
import pyarrow as pa

pa.array(range(5)).cast(pa.float16())
# ArrowNotImplementedError: Unsupported cast from int64 to halffloat using function cast_half_float
pa.array(range(5), pa.float32()).cast(pa.float16())
# ArrowNotImplementedError: Unsupported cast from float to halffloat using function cast_half_float
pa.array(range(5), pa.float16())
# ArrowTypeError: Expected np.float16 instance
pa.array(np.arange(5, dtype=np.float16)).cast(pa.float32())
# ArrowNotImplementedError: Unsupported cast from halffloat to float using function cast_float

lhoestq · Feb 28 '23

Hmm, it seems like we can either:

  1. try to fix pyarrow upstream
  2. half-support float16 with some workaround to make sure we never do casting internally (rough sketch below)
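
For option 2, a rough sketch of the kind of workaround I have in mind (hypothetical helper, not the actual datasets code): coerce the incoming values to a numpy float16 array up front, so pyarrow infers halffloat on construction and no cast is ever attempted.

import numpy as np
import pyarrow as pa

def as_halffloat_array(values):
    # Hypothetical workaround: route the values through a numpy float16
    # array so pyarrow builds a halffloat array directly, instead of
    # casting from double (the unimplemented path).
    return pa.array(np.asarray(values, dtype=np.float16))

print(as_halffloat_array([0.0, 1.0, 2.0]).type)  # halffloat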

norabelrose · Mar 22 '23