`with_format("numpy")` silently downcasts float64 to float32 features
Describe the bug
When I create a dataset with a float64 feature and then apply NumPy formatting, the returned NumPy arrays are silently downcast to float32.
Steps to reproduce the bug
```python
import datasets

dataset = datasets.Dataset.from_dict({'a': [1.0, 2.0, 3.0]}).with_format("numpy")
print("feature dtype:", dataset.features['a'].dtype)
print("array dtype:", dataset['a'].dtype)
```
Output:
```
feature dtype: float64
array dtype: float32
```
Expected behavior
```
feature dtype: float64
array dtype: float64
```
Environment info
- `datasets` version: 2.8.0
- Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyArrow version: 10.0.1
- Pandas version: 1.4.4
Suggested fix
Changing the `_tensorize` function of the NumPy formatter to
```python
def _tensorize(self, value):
    if isinstance(value, (str, bytes, type(None))):
        return value
    elif isinstance(value, (np.character, np.ndarray)) and np.issubdtype(value.dtype, np.character):
        return value
    elif isinstance(value, np.number):
        return value
    return np.asarray(value, **self.np_array_kwargs)
```
fixes this particular issue for me. I'm not sure whether it would break other tests, but it should also avoid unnecessary copying of the array.
Hi! This behavior stems from these lines:
https://github.com/huggingface/datasets/blob/b065547654efa0ec633cf373ac1512884c68b2e1/src/datasets/formatting/np_formatter.py#L45-L46
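For context, the logic there looks roughly like this (a simplified sketch of `NumpyFormatter._tensorize`, not the exact source):

```python
import numpy as np

def _tensorize_sketch(self, value):
    # Apply a "default dtype" before converting to a NumPy array
    default_dtype = {}
    if isinstance(value, np.ndarray) and np.issubdtype(value.dtype, np.integer):
        default_dtype = {"dtype": np.int64}    # integers are upcast to int64
    elif isinstance(value, np.ndarray) and np.issubdtype(value.dtype, np.floating):
        default_dtype = {"dtype": np.float32}  # floats are downcast to float32
    # user-supplied kwargs (e.g. dtype=) take precedence over the default
    return np.asarray(value, **{**default_dtype, **self.np_array_kwargs})
```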
I agree we should preserve the original type whenever possible and downcast explicitly with a warning.
@lhoestq Do you remember why we need this "default dtype" logic in our formatters?
I was also wondering why the default dtype logic is needed. Simply deleting it is probably too naive a solution.
Hmm, I think the idea was to end up with the usual default precision for deep learning models, no matter how the data was stored or where it came from.
For example, in NLP we store tokens using an optimized low-precision integer type to save disk space, but when we set the format to torch we actually need int64. The need for a default integer dtype also comes from NumPy, whose default integer precision depends on your machine. I guess we added a default for floats as well for consistency.
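A minimal illustration of that (hypothetical token column stored as int32, showing the current behavior):

```python
import numpy as np
import datasets

# token ids stored as int32 on disk to save space
d = datasets.Dataset.from_dict({"ids": np.array([1, 2, 3], dtype=np.int32)})
d = d.with_format("torch")
print(d["ids"].dtype)  # torch.int64 under the current default-dtype behavior
```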
I'm a bit embarrassed by this though; as a user, I'd indeed have expected to get the same precision back, and a zero-copy view.
Will you fix this or should I open a PR?
Unfortunately, removing it for integers would be a breaking change for most transformers + datasets users in NLP (a very common case). Removing it for floats would likewise be a breaking change for transformers + datasets users doing ASR, as well as for any other users relying on this behavior.
Therefore I think the only short-term solution is for the user to provide `dtype=` manually, and for us to document this behavior better. We could also extend `dtype` to accept a value meaning "return the same dtype as the underlying storage" and make zero-copy easier.
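For the original example, something like this should already work to override the default, since format kwargs are forwarded to `np.asarray` by the NumPy formatter:

```python
import numpy as np

# An explicit dtype takes precedence over the formatter's float32 default
dataset = dataset.with_format("numpy", dtype=np.float64)
print(dataset["a"].dtype)  # float64
```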
@lhoestq It should be fine to remove this conversion in Datasets 3.0, no? In the meantime, we can warn the user (with a log message) about the upcoming change to the default type.
Let's check with the transformers team whether that sounds reasonable? We'd have to fix multiple example scripts though.
If it's not ok we can also explore keeping this behavior only for tokens and audio data.
IMO being coupled with Transformers can lead to unexpected behavior when one uses our lib without pairing it with Transformers, so I think it's still important to "fix" this, even if it means we'll need to update Transformers' example scripts afterward.
Ideally let's update the transformers example scripts before the change :P
For others who run into the same issue, a temporary workaround is this:

```python
import numpy as np

def numpy_transform(batch):
    # np.asarray preserves the storage dtype instead of applying the formatter's default
    return {key: np.asarray(val) for key, val in batch.items()}

dataset = dataset.with_transform(numpy_transform)
```
This behavior (silent upcast from int32 to int64) is also unexpected for the user in https://discuss.huggingface.co/t/standard-getitem-returns-wrong-data-type-for-arrays/62470/2
Hi, I stumbled on a variation that upcasts uint8 to int64. I would expect the dtype to be the same as it was when I generated the dataset.
```python
import numpy as np
import datasets as ds

foo = np.random.randint(0, 256, size=(5, 10, 10), dtype=np.uint8)
features = ds.Features({"foo": ds.Array2D((10, 10), "uint8")})
dataset = ds.Dataset.from_dict({"foo": foo}, features=features)
dataset.set_format("torch")

print("feature dtype:", dataset.features["foo"].dtype)
print("array dtype:", dataset["foo"].dtype)
# feature dtype: uint8
# array dtype: torch.int64
```
A workaround to remove the torch upcasting:

```python
import datasets as ds
import torch

class FixedTorchFormatter(ds.formatting.TorchFormatter):
    def _tensorize(self, value):
        # Wrap the numpy array directly, preserving its dtype (zero-copy)
        return torch.from_numpy(value)

# Note: _register_formatter is a private helper, so it may change between releases
ds.formatting._register_formatter(FixedTorchFormatter, "torch")
```
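With the fixed formatter registered, re-applying the torch format on the example above should return tensors in the storage dtype:

```python
dataset.set_format("torch")
print(dataset["foo"].dtype)  # torch.uint8 with the fixed formatter
```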