
`with_format("numpy")` silently downcasts float64 features to float32

Open ernestum opened this issue 2 years ago • 13 comments

Describe the bug

When I create a dataset with a float64 feature and then apply numpy formatting, the returned numpy arrays are silently downcast to float32.

Steps to reproduce the bug

import datasets
dataset = datasets.Dataset.from_dict({'a': [1.0, 2.0, 3.0]}).with_format("numpy")
print("feature dtype:", dataset.features['a'].dtype)
print("array dtype:", dataset['a'].dtype)

output:

feature dtype: float64
array dtype: float32

Expected behavior

feature dtype: float64
array dtype: float64

Environment info

  • datasets version: 2.8.0
  • Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • PyArrow version: 10.0.1
  • Pandas version: 1.4.4

Suggested Fix

Changing the _tensorize function of the numpy formatter to

    def _tensorize(self, value):
        if isinstance(value, (str, bytes, type(None))):
            return value
        elif isinstance(value, (np.character, np.ndarray)) and np.issubdtype(value.dtype, np.character):
            return value
        elif isinstance(value, np.number):
            return value

        # No default dtype: np.asarray keeps the storage dtype and avoids a copy when possible.
        return np.asarray(value, **self.np_array_kwargs)

fixes this particular issue for me. I'm not sure whether it would break other tests. It should also avoid unnecessary copying of the array.

ernestum avatar Feb 09 '23 14:02 ernestum

Hi! This behavior stems from these lines:

https://github.com/huggingface/datasets/blob/b065547654efa0ec633cf373ac1512884c68b2e1/src/datasets/formatting/np_formatter.py#L45-L46
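
For reference, the effect of those lines is roughly the following (a paraphrased sketch, not the verbatim source; `_default_numpy_dtype` is just an illustrative name):

import numpy as np

def _default_numpy_dtype(value):
    # Paraphrased: integer arrays get a default of int64 and floating-point
    # arrays get float32 before the numpy conversion, which is where the
    # silent float64 -> float32 downcast happens.
    if np.issubdtype(value.dtype, np.integer):
        return {"dtype": np.int64}
    elif np.issubdtype(value.dtype, np.floating):
        return {"dtype": np.float32}
    return {}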

I agree we should preserve the original type whenever possible and downcast explicitly with a warning.

@lhoestq Do you remember why we need this "default dtype" logic in our formatters?

mariosasko avatar Feb 09 '23 15:02 mariosasko

I was also wondering why the default dtype logic is needed. Me just deleting it is probably too naive a solution.

ernestum avatar Feb 09 '23 15:02 ernestum

Hmm I think the idea was to end up with the usual default precision for deep learning models - no matter how the data was stored or where it comes from.

For example, in NLP we store tokens at an optimized low precision to save disk space, but when we set the format to torch we actually need to get int64. The need for a default integer type also comes from numpy, whose default integer precision depends on your machine. Finally, I guess we added a default for floats as well for consistency.
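
Concretely, that is why a column stored as int32 comes back as int64 once the torch format is set; a minimal sketch to illustrate (the column name and values are arbitrary):

import datasets

tokens = datasets.Dataset.from_dict(
    {"input_ids": [[101, 2023, 102], [101, 2003, 102]]},
    features=datasets.Features({"input_ids": datasets.Sequence(datasets.Value("int32"))}),
).with_format("torch")

print(tokens.features["input_ids"].feature.dtype)  # int32 (storage dtype)
print(tokens[0]["input_ids"].dtype)                # torch.int64 (formatter default)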

I'm a bit embarrassed by this though; as a user I would also have expected to get the same precision and a zero-copy view.

lhoestq avatar Feb 09 '23 15:02 lhoestq

Will you fix this or should I open a PR?

ernestum avatar Feb 09 '23 16:02 ernestum

Unfortunately, removing it for integers is a breaking change for most transformers + datasets users in NLP (which is a common case). Removing it for floats is likewise a breaking change for transformers + datasets users in ASR. And it is also a breaking change for any other users relying on this behavior.

Therefore I think the only short-term solution is for the user to provide dtype= manually, and for us to document this behavior better. We could also extend dtype to accept a value meaning "return the same dtype as the underlying storage" to make zero-copy easier.
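
For instance, passing the dtype through the format keyword arguments already restores float64 on the example above, assuming with_format forwards them to the numpy conversion as its docs describe (a sketch of that short-term workaround, not a guaranteed contract):

import datasets
import numpy as np

dataset = datasets.Dataset.from_dict({"a": [1.0, 2.0, 3.0]}).with_format("numpy", dtype=np.float64)
print(dataset["a"].dtype)  # float64: the explicit dtype overrides the float32 default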

lhoestq avatar Feb 09 '23 16:02 lhoestq

@lhoestq It should be fine to remove this conversion in Datasets 3.0, no? For now, we can warn the user (with a log message) about the upcoming change whenever the default dtype conversion is applied.

mariosasko avatar Feb 09 '23 16:02 mariosasko

Let's see with the transformers team if it sounds reasonable? We'd have to fix multiple example scripts though.

If it's not ok we can also explore keeping this behavior only for tokens and audio data.

lhoestq avatar Feb 10 '23 14:02 lhoestq

IMO being coupled with Transformers can lead to unexpected behavior when one tries to use our lib without pairing it with Transformers, so I think it's still important to "fix" this, even if it means we will need to update Transformers' example scripts afterward.

mariosasko avatar Feb 13 '23 16:02 mariosasko

Ideally let's update the transformers example scripts before the change :P

lhoestq avatar Feb 13 '23 18:02 lhoestq

For others who run into the same issue, a temporary workaround for me is this:

import numpy as np

def numpy_transform(batch):
    # Convert every column of the batch to a numpy array, letting numpy infer the dtype.
    return {key: np.asarray(val) for key, val in batch.items()}

dataset = dataset.with_transform(numpy_transform)
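
Applied to the reproduction at the top of the thread, this should give back the storage precision, since np.asarray infers the dtype from the decoded Python values instead of applying the formatter's float32 default (a quick self-contained check; the trade-off is that the conversion is not zero-copy):

import numpy as np
import datasets

def numpy_transform(batch):
    return {key: np.asarray(val) for key, val in batch.items()}

dataset = datasets.Dataset.from_dict({"a": [1.0, 2.0, 3.0]}).with_transform(numpy_transform)
print(dataset["a"].dtype)  # float64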

ernestum avatar Feb 14 '23 15:02 ernestum

This behavior (silent upcast from int32 to int64) is also unexpected for the user in https://discuss.huggingface.co/t/standard-getitem-returns-wrong-data-type-for-arrays/62470/2

mariosasko avatar Nov 16 '23 15:11 mariosasko

Hi, I stumbled on a variation that upcasts uint8 to int64. I would expect the dtype to be the same as it was when I generated the dataset.

import numpy as np
import datasets as ds

foo = np.random.randint(0, 256, size=(5, 10, 10), dtype=np.uint8)

features = ds.Features({"foo": ds.Array2D((10, 10), "uint8")})
dataset = ds.Dataset.from_dict({"foo": foo}, features=features)
dataset.set_format("torch")
print("feature dtype:", dataset.features["foo"].dtype)
print("array dtype:", dataset["foo"].dtype)

# feature dtype: uint8
# array dtype: torch.int64

jarnoseppanen-sc avatar Jan 18 '24 08:01 jarnoseppanen-sc

A workaround to remove the torch upcasting:

import datasets as ds
import torch

class FixedTorchFormatter(ds.formatting.TorchFormatter):
    def _tensorize(self, value):
        # Keep the numpy array's dtype instead of applying the int64/float32 defaults.
        return torch.from_numpy(value)


ds.formatting._register_formatter(FixedTorchFormatter, "torch")
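
Alternatively, if set_format forwards its keyword arguments to torch.tensor as suggested earlier in the thread, forcing the dtype per format avoids the upcast without patching the formatter (a sketch, output as expected rather than verified; note that the forced dtype then applies to every formatted column):

import numpy as np
import datasets as ds
import torch

foo = np.random.randint(0, 256, size=(5, 10, 10), dtype=np.uint8)
dataset = ds.Dataset.from_dict(
    {"foo": foo}, features=ds.Features({"foo": ds.Array2D((10, 10), "uint8")})
)
dataset.set_format("torch", dtype=torch.uint8)  # forwarded to torch.tensor, overriding the int64 default
print(dataset["foo"].dtype)  # expected: torch.uint8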

jarnoseppanen-sc avatar Jan 18 '24 08:01 jarnoseppanen-sc