`with_format("numpy")` silently downcasts float64 to float32 features
Describe the bug
When I create a dataset with a float64 feature and then apply NumPy formatting, the returned NumPy arrays are silently downcast to float32.
Steps to reproduce the bug
```python
import datasets

dataset = datasets.Dataset.from_dict({'a': [1.0, 2.0, 3.0]}).with_format("numpy")
print("feature dtype:", dataset.features['a'].dtype)
print("array dtype:", dataset['a'].dtype)
```
Output:
```
feature dtype: float64
array dtype: float32
```
Expected behavior
```
feature dtype: float64
array dtype: float64
```
Environment info
- `datasets` version: 2.8.0
- Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyArrow version: 10.0.1
- Pandas version: 1.4.4
Suggested fix
Changing the `_tensorize` function of the NumPy formatter to
```python
def _tensorize(self, value):
    if isinstance(value, (str, bytes, type(None))):
        return value
    elif isinstance(value, (np.character, np.ndarray)) and np.issubdtype(value.dtype, np.character):
        return value
    elif isinstance(value, np.number):
        return value
    return np.asarray(value, **self.np_array_kwargs)
```
fixes this particular issue for me. I'm not sure whether it would break other tests, but it should also avoid unnecessary copying of the array.
Hi! This behavior stems from these lines:
https://github.com/huggingface/datasets/blob/b065547654efa0ec633cf373ac1512884c68b2e1/src/datasets/formatting/np_formatter.py#L45-L46
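For context, the logic there looks roughly like this (a simplified sketch of `NumpyFormatter._tensorize`, not the exact source):

```python
import numpy as np

def _tensorize_sketch(self, value):
    # Apply a "default dtype" before converting to a NumPy array
    default_dtype = {}
    if isinstance(value, np.ndarray) and np.issubdtype(value.dtype, np.integer):
        default_dtype = {"dtype": np.int64}    # integers are upcast to int64
    elif isinstance(value, np.ndarray) and np.issubdtype(value.dtype, np.floating):
        default_dtype = {"dtype": np.float32}  # floats are downcast to float32
    # user-supplied kwargs (e.g. dtype=) take precedence over the default
    return np.asarray(value, **{**default_dtype, **self.np_array_kwargs})
```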
I agree we should preserve the original type whenever possible and downcast explicitly with a warning.
@lhoestq Do you remember why we need this "default dtype" logic in our formatters?
I was also wondering why the default dtype logic is needed. Simply deleting it is probably too naive a solution.
Hmm, I think the idea was to end up with the usual default precision for deep learning models, no matter how the data was stored or where it came from.
For example, in NLP we store tokens using an optimized low-precision integer type to save disk space, but when we set the format to torch we actually need int64. The need for a default integer dtype also comes from NumPy, whose default integer precision depends on your machine. I guess we added a default for floats as well for consistency.
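A minimal illustration of that (hypothetical token column stored as int32, showing the current behavior):

```python
import numpy as np
import datasets

# token ids stored as int32 on disk to save space
d = datasets.Dataset.from_dict({"ids": np.array([1, 2, 3], dtype=np.int32)})
d = d.with_format("torch")
print(d["ids"].dtype)  # torch.int64 under the current default-dtype behavior
```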
I'm a bit embarrassed by this though; as a user, I'd indeed have expected to get the same precision back, and a zero-copy view.
Will you fix this or should I open a PR?
Unfortunately, removing it for integers would be a breaking change for most transformers + datasets users in NLP (a very common case). Removing it for floats would likewise be a breaking change for transformers + datasets users doing ASR, as well as for any other users relying on this behavior.
Therefore I think the only short-term solution is for the user to provide `dtype=` manually, and for us to document this behavior better. We could also extend `dtype` to accept a value meaning "return the same dtype as the underlying storage" and make zero-copy easier.
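For the original example, something like this should already work to override the default, since format kwargs are forwarded to `np.asarray` by the NumPy formatter:

```python
import numpy as np

# An explicit dtype takes precedence over the formatter's float32 default
dataset = dataset.with_format("numpy", dtype=np.float64)
print(dataset["a"].dtype)  # float64
```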
@lhoestq It should be fine to remove this conversion in Datasets 3.0, no? In the meantime, we can warn the user (with a log message) about the upcoming change to the default type.
Let's check with the transformers team whether that sounds reasonable? We'd have to fix multiple example scripts though.
If it's not ok we can also explore keeping this behavior only for tokens and audio data.
IMO being coupled with Transformers can lead to unexpected behavior when one uses our lib without pairing it with Transformers, so I think it's still important to "fix" this, even if it means we'll need to update Transformers' example scripts afterward.
Ideally let's update the transformers example scripts before the change :P
For others who run into the same issue, a temporary workaround is this:

```python
import numpy as np

def numpy_transform(batch):
    # np.asarray preserves the storage dtype instead of applying the formatter's default
    return {key: np.asarray(val) for key, val in batch.items()}

dataset = dataset.with_transform(numpy_transform)
```
This behavior (silent upcast from int32 to int64) is also unexpected for the user in https://discuss.huggingface.co/t/standard-getitem-returns-wrong-data-type-for-arrays/62470/2
Hi, I stumbled on a variation that upcasts uint8 to int64. I would expect the dtype to be the same as it was when I generated the dataset.
```python
import numpy as np
import datasets as ds

foo = np.random.randint(0, 256, size=(5, 10, 10), dtype=np.uint8)
features = ds.Features({"foo": ds.Array2D((10, 10), "uint8")})
dataset = ds.Dataset.from_dict({"foo": foo}, features=features)
dataset.set_format("torch")

print("feature dtype:", dataset.features["foo"].dtype)
print("array dtype:", dataset["foo"].dtype)
# feature dtype: uint8
# array dtype: torch.int64
```
A workaround to remove the torch upcasting:

```python
import datasets as ds
import torch

class FixedTorchFormatter(ds.formatting.TorchFormatter):
    def _tensorize(self, value):
        # Wrap the numpy array directly, preserving its dtype (zero-copy)
        return torch.from_numpy(value)

# Note: _register_formatter is a private helper, so it may change between releases
ds.formatting._register_formatter(FixedTorchFormatter, "torch")
```
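With the fixed formatter registered, re-applying the torch format on the example above should return tensors in the storage dtype:

```python
dataset.set_format("torch")
print(dataset["foo"].dtype)  # torch.uint8 with the fixed formatter
```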