Slow iteration over Torch tensors
Describe the bug
I have a problem related to this issue: iteration with a Torch DataLoader is far slower after applying a ToTensor transform to the input than with the vanilla NumPy tensors. In particular, it takes ~5 seconds to iterate over the vanilla input and ~30 seconds after the transformation.
Steps to reproduce the bug
Here is the minimal code to reproduce the problem:
```python
import numpy as np
from datasets import load_dataset, Array3D
from torch.utils.data import DataLoader
from tqdm import tqdm
import torchvision
from torchvision.transforms import ToTensor, Normalize

#################################
# Without transform
#################################
train_dataset = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)
train_dataset.set_format(type="numpy", columns=["img", "fine_label"])
train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)
for batch in tqdm(train_loader, desc="Loading data, no transform"):
    pass

#################################
# With transform
#################################
transform_func = torchvision.transforms.Compose([
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"img": transform_func(x["img"])},
)
train_dataset.set_format(type="numpy", columns=["img", "fine_label"])
train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)
for batch in tqdm(train_loader, desc="Loading data after transform"):
    pass
```
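As a sketch of an alternative (same CIFAR-100 setup, `transform_func` as defined above; timings not measured here), the transform can also be applied on the fly with `set_transform`, which leaves the stored Arrow data as uint8 instead of materializing float32 images through `map`:

```python
# Fresh load so "img" is still decoded as PIL images (no prior map/format).
train_dataset = load_dataset('cifar100', split='train', use_auth_token=True)

def on_the_fly(batch):
    # Applied lazily per accessed batch; the Arrow data on disk stays uint8.
    batch["img"] = [transform_func(img) for img in batch["img"]]
    return batch

train_dataset.set_transform(on_the_fly)
train_loader = DataLoader(train_dataset, batch_size=100, shuffle=True, num_workers=8)
for batch in tqdm(train_loader, desc="Loading data, on-the-fly transform"):
    pass
```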
I have also tried converting the Image column to an Array3D:
```python
img_shape = train_dataset[0]["img"].shape
features = train_dataset.features.copy()
features["x"] = Array3D(shape=img_shape, dtype="float32")
train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"x": np.array(x["img"], dtype=np.uint8)},
    features=features,
)
train_dataset.cast_column("x", Array3D(shape=img_shape, dtype="float32"))
train_dataset.set_format(type="numpy", columns=["x", "fine_label"])
```
but to no avail. Any clue?
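For debugging, a quick sketch (not part of the original timings) to inspect what a formatted row of the mapped dataset actually holds; `ToTensor` converts the uint8 images to float32, so every stored image is four times its original size:

```python
# Assumes the mapped dataset from the "With transform" snippet above, with
# set_format(type="numpy", columns=["img", "fine_label"]) already applied.
img = train_dataset[0]["img"]
# type/dtype/nbytes show how much data each row decodes per iteration;
# float32 images are 4x the size of the original uint8 ones.
print(type(img), img.dtype, img.shape, img.nbytes)
```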
Expected behavior
The iteration should take approximately the same time with or without the transformation, since the transformation doesn't change the shape of the input. What may be the issue here?
Environment info
- `datasets` version: 2.12.0
- Platform: Linux-5.4.0-137-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.14.1
- PyArrow version: 12.0.0
- Pandas version: 2.0.1
I am highly interested in the performance of `datasets`, so I ran your example as a curious user.
`train_dataset.cast_column("x", Array3D(shape=img_shape, dtype="float32"))` has a return value (the original dataset is left unchanged), and `"x"` is a new column; it should be `ds = train_dataset.cast_column("img", Array3D(shape=(3, 32, 32), dtype="float32"))`.
I rewrote your example as:
```python
train_dataset = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)
transform_func = torchvision.transforms.Compose([
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"img": transform_func(x["img"])},
)
ds = train_dataset.cast_column("img", Array3D(shape=(3, 32, 32), dtype="float32"))
for i in tqdm(ds):
    pass
```
which takes ~11 s in my environment, while
```python
ds = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)
for i in tqdm(ds):
    pass
```
needs only ~6 s. (So I guess it's still undesirable.)
Perhaps related to https://github.com/huggingface/datasets/issues/6833
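One more thing that may be worth trying (an untested suggestion on my part): set an explicit output format on the cast dataset, so each row is materialized as a single array instead of nested Python lists:

```python
# Untested suggestion: request numpy output for the cast Array3D column so each
# "img" comes back as one ndarray rather than nested Python lists.
ds = ds.with_format("numpy", columns=["img", "fine_label"])
for i in tqdm(ds):
    pass
```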