Slow iteration over Torch tensors
Describe the bug
I have a problem related to this issue: iteration with a Torch DataLoader is far slower after applying a ToTensor transform to the input than with the vanilla NumPy tensors. In particular, it takes ~5 seconds to iterate over the vanilla input and ~30 seconds after the transformation.
Steps to reproduce the bug
Here is the minimal code to reproduce the problem:
```python
import numpy as np
from datasets import load_dataset, Array3D
from torch.utils.data import DataLoader
from tqdm import tqdm
import torchvision
from torchvision.transforms import ToTensor, Normalize

#################################
# Without transform
#################################
train_dataset = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)
train_dataset.set_format(type="numpy", columns=["img", "fine_label"])
train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)
for batch in tqdm(train_loader, desc="Loading data, no transform"):
    pass

#################################
# With transform
#################################
transform_func = torchvision.transforms.Compose([
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"img": transform_func(x["img"])},
)
train_dataset.set_format(type="numpy", columns=["img", "fine_label"])
train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)
for batch in tqdm(train_loader, desc="Loading data after transform"):
    pass
```
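As a sketch of an alternative (same CIFAR-100 setup, `transform_func` as defined above; timings not measured here), the transform can also be applied on the fly with `set_transform`, which leaves the stored Arrow data as uint8 instead of materializing float32 images through `map`:

```python
# Fresh load so "img" is still decoded as PIL images (no prior map/format).
train_dataset = load_dataset('cifar100', split='train', use_auth_token=True)

def on_the_fly(batch):
    # Applied lazily per accessed batch; the Arrow data on disk stays uint8.
    batch["img"] = [transform_func(img) for img in batch["img"]]
    return batch

train_dataset.set_transform(on_the_fly)
train_loader = DataLoader(train_dataset, batch_size=100, shuffle=True, num_workers=8)
for batch in tqdm(train_loader, desc="Loading data, on-the-fly transform"):
    pass
```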
I have also tried converting the Image column to an Array3D:
```python
img_shape = train_dataset[0]["img"].shape
features = train_dataset.features.copy()
features["x"] = Array3D(shape=img_shape, dtype="float32")
train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"x": np.array(x["img"], dtype=np.uint8)},
    features=features,
)
train_dataset.cast_column("x", Array3D(shape=img_shape, dtype="float32"))
train_dataset.set_format(type="numpy", columns=["x", "fine_label"])
```
but to no avail. Any clue?
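For debugging, a quick sketch (not part of the original timings) to inspect what a formatted row of the mapped dataset actually holds; `ToTensor` converts the uint8 images to float32, so every stored image is four times its original size:

```python
# Assumes the mapped dataset from the "With transform" snippet above, with
# set_format(type="numpy", columns=["img", "fine_label"]) already applied.
img = train_dataset[0]["img"]
# type/dtype/nbytes show how much data each row decodes per iteration;
# float32 images are 4x the size of the original uint8 ones.
print(type(img), img.dtype, img.shape, img.nbytes)
```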
Expected behavior
The iteration should take approximately the same time with or without the transformation, since the transformation doesn't change the shape of the input. What may be the issue here?
Environment info
- `datasets` version: 2.12.0
- Platform: Linux-5.4.0-137-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.14.1
- PyArrow version: 12.0.0
- Pandas version: 2.0.1
I am highly interested in the performance of `datasets`, so I ran your example as a curious user.
`train_dataset.cast_column("x", Array3D(shape=img_shape, dtype="float32"))` has a return value (the original dataset is left unchanged), and `"x"` is a new column; it should be `ds = train_dataset.cast_column("img", Array3D(shape=(3, 32, 32), dtype="float32"))`.
I rewrote your example as:
```python
train_dataset = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)
transform_func = torchvision.transforms.Compose([
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"img": transform_func(x["img"])},
)
ds = train_dataset.cast_column("img", Array3D(shape=(3, 32, 32), dtype="float32"))
for i in tqdm(ds):
    pass
```
which takes ~11 s in my environment, while
```python
ds = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)
for i in tqdm(ds):
    pass
```
needs only ~6 s. (So I guess it's still undesirable.)
Perhaps related to https://github.com/huggingface/datasets/issues/6833
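One more thing that may be worth trying (an untested suggestion on my part): set an explicit output format on the cast dataset, so each row is materialized as a single array instead of nested Python lists:

```python
# Untested suggestion: request numpy output for the cast Array3D column so each
# "img" comes back as one ndarray rather than nested Python lists.
ds = ds.with_format("numpy", columns=["img", "fine_label"])
for i in tqdm(ds):
    pass
```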