
DataLoader multiprocessing fails when dataset stores a module object (TypeError: cannot pickle 'module' object)

Open nehdiii opened this issue 4 months ago • 2 comments

🐛 Describe the bug

When a custom Dataset stores a Python module object (e.g. self.h5py = h5py) as an attribute, DataLoader with num_workers > 0 fails with TypeError: cannot pickle 'module' object.

This happens because DataLoader uses multiprocessing for its workers, which requires the dataset to be pickled so it can be sent to the worker processes (in particular with the spawn start method). Module objects are not picklable, and the resulting error message is cryptic and does not point back to the dataset attribute that caused it.

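The failure can be reproduced without DataLoader at all, since it comes from pickle itself. A minimal sketch (any module shows the same behavior):

import pickle
import h5py  # any module object behaves the same way

# raises TypeError: cannot pickle 'module' object
pickle.dumps(h5py)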

To Reproduce

Code


import torch
from torch.utils.data import Dataset, DataLoader
import h5py  # any module works, h5py is just an example
import os



class MyDataset(Dataset):
    def __init__(self):
        # ⚠ storing a module object on self
        self.h5py = h5py

    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return idx



def distribute_loader(loader):
    return torch.utils.data.DataLoader(
        loader.dataset,
        batch_size=loader.batch_size // torch.distributed.get_world_size(),
        sampler=torch.utils.data.distributed.DistributedSampler(
            loader.dataset,
            num_replicas=torch.distributed.get_world_size(),
            rank=torch.distributed.get_rank(),
        ),
        num_workers=loader.num_workers,
        pin_memory=loader.pin_memory,
    )

def main(rank):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = str(12355)

    # initialize the process group
    torch.distributed.init_process_group(
        "nccl",
        rank=rank,
        world_size=4,
    )
    torch.cuda.set_device(rank)
    torch.distributed.barrier()

    ds = MyDataset()
    loader = DataLoader(ds, batch_size=4, num_workers=2)  # num_workers > 0 triggers the bug
    ddp_loader = distribute_loader(loader)
    for i, batch in enumerate(ddp_loader):
        print(batch)

    torch.distributed.destroy_process_group()

if __name__ == "__main__":
    torch.multiprocessing.spawn(main, nprocs=4)

Error

Traceback (most recent call last):
  File "test_pickle.py", line 19, in <module>
    for batch in loader:
  File ".../torch/utils/data/dataloader.py", line ...
TypeError: cannot pickle 'module' object

Expected behavior

Either:

  1. Provide a clearer error message (e.g. "Dataset objects must not store module objects; they cannot be pickled for multiprocessing"), or
  2. Allow safe serialization by ignoring unpicklable module attributes (a dataset-side sketch of this idea is shown below).
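
For option 2, a dataset can already opt into this behavior itself by customizing how it is pickled. A minimal sketch, assuming the module can simply be re-imported on the worker side (illustrative only, not an existing DataLoader feature):

import h5py
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self):
        self.h5py = h5py  # module attribute that normally breaks pickling

    def __getstate__(self):
        # drop the unpicklable module object from the pickled state
        state = self.__dict__.copy()
        state.pop("h5py", None)
        return state

    def __setstate__(self, state):
        # restore the attribute by re-importing the module in the worker process
        self.__dict__.update(state)
        import h5py
        self.h5py = h5py

    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return idx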

Additional context

Workaround: do not store modules on self. For example:

# instead of storing the module on self in __init__:
self.h5py = h5py

# import h5py inside __getitem__ (or whichever method actually uses it):
import h5py

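Applied to a hypothetical HDF5-backed dataset (the file_path attribute and the file access below are illustrative, not from the original repro), the workaround looks like this: only picklable state is stored on self, and the module is imported where it is used.

from torch.utils.data import Dataset


class H5Dataset(Dataset):
    def __init__(self, file_path, length=10):
        # store only picklable state (a plain string path), not the module
        self.file_path = file_path
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # import the module where it is used instead of keeping it on self,
        # so the dataset stays picklable for DataLoader worker processes
        import h5py
        with h5py.File(self.file_path, "r") as f:
            keys = list(f.keys())  # illustrative access; adapt to real data
        return idx
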
Note that the repro does not actually depend on DDP or h5py: storing any module on self (e.g. self.os = __import__("os")) triggers the same error. A shorter, self-contained version is sketched below.
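
A minimal sketch of such a version (names are hypothetical; multiprocessing_context="spawn" is passed here to force the dataset to be pickled regardless of the platform's default start method):

import os

from torch.utils.data import Dataset, DataLoader


class ModuleHoldingDataset(Dataset):
    def __init__(self):
        self.os = os  # any module stored on self triggers the failure

    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return idx


if __name__ == "__main__":
    ds = ModuleHoldingDataset()
    # the "spawn" context forces the dataset to be pickled for the workers,
    # which fails with: TypeError: cannot pickle 'module' object
    loader = DataLoader(ds, batch_size=4, num_workers=2, multiprocessing_context="spawn")
    for batch in loader:
        print(batch)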

I found this problem in torchvision.datasets.PCAM:

try:
    import h5py

    self.h5py = h5py
except ImportError:
    raise RuntimeError(
        "h5py is not found. This dataset needs to have h5py installed: please run pip install h5py"
    )

Versions

python = 3.10.18
torch = 2.7.0+cu128
torchvision = 0.22.0+cu128

nehdiii · Aug 27, 2025

@nehdiii, I think this is really two issues:

  1. torch.utils.data.DataLoader depends on multiprocessing for parallelism, and it must be able to pickle all Dataset objects to send to parallel workers. If something in a Dataset is not picklable, you'll encounter the error you're seeing. However, the TorchVision repo (this one) is not the appropriate place to request a better error message. The appropriate place is the PyTorch repo (https://github.com/pytorch/pytorch).
  2. Because of the above problem, the torchvision.datasets.PCAM dataset cannot be used with torch.utils.data.DataLoader with multiple workers. Specifically, the problem is these lines, as you pointed out: https://github.com/pytorch/vision/blob/ce5b26a9284d119fe9fc63d0552abb35e0ba802a/torchvision/datasets/pcam.py#L81-L84. I understand this is a frustrating limitation for you right now. If you would like to propose a tested change, we welcome a PR. However, because of our current focus (see #9036), we will likely not prioritize improving this dataset ourselves.

scotts · Aug 30, 2025

Hi @scotts, I submitted a PR for the second point: pytorch/vision#9200.

nehdiii · Aug 30, 2025