
problem with dataload2 chunk

Open jayagami opened this issue 2 years ago • 4 comments

🐛 Describe the bug

DataLoader2 returns a DataChunk, which cannot be moved to a device.

from torchdata.datapipes.iter import FileLister, FileOpener
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

def TFRLoader(path):
    record_pipe = FileLister(path)                 # list the TFRecord files under path
    file_pipe = FileOpener(record_pipe, mode="b")  # open them in binary mode
    return file_pipe.load_from_tfrecord().map(tfrecord_parser).batch(batch_size)

rs = MultiProcessingReadingService(num_workers=cfg.num_workers)
train_dataloader = DataLoader2(train_dataset, reading_service=rs)

for i, d in enumerate(train_dataloader):
    d = d.to(device)  # fails: DataChunk has no .to()

AttributeError: 'DataChunk' object has no attribute 'to'

It also cannot be fed into the model:

TypeError: conv2d() received an invalid combination of arguments - got (DataChunk, Parameter, Parameter, tuple, tuple, tuple, int), but expected one of:
 * (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
      didn't match because some of the arguments have invalid types: (DataChunk of [Tensor, Tensor, Tensor, ...], ...)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

A current workaround is to convert the chunk to a tensor:

d = torch.stack(d.items)
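For context, `DataChunk` is essentially a thin `list` subclass, which is why it has no `.to()` method. A minimal pure-Python stand-in (no torch required; this `DataChunk` class is an illustrative simplification, not torchdata's actual implementation) shows the failure mode and why the `d.items` workaround hands back something `torch.stack` can consume:

```python
# Illustrative stand-in for torchdata's DataChunk: just a list subclass
# that also exposes the wrapped items.
class DataChunk(list):
    def __init__(self, items):
        super().__init__(items)
        self.items = items

chunk = DataChunk([1, 2, 3])

# Like a plain list, it has no .to() method, hence the AttributeError
# when you call chunk.to(device):
print(hasattr(chunk, "to"))   # False

# The workaround torch.stack(d.items) works because .items simply
# returns the underlying list of tensors:
print(chunk.items)            # [1, 2, 3]
```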

Versions

Latest

jayagami avatar Apr 20 '23 17:04 jayagami

You need to convert the batch of data to a Tensor before sending it to the device. This is expected.

ejguan avatar Apr 20 '23 17:04 ejguan

You need to convert the batch of data to a Tensor before sending it to the device. This is expected.

Thanks for replying.

But according to these tutorials, it seems I don't need to convert, and I can also send it directly into the model:

https://pytorch.org/data/beta/dp_tutorial.html

https://pytorch.org/data/beta/dlv2_tutorial.html


from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

datapipe = IterableWrapper(["./train1.csv", "./train2.csv"])
datapipe = datapipe.open_files(encoding="utf-8").parse_csv()
datapipe = datapipe.shuffle().sharding_filter()
datapipe = datapipe.map(fn).batch(8)

rs = MultiProcessingReadingService(num_workers=4)
dl = DataLoader2(datapipe, reading_service=rs)
for epoch in range(10):
    dl.seed(epoch)
    for d in dl:
        model(d)
dl.shutdown()

jayagami avatar Apr 20 '23 17:04 jayagami

But according to these tutorials, it seems I don't need to convert.

I think it's focusing on using `DataLoader` (not `DataLoader2`) to load data from a `DataPipe`. `DataLoader` doesn't need the conversion because it does it implicitly, but `DataLoader2` doesn't have such an ability, as we want to give users the flexibility to define the pipeline. You can simply do `dp.collate()` to do the conversion.

I think we probably need to deprecate the tutorial of loading datapipe via DataLoader.

ejguan avatar Apr 20 '23 18:04 ejguan
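To illustrate roughly what `dp.collate()` does to each batch: the default collate function transposes a batch of per-sample tuples into per-field sequences and stacks tensor fields. The sketch below is a hypothetical, pure-Python simplification of that transposition step (`collate_sketch` is not torchdata's API, and the `torch.stack` of tensor fields is omitted):

```python
def collate_sketch(batch):
    # Transpose a batch of (x, y, ...) samples so each field becomes its
    # own sequence, ready for torch.stack. Simplified stand-in for
    # torch.utils.data.default_collate.
    return [list(field) for field in zip(*batch)]

batch = [(1, 10), (2, 20), (3, 30)]   # three (x, y) samples
print(collate_sketch(batch))          # [[1, 2, 3], [10, 20, 30]]
```

In a pipeline like the one quoted above, appending `.collate()` after `.batch(...)` is what turns the `DataChunk` of samples into stacked tensors that can be moved to a device and fed to the model.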

Thanks, `collate()` is working. Besides, I have found that there are many details to pay attention to.

For example, I don't know why the map function uses multiprocessing even though I haven't configured it, and DataLoader was causing an infinite loop over the tfrecord pipe; I thought something was wrong with the tfrecord pipe...

And having a good tutorial example would be really helpful.

jayagami avatar Apr 21 '23 03:04 jayagami