Problem with DataLoader2 DataChunk
🐛 Describe the bug
DataLoader2 returns a DataChunk, which cannot be moved to a device.
from torchdata.datapipes.iter import FileLister, FileOpener

def TFRLoader(path):
    record_pipe = FileLister(path)
    file_pipe = FileOpener(record_pipe, mode="b")
    return file_pipe.load_from_tfrecord().map(tfrecord_praser).batch(batch_size)
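(The tfrecord_praser function is not included in the report. For context, a hypothetical sketch of such a parser is below; the "image" key and the reshape dimensions are assumptions. It returns one tensor per record, which matches the DataChunk of Tensors in the error trace further down.)

def tfrecord_praser(record):
    # Hypothetical sketch -- the real parser is not shown in the report.
    # load_from_tfrecord yields a dict mapping feature names to values;
    # the "image" key and the target shape here are assumed.
    image = record["image"].reshape(-1, 224, 224).float()
    return image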
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

rs = MultiProcessingReadingService(num_workers=cfg.num_workers)
train_dataloader = DataLoader2(train_dataset, reading_service=rs)

for i, d in enumerate(train_dataloader):
    d = d.to(device)
AttributeError: 'DataChunk' object has no attribute 'to'
It also cannot be passed directly to a model:
TypeError: conv2d() received an invalid combination of arguments - got (DataChunk, Parameter, Parameter, tuple, tuple, tuple, int), but expected one of:
* (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
didn't match because some of the arguments have invalid types: (DataChunk of [Tensor, Tensor, Tensor, ...], Parameter, Parameter, tuple, tuple, tuple, int)
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Currently, a workaround is to convert the chunk to a tensor:
d = torch.stack(d.items)
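In context, the workaround looks like this (a minimal sketch reusing train_dataloader, device, and model from above; since DataChunk subclasses list, torch.stack(d) works as well):

import torch

for i, d in enumerate(train_dataloader):
    d = torch.stack(d.items)  # stack the per-sample tensors into one batch tensor
    d = d.to(device)          # .to() now works, since d is a plain Tensor
    out = model(d)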
Versions
Latest
You need to convert the batch of data to a Tensor before sending it to the device. This is expected.
Thanks for replying.
But according to these tutorials, it seems I don't need to convert; it also looks like I can send the batch directly into the model.
https://pytorch.org/data/beta/dp_tutorial.html
https://pytorch.org/data/beta/dlv2_tutorial.html
from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

datapipe = IterableWrapper(["./train1.csv", "./train2.csv"])
datapipe = datapipe.open_files(encoding="utf-8").parse_csv()
datapipe = datapipe.shuffle().sharding_filter()
datapipe = datapipe.map(fn).batch(8)

rs = MultiProcessingReadingService(num_workers=4)
dl = DataLoader2(datapipe, reading_service=rs)
for epoch in range(10):
    dl.seed(epoch)
    for d in dl:
        model(d)
dl.shutdown()
"But according to these tutorials, it seems I don't need to convert."
I think those are focused on using DataLoader (not DataLoader2) to load data from a DataPipe. DataLoader doesn't need the conversion because it performs it implicitly, but DataLoader2 doesn't have that ability, as we want to give users the flexibility to define the pipeline themselves. You can simply do dp.collate() to do the conversion.
I think we probably need to deprecate the tutorial on loading a DataPipe via DataLoader.
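Applied to the pipeline from the report, that would mean one extra step at the end (a sketch; collate() is the functional form of the Collator DataPipe and defaults to torch.utils.data.default_collate, which stacks a batch of tensors into a single Tensor):

def TFRLoader(path):
    record_pipe = FileLister(path)
    file_pipe = FileOpener(record_pipe, mode="b")
    return (
        file_pipe.load_from_tfrecord()
        .map(tfrecord_praser)
        .batch(batch_size)
        .collate()  # default_collate stacks each DataChunk into a Tensor
    )

With this, d.to(device) and model(d) in the training loop work without manual stacking.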
Thanks, collate() works. Besides, I have found that there are many details to pay attention to.
For example, I don't know why the map function uses multiprocessing even though I never enabled it, and DataLoader causes an infinite loop over the TFRecord pipe; I thought something was wrong with the TFRecord pipe...
A good tutorial example would be really helpful.