litdata
Data shard deletion with multi-GPU does not work
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Create a litdata dataset and stream the shards (224x224x3 images plus some text) on multiple GPUs with a BERT + ResNet model, setting max_cache_size="6GB".
Added a studio to reproduce the issue.
Code sample
Added a studio to reproduce the error.
Additional context
From the logs, it seems that 4 processes are downloading the chunks, but one deletes a chunk before the others are finished with it.
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
Epoch 0: 1%|▍ | 280/20000 [05:13<6:08:00, 0.89it/s, v_num=10, train/loss=2.240, train/acc=0.124, train/f1=0.0799, train/recall=0.124, train/precision=0.0627]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
Epoch 0: 1%|▌ | 281/20000 [05:14<6:07:58, 0.89it/s, v_num=10, train/loss=2.210, train/acc=0.209, train/f1=0.124, train/recall=0.209, train/precision=0.116]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
Epoch 0: 1%|▍ | 282/20000 [05:15<6:07:54, 0.89it/s, v_num=10, train/loss=2.190, train/acc=0.130, train/f1=0.107, train/recall=0.130, train/precision=0.0993]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
Epoch 0: 1%|▌ | 283/20000 [05:16<6:07:52, 0.89it/s, v_num=10, train/loss=2.220, train/acc=0.105, train/f1=0.103, train/recall=0.105, train/precision=0.206]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
Epoch 0: 1%|▍ | 284/20000 [05:17<6:07:48, 0.89it/s, v_num=10, train/loss=2.250, train/acc=0.0921, train/f1=0.0709, train/recall=0.0921, train/precision=0.0658]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
Epoch 0: 1%|▍ | 285/20000 [05:18<6:07:45, 0.89it/s, v_num=10, train/loss=2.170, train/acc=0.120, train/f1=0.099, train/recall=0.120, train/precision=0.0995]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
Epoch 0: 1%|▍ | 286/20000 [05:20<6:07:42, 0.89it/s, v_num=10, train/loss=2.190, train/acc=0.102, train/f1=0.0932, train/recall=0.102, train/precision=0.0884]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
Epoch 0: 1%|▍ | 287/20000 [05:21<6:07:39, 0.89it/s, v_num=10, train/loss=2.210, train/acc=0.102, train/f1=0.0824, train/recall=0.102, train/precision=0.0897]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
Epoch 0: 1%|▌ | 289/20000 [05:23<6:07:33, 0.89it/s, v_num=10, train/loss=2.190, train/acc=0.140, train/f1=0.107, train/recall=0.140, train/precision=0.148]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
Epoch 0: 2%|▌ | 347/20000 [06:26<6:05:01, 0.90it/s, v_num=10, train/loss=2.240, train/acc=0.105, train/f1=0.0605, t
Epoch 0: 2%| | 348/20000 [06:27<6:04:58, 0.90it/s, v_num=10, train/loss=2.240, train/acc=0.105, train/f1=0.0605, train/recall=0.105, train/precisio
Epoch 0: 2%| | 360/20000 [06:40<6:04:32, 0.90it/s, v_num=10, train/loss=2.190, train/acc=0.138, train/f1=0.0883, train/recall=0.138, train/precisio
Traceback (most recent call last):
File "/teamspace/studios/this_studio/train.py", line 107, in <module>
...
RuntimeError: Waiting too long for the /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin to be ready
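To check the pattern described above, the DOWNLOADING and DELETING messages can be tallied per chunk. The following is a minimal sketch using only the Python standard library. The function name `tally_chunk_events` and the `SAMPLE_LOG` excerpt are illustrative and not part of litdata; the excerpt reuses lines copied from the log above.

```python
import re
from collections import Counter

# Matches the DOWNLOADING / DELETING messages, even when a progress bar
# is fused in front of them on the same line.
EVENT = re.compile(r"(DOWNLOADING|DELETING) (\S+/chunk-\d+-\d+\.bin)")


def tally_chunk_events(log_text):
    """Count how often each chunk file is downloaded and deleted."""
    downloads, deletes = Counter(), Counter()
    for action, path in EVENT.findall(log_text):
        name = path.rsplit("/", 1)[-1]  # keep only the file name
        if action == "DOWNLOADING":
            downloads[name] += 1
        else:
            deletes[name] += 1
    return downloads, deletes


# Excerpt copied from the log above (not the full log).
SAMPLE_LOG = """\
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
"""

if __name__ == "__main__":
    downloads, deletes = tally_chunk_events(SAMPLE_LOG)
    for name in sorted(deletes):
        print(f"{name}: downloaded {downloads[name]}x, deleted {deletes[name]}x")
```

In the log above, chunk-24-0.bin is announced as downloading four times and deleted once, and it is the same chunk named in the RuntimeError.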
Comment: When you are using multiple GPUs, avoid creating your datasets in the init method of the DataModule. (Support will be added in the future)
Hey @rakro101 do you think you could contribute an example with PyTorch Lightning to the repo ?
Looking forward to the examples!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is this issue still relevant?
The issue was partially resolved with a workaround and by modifying the dataloader, if I remember correctly. The workaround is the one from the comment above: when you are using multiple GPUs, avoid creating your datasets in the init method of the DataModule (support will be added in the future). I can check next week and will report back.
Thanks @rakro101 — appreciate the update! Looking forward to it.
Solved by using pl.LightningDataModule and creating the datasets in def setup ...