Data shard deletion with multi-GPU does not work

Open rakro101 opened this issue 1 year ago • 5 comments

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Create a litdata dataset, then stream the shards (224x224x3 images plus some text) while training BERT + ResNet on multiple GPUs, with max_cache_size="6GB".

Added a studio to reproduce the issue.

Code sample

Added a studio to reproduce the error.

Additional context

rakro101 avatar May 24 '24 09:05 rakro101

From the logs, it seems 4 processes are downloading the chunks, but one deletes a chunk before the others are finished with it.

DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-30-16.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-29-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-33-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-19.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-36-7.bin
Sanity Checking DataLoader 0:   0%|                                                                                                                                       | 0/2 [00:00<?, ?it/s]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-6-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-7.bin
Epoch 0:   1%|▍                                 | 280/20000 [05:13<6:08:00,  0.89it/s, v_num=10, train/loss=2.240, train/acc=0.124, train/f1=0.0799, train/recall=0.124, train/precision=0.0627]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
Epoch 0:   1%|▌                                   | 281/20000 [05:14<6:07:58,  0.89it/s, v_num=10, train/loss=2.210, train/acc=0.209, train/f1=0.124, train/recall=0.209, train/precision=0.116]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-19-17.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
Epoch 0:   1%|▍                                  | 282/20000 [05:15<6:07:54,  0.89it/s, v_num=10, train/loss=2.190, train/acc=0.130, train/f1=0.107, train/recall=0.130, train/precision=0.0993]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-32-4.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-5-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-0-15.bin
Epoch 0:   1%|▌                                   | 283/20000 [05:16<6:07:52,  0.89it/s, v_num=10, train/loss=2.220, train/acc=0.105, train/f1=0.103, train/recall=0.105, train/precision=0.206]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-18-14.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
Epoch 0:   1%|▍                               | 284/20000 [05:17<6:07:48,  0.89it/s, v_num=10, train/loss=2.250, train/acc=0.0921, train/f1=0.0709, train/recall=0.0921, train/precision=0.0658]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-21-9.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-9-11.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-22.bin
Epoch 0:   1%|▍                                  | 285/20000 [05:18<6:07:45,  0.89it/s, v_num=10, train/loss=2.170, train/acc=0.120, train/f1=0.099, train/recall=0.120, train/precision=0.0995]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-2-8.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
Epoch 0:   1%|▍                                 | 286/20000 [05:20<6:07:42,  0.89it/s, v_num=10, train/loss=2.190, train/acc=0.102, train/f1=0.0932, train/recall=0.102, train/precision=0.0884]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-27-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-15-15.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
Epoch 0:   1%|▍                                 | 287/20000 [05:21<6:07:39,  0.89it/s, v_num=10, train/loss=2.210, train/acc=0.102, train/f1=0.0824, train/recall=0.102, train/precision=0.0897]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
DELETING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-35-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-3-10.bin
DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-11-10.bin
Epoch 0:   1%|▌                                   | 289/20000 [05:23<6:07:33,  0.89it/s, v_num=10, train/loss=2.190, train/acc=0.140, train/f1=0.107, train/recall=0.140, train/precision=0.148]DOWNLOADING /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-17-21.bin
Epoch 0:   2%|▌                                 | 347/20000 [06:26<6:05:01,  0.90it/s, v_num=10, train/loss=2.240, train/acc=0.105, train/f1=0.0605, t
Epoch 0:   2%| | 348/20000 [06:27<6:04:58,  0.90it/s, v_num=10, train/loss=2.240, train/acc=0.105, train/f1=0.0605, train/recall=0.105, train/precisio
Epoch 0:   2%| | 360/20000 [06:40<6:04:32,  0.90it/s, v_num=10, train/loss=2.190, train/acc=0.138, train/f1=0.0883, train/recall=0.138, train/precisio
Traceback (most recent call last):
  File "/teamspace/studios/this_studio/train.py", line 107, in <module>
...
RuntimeError: Waiting too long for the /cache/chunks/2539bee168b4ea4262fd6320b47c8288/chunk-24-0.bin to be ready

tchaton avatar May 24 '24 10:05 tchaton
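
The pattern described in the comment above (the same chunk requested by several processes, then deleted by one of them) can be checked mechanically. Below is a minimal sketch, assuming the log is available as plain text in the DOWNLOADING/DELETING format shown. `summarize_chunk_log` and `shared_then_deleted` are hypothetical helper names for illustration, not part of litdata.

```python
import re
from collections import Counter

# Matches "DOWNLOADING <path>/chunk-A-B.bin" or "DELETING <path>/chunk-A-B.bin"
# anywhere in a line, so progress-bar text fused in front of an entry is ignored.
ENTRY_RE = re.compile(r"(DOWNLOADING|DELETING) \S*/(chunk-\d+-\d+\.bin)")


def summarize_chunk_log(log_text):
    """Return (download count per chunk, chunk names in deletion order)."""
    downloads = Counter()
    deleted = []
    for action, chunk in ENTRY_RE.findall(log_text):
        if action == "DOWNLOADING":
            downloads[chunk] += 1
        else:
            deleted.append(chunk)
    return downloads, deleted


def shared_then_deleted(log_text):
    """Chunks logged as DOWNLOADING more than once that were also deleted."""
    downloads, deleted = summarize_chunk_log(log_text)
    return [chunk for chunk in deleted if downloads[chunk] > 1]


# A few entries copied from the log above.
prefix = "/cache/chunks/2539bee168b4ea4262fd6320b47c8288/"
sample = "\n".join([
    "DOWNLOADING " + prefix + "chunk-24-0.bin",
    "DOWNLOADING " + prefix + "chunk-24-0.bin",
    "DOWNLOADING " + prefix + "chunk-5-8.bin",
    "DELETING " + prefix + "chunk-24-0.bin",
])
print(shared_then_deleted(sample))
```

In the full log above, chunk-24-0.bin is logged as DOWNLOADING four times and as DELETING once, and it is the chunk named in the final RuntimeError.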

Comment: When you are using multiple GPUs, avoid creating your datasets in the init method of the DataModule. (Support will be added in the future)

rakro101 avatar May 24 '24 13:05 rakro101

Hey @rakro101, do you think you could contribute an example with PyTorch Lightning to the repo?

tchaton avatar May 26 '24 13:05 tchaton

Hey @rakro101, do you think you could contribute an example with PyTorch Lightning to the repo?

Looking forward to the examples!

deeptimhe avatar Jun 01 '24 06:06 deeptimhe

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '25 05:04 stale[bot]

Is this issue still relevant?

bhimrazy avatar Jun 15 '25 14:06 bhimrazy

If I remember correctly, the issue was partially resolved with a workaround and by modifying the dataloader. The workaround was: "When you are using multiple GPUs, avoid creating your datasets in the init method of the DataModule. (Support will be added in the future.)" I can check next week and will report back.

rakro101 avatar Jun 15 '25 19:06 rakro101

Thanks @rakro101, appreciate the update! Looking forward to it.

bhimrazy avatar Jun 16 '25 02:06 bhimrazy

Solved by using pl.LightningDataModule and defining the datasets in def setup ...

rakro101 avatar Jun 16 '25 06:06 rakro101
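
For readers landing here, the resolution above can be sketched roughly as follows. This is an illustrative sketch, not the code from the studio. The class name `StreamingDataModule`, the `_build_dataset` helper, the batch size, and the `input_dir` value are all assumptions. The idea is that `__init__` only stores configuration, and the `StreamingDataset` is created in `setup`, which Lightning calls on every process.

```python
# Fallback base class so the sketch can be read and run without Lightning installed.
# In a real project, import pytorch_lightning directly.
try:
    import pytorch_lightning as pl
    BaseDataModule = pl.LightningDataModule
except ImportError:
    BaseDataModule = object


class StreamingDataModule(BaseDataModule):
    """Hypothetical DataModule: datasets are created in setup(), not __init__()."""

    def __init__(self, input_dir, max_cache_size="6GB", batch_size=32):
        super().__init__()
        # Store configuration only. The dataset is deliberately NOT created here.
        self.input_dir = input_dir
        self.max_cache_size = max_cache_size
        self.batch_size = batch_size
        self.train_dataset = None

    def _build_dataset(self):
        # Imported lazily so that constructing the module does not require litdata.
        from litdata import StreamingDataset
        return StreamingDataset(
            input_dir=self.input_dir,
            max_cache_size=self.max_cache_size,
        )

    def setup(self, stage=None):
        # Runs on every process, so each rank builds its own dataset.
        self.train_dataset = self._build_dataset()

    def train_dataloader(self):
        from litdata import StreamingDataLoader
        return StreamingDataLoader(self.train_dataset, batch_size=self.batch_size)


# Placeholder location; replace with the real dataset directory.
datamodule = StreamingDataModule(input_dir="s3://my-bucket/my-dataset")
```

Whether this alone removes the "Waiting too long" error depends on the setup. The thread only reports that moving dataset creation out of `__init__` resolved it for the original poster.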