pytorch-lightning
pytorch-lightning copied to clipboard
Too much time during next in dataloader
Bug description
During training, I find that `next()` in the dataloader takes 10~20 s. I already increased num_workers from 8 to 32, but it still spends a long time loading data from disk to CPU. Can you give me more advice on how to solve this problem?
What version are you seeing the problem on?
v2.1
How to reproduce the bug
Batch size is 1280; num_workers was increased from 8 to 32.
Error messages and logs
Time spent (current_batch_cost_time, seconds):
device: 0, batch_idx: 33, current_batch_cost_time: 0.8724958896636963
device: 0, batch_idx: 34, current_batch_cost_time: 0.857274055480957
device: 0, batch_idx: 35, current_batch_cost_time: 0.8920135498046875
device: 0, batch_idx: 36, current_batch_cost_time: 0.9099960327148438
device: 0, batch_idx: 37, current_batch_cost_time: 0.8724958896636963
device: 0, batch_idx: 38, current_batch_cost_time: 0.857274055480957
device: 0, batch_idx: 39, current_batch_cost_time: 0.8920135498046875
device: 0, batch_idx: 40, current_batch_cost_time: 2.0968360900878906
device: 0, batch_idx: 41, current_batch_cost_time: 3.006289005279541
device: 0, batch_idx: 42, current_batch_cost_time: 2.6699347496032715
device: 0, batch_idx: 43, current_batch_cost_time: 1.8450953960418701
device: 0, batch_idx: 44, current_batch_cost_time: 2.9467151165008545
device: 0, batch_idx: 45, current_batch_cost_time: 1.0156996250152588
device: 0, batch_idx: 46, current_batch_cost_time: 3.1411638259887695
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version 2.1
#- PyTorch Version 2.1.3
#- Python version 3.9
#- OS (e.g., Linux): linux
#- CUDA/cuDNN version: 11.4
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response
cc @justusschock @awaelchli
@niuliling123 Have you tried just iterating over the dataloader directly and measuring the time, without Lightning Trainer?
import time

t0 = time.time()
for i, batch in enumerate(train_dataloader):
    print(i, time.time() - t0, "seconds")
    t0 = time.time()
If your dataloading code is just slow, then that will explain it. And the only option there is to optimize it. For general implementation help, I suggest posting in the forum or on our Discord.
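The timing loop above can be wrapped in a small, reusable helper that works on any iterable of batches. The `fake_loader` below is a hypothetical stand-in for a real `train_dataloader`, used only so the sketch runs on its own:

```python
import time


def time_batches(loader, max_batches=50):
    """Record how long each batch takes to arrive from the loader."""
    times = []
    t0 = time.time()
    for i, _batch in enumerate(loader):
        times.append(time.time() - t0)
        if i + 1 >= max_batches:
            break
        t0 = time.time()
    return times


# Hypothetical stand-in for a real train_dataloader.
fake_loader = (list(range(10)) for _ in range(5))
per_batch = time_batches(fake_loader)
print(f"batches: {len(per_batch)}, mean fetch time: {sum(per_batch) / len(per_batch):.6f}s")
```

Spikes in the recorded times (like the jump from ~0.9 s to ~3 s in the log above) usually mean the prefetch queue has drained and the training loop is waiting on the workers.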
@niuliling123 Can you take a look at my reply?
Due to lack of response, I'm closing this for now.
import time

t0 = time.time()
for i, batch in enumerate(train_dataloader):
    print(i, time.time() - t0, "seconds")
    t0 = time.time()
From this, it's clear that the dataloader costs too much time, and sometimes the cost per batch spikes to very large values.
Can you measure the time again, once with `DataLoader(..., num_workers=0)` and once with `DataLoader(..., num_workers=4)`?
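A minimal sketch of that comparison, using a toy `Dataset` whose `time.sleep` stands in for slow per-sample I/O (the dataset, sizes, and sleep duration are illustrative assumptions, not the reporter's actual pipeline):

```python
import time

import torch
from torch.utils.data import DataLoader, Dataset


class SlowDataset(Dataset):
    """Toy dataset; time.sleep simulates slow per-sample disk reads."""

    def __len__(self):
        return 256

    def __getitem__(self, idx):
        time.sleep(0.001)  # stand-in for slow I/O per sample
        return torch.zeros(8)


def measure_epoch(num_workers):
    """Time one full pass over the dataset with the given num_workers."""
    loader = DataLoader(SlowDataset(), batch_size=32, num_workers=num_workers)
    t0 = time.time()
    for _batch in loader:
        pass
    return time.time() - t0


if __name__ == "__main__":
    for nw in (0, 4):
        print(f"num_workers={nw}: {measure_epoch(nw):.3f}s")
```

If the `num_workers=4` run is not meaningfully faster than `num_workers=0`, the bottleneck is likely per-sample work that does not parallelize well (or worker startup overhead dominating), which points back at optimizing the dataset's `__getitem__` rather than adding more workers.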