
Too much time spent in dataloader `next()`

Open · AnnaTrainingG opened this issue 1 year ago · 1 comment

Bug description

During training I find that `next()` on the dataloader can take 10–20 s per call. I already raised `num_workers` from 8 to 32, but loading data from disk to CPU still takes a long time. Can you give me more advice on how to solve this problem?

What version are you seeing the problem on?

v2.1

How to reproduce the bug

Batch size is 1280; `num_workers` was increased from 8 to 32.

Error messages and logs


Per-batch time (耗时 = "elapsed time"):

device: 0, batch_idx: 33, current_batch_cost_time: 0.8724958896636963
device: 0, batch_idx: 34, current_batch_cost_time: 0.857274055480957
device: 0, batch_idx: 35, current_batch_cost_time: 0.8920135498046875
device: 0, batch_idx: 36, current_batch_cost_time: 0.9099960327148438
device: 0, batch_idx: 37, current_batch_cost_time: 0.8724958896636963
device: 0, batch_idx: 38, current_batch_cost_time: 0.857274055480957
device: 0, batch_idx: 39, current_batch_cost_time: 0.8920135498046875
device: 0, batch_idx: 40, current_batch_cost_time: 2.0968360900878906
device: 0, batch_idx: 41, current_batch_cost_time: 3.006289005279541
device: 0, batch_idx: 42, current_batch_cost_time: 2.6699347496032715
device: 0, batch_idx: 43, current_batch_cost_time: 1.8450953960418701
device: 0, batch_idx: 44, current_batch_cost_time: 2.9467151165008545
device: 0, batch_idx: 45, current_batch_cost_time: 1.0156996250152588
device: 0, batch_idx: 46, current_batch_cost_time: 3.1411638259887695

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version 2.1 
#- PyTorch Version 2.1.3
#- Python version 3.9
#- OS (e.g., Linux): linux
#- CUDA/cuDNN version: 11.4
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @justusschock @awaelchli

AnnaTrainingG avatar Jan 30 '24 09:01 AnnaTrainingG

@niuliling123 Have you tried just iterating over the dataloader directly and measuring the time, without Lightning Trainer?

import time

t0 = time.time()
for i, batch in enumerate(train_dataloader):
    print(i, time.time() - t0, "seconds")
    t0 = time.time()

If your dataloading code is just slow, then that will explain it. And the only option there is to optimize it. For general implementation help, I suggest posting in the forum or on our Discord.
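If the dataset itself turns out to be the bottleneck, a few standard `DataLoader` knobs are worth trying before rewriting the dataset. The sketch below uses a stand-in `TensorDataset` and illustrative values; none of these settings help if `Dataset.__getitem__` itself is slow (e.g. heavy decoding or remote reads).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; replace with your real one.
train_dataset = TensorDataset(torch.randn(4096, 8))

train_dataloader = DataLoader(
    train_dataset,
    batch_size=1280,
    num_workers=2,            # more workers only help up to the CPU/disk limit
    pin_memory=True,          # speeds up host-to-GPU copies
    persistent_workers=True,  # keep worker processes alive across epochs
    prefetch_factor=4,        # batches prefetched per worker (default is 2)
)

batches = list(train_dataloader)
print(len(batches), "batches")
```

`prefetch_factor` and `persistent_workers` only apply when `num_workers > 0`.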

awaelchli avatar Jan 31 '24 12:01 awaelchli

@niuliling123 Can you take a look at my reply?

awaelchli avatar Feb 05 '24 03:02 awaelchli

Due to lack of response, I'm closing this for now.

awaelchli avatar Feb 13 '24 03:02 awaelchli

import time

t0 = time.time()
for i, batch in enumerate(train_dataloader):
    print(i, time.time() - t0, "seconds")
    t0 = time.time()

From this it's clear that the dataloader takes too much time, and sometimes the cost per batch is very large.

AnnaTrainingG avatar Feb 28 '24 09:02 AnnaTrainingG

Can you measure the time again, once with DataLoader(..., num_workers=0) and once with DataLoader(..., num_workers=4)?
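That comparison could look like the sketch below. The `TensorDataset` is a stand-in for the real dataset, and the batch size matches the one reported in the issue; the point is only to compare epoch time across worker counts.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real dataset from the issue.
dataset = TensorDataset(torch.randn(2560, 8))

def time_one_epoch(num_workers):
    """Measure one full pass over the dataloader for a given worker count."""
    loader = DataLoader(dataset, batch_size=1280, num_workers=num_workers)
    start = time.time()
    n_batches = sum(1 for _ in loader)
    return n_batches, time.time() - start

results = {w: time_one_epoch(w) for w in (0, 4)}
for workers, (n, seconds) in results.items():
    print(f"num_workers={workers}: {n} batches in {seconds:.2f} s")
```

If `num_workers=0` is much slower, the per-sample work parallelizes and more workers should help; if both are equally slow, the bottleneck is likely elsewhere (disk throughput, collation, or very large batches).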

awaelchli avatar Feb 29 '24 12:02 awaelchli