
[BUG] Multi-GPU Training frozen after finishing first epoch

Open drizzle0171 opened this issue 1 year ago • 5 comments

Describe the bug Hello, I am training a VAE. The data resolution is large, so I have recently been using DeepSpeed. With DeepSpeed, all iterations of the first epoch complete, but training never moves on to the next epoch; even after a few hours it is still stuck in the first epoch, as shown below. If I reduce the number of GPUs to one, it proceeds to the next epoch without issue. This looks like an inter-GPU communication problem at the epoch boundary. How can I solve it?

Screenshots (screenshots omitted) As you can see, GPU utilization is pinned at 100% on every GPU.

System info (please complete the following information):

  • GPU count and types: four to eight A100s

drizzle0171 commented Nov 27 '23 01:11

I am hitting the same issue. Before running your training command, I suggest first enabling more detailed logging to narrow the problem down:

export TORCH_DISTRIBUTED_DEBUG=DETAIL
export DEEPSPEED_LOG_LEVEL=debug
export OMPI_MCA_btl_base_verbose=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_CPP_LOG_LEVEL=INFO
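If it is more convenient to keep these settings with the training code, here is a minimal Python sketch that sets the same variables at the top of the entry-point script (exporting them in the shell before launch, as above, is equivalent; they need to be in place before the distributed backend is initialized):

# Sketch: enable verbose distributed/NCCL/DeepSpeed logging. These mirror the
# shell export commands above and must be set before the distributed backend
# is initialized.
import os

os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"   # detailed c10d collective checks
os.environ["DEEPSPEED_LOG_LEVEL"] = "debug"        # DeepSpeed runtime logging
os.environ["OMPI_MCA_btl_base_verbose"] = "1"      # Open MPI transport verbosity
os.environ["NCCL_DEBUG"] = "INFO"                  # NCCL communicator logs
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"            # all NCCL subsystems
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"         # PyTorch C++ log level

import torch  # import torch only after the variables are set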

HUAFOR commented Dec 12 '23 03:12

I'm training a diffusion pipeline with DeepSpeed stage 2 on 8 A100 GPUs. The first epoch trains fine, but the process freezes at the start of the second epoch, so I added some debug settings to investigate:

export TORCH_DISTRIBUTED_DEBUG=DETAIL
export DEEPSPEED_LOG_LEVEL=debug
export OMPI_MCA_btl_base_verbose=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_CPP_LOG_LEVEL=INFO

With these enabled, the process no longer hangs for hours but instead fails quickly with an error.

In my case: the base model + DeepSpeed trains fine; the base model + a custom neural module B + DeepSpeed does not train and freezes in the second epoch; with DeepSpeed disabled, the base model + module B trains fine.

HUAFOR commented Dec 12 '23 03:12

Thank you for sharing your experience. In my case, the problem turned out to be caused by the logger.

For reference, the environment set up for training was as follows.

  • pytorch lightning 1.7.7
  • using the WandbLogger implemented by pytorch lightning

To cut to the chase: with DeepSpeed, if I called self.log() without the argument rank_zero_only=True, training stalled after the first epoch, which is the problem I described above. After adding that argument to self.log, it worked fine.

To summarize,

# before
self.log(
    "[Train] Total Loss",
    aeloss,
    prog_bar=False,
    logger=True,
    on_step=True,
    on_epoch=True,
    sync_dist=True,
)

# after
self.log(
    "[Train] Total Loss",
    aeloss,
    prog_bar=False,
    logger=True,
    on_step=True,
    on_epoch=True,
    sync_dist=True,
    rank_zero_only=True,  # added argument
)
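For context, a minimal sketch of where this call sits in a LightningModule training step (the module, loss, and metric names below are placeholders, not the actual VAE code):

# Sketch only: shows the corrected self.log call inside a LightningModule
# training_step. Model and loss are placeholders for the real VAE training code.
import torch
import pytorch_lightning as pl

class VAEModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(8, 8)  # placeholder for the real VAE

    def training_step(self, batch, batch_idx):
        aeloss = self.model(batch).pow(2).mean()  # placeholder loss
        self.log(
            "[Train] Total Loss",
            aeloss,
            prog_bar=False,
            logger=True,
            on_step=True,
            on_epoch=True,
            sync_dist=True,
            rank_zero_only=True,  # the added argument; avoids the hang described above
        )
        return aeloss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)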

Additionally, the link below helped me solve my problem. https://github.com/Lightning-AI/pytorch-lightning/issues/11242

drizzle0171 commented Dec 12 '23 05:12

Thank you for sharing, but unfortunately it does not work in my case /(ㄒoㄒ)/~~. Thanks anyway!

HUAFOR commented Dec 13 '23 02:12

I'm also having the same problem, but in trainer.test: it gets stuck after a fixed number of examples (images). I tried halving the batch size and it got stuck at the same number of images; I also tried changing which images I use and saw the same problem. Using RTX 8000 / RTX 6000 with the latest pytorch-lightning (pytorch-lightning==2.1.2).

working locally on the HICO data set: https://websites.umich.edu/~ywchao/hico/
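One thing that might be worth checking (just a guess, not a confirmed diagnosis for this case) is whether every rank receives the same number of test batches, since a rank that runs out of batches early stops participating in synchronized calls (such as logging with sync_dist=True) and leaves the other ranks waiting. A quick sketch to print the per-rank batch count (the function name here is made up for illustration):

# Sketch: print how many test batches each rank sees. A mismatch across ranks
# is a common cause of hangs in distributed evaluation.
import torch.distributed as dist

def report_batch_counts(test_dataloader):
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"rank {rank}: {len(test_dataloader)} test batches", flush=True)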

itsik1 commented Dec 20 '23 15:12