DeepSpeed
DeepSpeed Hangs during Initialization
Hi - I am having an occasional issue with PyTorch Lightning's Trainer using strategy='deepspeed_stage_2' where the script just hangs while initializing DeepSpeed. I have no insight into what is going on or where the error occurs, other than the console log up to the point where it hangs forever.
CUDA_VISIBLE_DEVICES=3 python main.py --experiment gsum_embed
Set random, numpy and torch seeds to 1992
Adding <doc-sep> as an additional special token...
Loading pre-trained model from /home/ga2530/bhc_weights/led_final/pytorch_model.bin...
Reading in dataset...
Reading in data from /nlp/projects/summarization/note_partials
wandb: Currently logged in as: griffinadams (clinsum). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.0 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.10
wandb: Run data is saved locally in /home/ga2530/bhc_weights/gsum_embed/wandb/run-20230427_161843-1xhtm42d
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run gsum_embed
wandb: ⭐️ View project at https://wandb.ai/clinsum/bhc_sum
wandb: 🚀 View run at https://wandb.ai/clinsum/bhc_sum/runs/1xhtm42d
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Starting training...
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Enabling DeepSpeed FP16.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [3]
Using /home/ga2530/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Does anyone have any idea? I'm using the latest versions of PyTorch Lightning and DeepSpeed:
deepspeed==0.9.1
lightning-utilities==0.8.0
pytorch-lightning==2.0.2
This is my PyTorch Lightning Trainer. It happens with all the DeepSpeed stages. Any thoughts on how to debug or find out why/where it's hanging?
trainer = pl.Trainer(
    callbacks=callbacks,
    max_steps=args.max_steps,
    accumulate_grad_batches=args.grad_accum,
    logger=logger,
    precision=32 if args.cpu else '16-mixed',
    accelerator='cpu' if args.cpu else 'gpu',
    strategy='auto' if args.debug else 'deepspeed_stage_1',
    devices='auto',
    default_root_dir=experiment_dir,
    gradient_clip_val=0.1,
    val_check_interval=0.05,
    limit_val_batches=2 if args.debug else 0.1,
    num_sanity_val_steps=2,
    log_every_n_steps=5,
)
I should add that the hanging behavior occurs somewhat randomly, so I'm wondering if it's a network issue with the server I'm on.
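For what it's worth, one generic way to see where a hung Python process is stuck (nothing DeepSpeed- or Lightning-specific) is to register a faulthandler signal handler up front and then send that signal from another shell once it hangs; py-spy dump --pid <pid> gives the same stack dump without touching the script. A minimal sketch:
# Minimal sketch: make a hung run dump the stack traces of all threads on demand.
# Add near the top of main.py, then from another terminal run: kill -USR1 <pid>
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)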
I'm facing the same issue where it hangs at "Enabling DeepSpeed FP16." It happens at random, and I'm unable to pinpoint any one factor that is causing it.
Thanks for sharing - hopefully we can resolve it.
Which optimizer are you using? I hit the same issue where the code hangs when I use the DeepSpeed FusedAdam optimizer; it seems that the op builder hangs, but I haven't caught the reason yet.
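If it is the op builder that hangs: as far as I understand, torch's JIT extension builder takes a file lock inside the torch_extensions cache while compiling, and a run that crashed mid-build can leave that lock behind so later builds wait on it indefinitely. A quick sketch to look for leftover locks (it assumes the default cache root; TORCH_EXTENSIONS_DIR, if set, overrides it):
# Sketch: list leftover JIT-build lock files under the torch extensions cache.
# Assumes the default ~/.cache/torch_extensions root unless TORCH_EXTENSIONS_DIR is set.
import os
from pathlib import Path

root = Path(os.environ.get("TORCH_EXTENSIONS_DIR", Path.home() / ".cache" / "torch_extensions"))
for lock in root.rglob("lock"):
    print("possible stale build lock:", lock)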
I think I understand when this happens. When a previous training run crashes and for some reason doesn't kill its python processes, subsequent runs hang. The solution seems to be manually killing those python processes.
I use torch SGD and it hangs then as well.
Re: optimizer - I am using regular AdamW from torch, and when using CPU offloading (stage_2_offload) I am using DeepSpeed's CPU Adam.
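For reference, a minimal sketch of how that optimizer choice could look inside a LightningModule (the cpu_offload flag and lr hyperparameter are made-up names, not taken from the script above):
# Hypothetical configure_optimizers mirroring the setup described above:
# plain torch AdamW normally, DeepSpeed's CPU Adam when offloading to CPU.
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

def configure_optimizers(self):
    if self.hparams.cpu_offload:  # hypothetical flag
        return DeepSpeedCPUAdam(self.parameters(), lr=self.hparams.lr)
    return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)
Note that, to my understanding, DeepSpeedCPUAdam JIT-builds its C++/CUDA op the first time it is constructed, so it goes through the same torch_extensions build path discussed above.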
Thanks - I don't think my issue is the stale processes, because it happens on the first run and nvidia-smi shows no processes running before I start the script.
nvidia-smi doesn't show the processes, but ps -ef | grep python does show them for me.
https://github.com/microsoft/DeepSpeed/issues/2816
I was able to resolve the issue by removing the torch_extensions cache directory, based on the above link:
rm -rf /home/ga2530/.cache/torch_extensions/py310_cu116
I'm still getting this issue and I can't even find the torch_extensions or the py..._cu... file or directory.
+1