
DeepSpeed Hangs during Initialization

Open griff4692 opened this issue 1 year ago • 7 comments

Hi - I am having an occasional issue with PyTorch Lightning's Trainer using strategy='deepspeed_stage_2' where the script just hangs while initializing DeepSpeed. I have no insight into what is going on or where it stalls; all I can provide is the console log up to the point where it hangs forever.

CUDA_VISIBLE_DEVICES=3 python main.py --experiment gsum_embed
Set random, numpy and torch seeds to 1992
Adding <doc-sep> as an additional special token...
Loading pre-trained model from /home/ga2530/bhc_weights/led_final/pytorch_model.bin...
Reading in dataset...
Reading in data from /nlp/projects/summarization/note_partials
wandb: Currently logged in as: griffinadams (clinsum). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.0 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.10
wandb: Run data is saved locally in /home/ga2530/bhc_weights/gsum_embed/wandb/run-20230427_161843-1xhtm42d
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run gsum_embed
wandb: ⭐️ View project at https://wandb.ai/clinsum/bhc_sum
wandb: 🚀 View run at https://wandb.ai/clinsum/bhc_sum/runs/1xhtm42d

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Starting training...
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Enabling DeepSpeed FP16.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [3]
Using /home/ga2530/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...

Does anyone have any idea? I'm using the latest versions of PyTorch Lightning and DeepSpeed:

deepspeed==0.9.1
lightning-utilities==0.8.0
pytorch-lightning==2.0.2

This is my PyTorch Lightning Trainer. It happens with all of the DeepSpeed stages. Any thoughts on how to debug or find out why/where it's hanging?

trainer = pl.Trainer(
    callbacks=callbacks,
    max_steps=args.max_steps,
    accumulate_grad_batches=args.grad_accum,
    logger=logger,
    precision=32 if args.cpu else '16-mixed',
    accelerator='cpu' if args.cpu else 'gpu',
    strategy='auto' if args.debug else 'deepspeed_stage_1',
    devices='auto',
    default_root_dir=experiment_dir,
    gradient_clip_val=0.1,
    val_check_interval=0.05,
    limit_val_batches=2 if args.debug else 0.1,
    num_sanity_val_steps=2,
    log_every_n_steps=5,
)

I should add that the hanging behavior occurs somewhat randomly, so I'm wondering if it's a network issue with the server I'm on.
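
In the meantime, to at least find out where it stalls, here is a minimal sketch of something I could add near the top of main.py (general Python debugging, nothing DeepSpeed-specific, and not confirmed against this exact hang) to dump stack traces on demand:

import faulthandler
import signal

# After this, running `kill -USR1 <pid>` from another shell prints every thread's
# traceback to stderr without terminating the run, which should show exactly
# which call the process is stuck in.
faulthandler.register(signal.SIGUSR1, all_threads=True)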

griff4692 avatar Apr 30 '23 22:04 griff4692

I'm facing the same issue where it hangs at "Enabling DeepSpeed FP16." It is happening at random, and I'm unable to pin-point any one factor that is causing it.

prabhuteja12 avatar May 02 '23 15:05 prabhuteja12

I'm facing the same issue where it hangs at "Enabling DeepSpeed FP16." It is happening at random, and I'm unable to pin-point any one factor that is causing it.

Thanks for sharing - Hopefully we can resolve it.

griff4692 avatar May 02 '23 18:05 griff4692

Which optimizer are you using in your code? I ran into the same issue: the code hangs when I use the DeepSpeed FusedAdam optimizer. It seems that the op builder hangs, but I haven't caught the reason yet.
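
For context, a minimal sketch of what triggers that build step (standalone, not my training code): constructing FusedAdam JIT-compiles the fused CUDA op into the torch_extensions cache the first time, and a stale or partially built entry there can make this call block.

import torch
from deepspeed.ops.adam import FusedAdam

model = torch.nn.Linear(8, 8).cuda()
# The first construction compiles the fused_adam op under the PyTorch extensions root
# shown in the log above; if that build is stuck, the script hangs right here.
optimizer = FusedAdam(model.parameters(), lr=1e-4)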

lipiji avatar May 03 '23 04:05 lipiji

I think I understand when this happens. When a previous training run crashes and for some reason doesn't kill its Python processes, subsequent runs hang. The solution seems to be manually killing those Python processes.

I use torch's SGD and it hangs as well.

prabhuteja12 avatar May 03 '23 07:05 prabhuteja12

Re: optimizer. I am using regular AdamW from torch, and when using CPU offloading (stage_2_offload) I am using DeepSpeed's CPU Adam.
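
Roughly like this in configure_optimizers - a sketch, with placeholder hparams names rather than my actual code:

import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

def configure_optimizers(self):
    # With stage_2_offload the optimizer states sit on the CPU, so DeepSpeed's CPU Adam
    # is used there; otherwise plain torch AdamW. Note that DeepSpeedCPUAdam also goes
    # through the DeepSpeed op builder, so it touches the same torch_extensions cache.
    if self.hparams.offload:
        return DeepSpeedCPUAdam(self.parameters(), lr=self.hparams.lr)
    return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)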

griff4692 avatar May 03 '23 11:05 griff4692

I think I understand when this happens. When a previous training run crashes and for some reason doesn't kill its Python processes, subsequent runs hang. The solution seems to be manually killing those Python processes.

I use torch's SGD and it hangs as well.

Thanks - I don’t think this is my issue, because it happens even on the first run, and nvidia-smi shows no processes running before I start the script.

griff4692 avatar May 03 '23 11:05 griff4692

nvidia-smi doesn't show the processes, but ps -ef | grep python shows them for me.
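
A rough Python equivalent of that check, in case it is useful - this assumes psutil, which isn't part of the setup discussed above:

import psutil

# List python processes that nvidia-smi may not report but that can still be holding
# on to the extension build lock from a crashed run.
for proc in psutil.process_iter(["pid", "name", "cmdline"]):
    if "python" in (proc.info["name"] or ""):
        print(proc.info["pid"], " ".join(proc.info["cmdline"] or []))
        # proc.kill()  # uncomment to kill a stale process once you've verified it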

prabhuteja12 avatar May 03 '23 11:05 prabhuteja12

https://github.com/microsoft/DeepSpeed/issues/2816

I was able to resolve the issue by removing the torch_extensions cache directory, as suggested in the link above:

rm -rf /home/ga2530/.cache/torch_extensions/py310_cu116
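
In case the path is different on your machine, a minimal sketch to locate and clear it - the fallback path is an assumption based on the "PyTorch extensions root" log line above, and TORCH_EXTENSIONS_DIR overrides it when set:

import os
import shutil
from pathlib import Path

ext_root = Path(os.environ.get(
    "TORCH_EXTENSIONS_DIR",
    Path.home() / ".cache" / "torch_extensions",
))
print(f"PyTorch extensions root: {ext_root} (exists: {ext_root.exists()})")
if ext_root.exists():
    # Stale or partially built DeepSpeed ops are removed here and rebuilt on the next run.
    shutil.rmtree(ext_root)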

griff4692 avatar May 04 '23 15:05 griff4692

I'm still getting this issue and I can't even find the torch_extensions or the py..._cu... file or directory

miguelscarv avatar Jan 11 '24 19:01 miguelscarv

I'm still getting this issue and I can't even find the torch_extensions or the py..._cu... file or directory

+1

daixiangzi avatar Feb 02 '24 14:02 daixiangzi