When I fine-tune the model with DeepSpeed on two 4×A800 machines, the log only contains worker1

Open bill4689 opened this issue 1 year ago • 2 comments

When I fine-tune the model using DeepSpeed on 2 A800 machines, the log only contains worker1 and nothing from worker2. Is there any way to print the loss of worker2? The GPUs on both machines are running normally, and GPU memory usage fluctuates normally.

The script I use

image

The zero2.json I use

image

The log

image

bill4689, Jul 30 '24 10:07

@bill4689, do you know which code is generating those outputs? I don't believe it is DeepSpeed because DeepSpeed is unaware of epochs. Can you please try to locate the source of the outputs?

Regardless, you can pass --enable_each_rank_log <folder> to your deepspeed launch command to enable logs for each rank. You can invoke deepspeed -h to see all the launcher options.
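
If the loss lines come from the training script rather than from DeepSpeed, another option is to log from every rank inside the script itself. The following is only a minimal sketch, assuming the launcher exports a `RANK` environment variable to each process (the DeepSpeed launcher normally does). `make_rank_logger`, `log_loss`, and the `rank_<N>.log` file naming are hypothetical helpers for illustration, not part of DeepSpeed.

```python
import logging
import os


def make_rank_logger(log_dir: str) -> logging.Logger:
    """Return a logger that writes to a file named after this process's rank."""
    # Assumption: the launcher sets RANK; fall back to 0 when run standalone.
    rank = int(os.environ.get("RANK", "0"))
    os.makedirs(log_dir, exist_ok=True)

    logger = logging.getLogger(f"train.rank{rank}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines through the root logger
    if not logger.handlers:  # do not attach a second handler on repeated calls
        handler = logging.FileHandler(os.path.join(log_dir, f"rank_{rank}.log"))
        handler.setFormatter(
            logging.Formatter("%(asctime)s rank=" + str(rank) + " %(message)s")
        )
        logger.addHandler(handler)
    return logger


def log_loss(logger: logging.Logger, step: int, loss: float) -> None:
    """Write one loss record; call this on every rank, not only on rank 0."""
    logger.info("step=%d loss=%.4f", step, loss)
```

Inside the training loop, each process would call `log_loss(logger, step, float(loss))` after the backward pass. Because each rank writes to its own file, worker2's losses are recorded on worker2's disk and can be compared against worker1's afterwards.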

tjruwase, Aug 03 '24 10:08

@bill4689 - following up on this: do you have any updates?

loadams, Aug 13 '24 23:08

Closing for lack of response. Please feel free to re-open as needed.

tjruwase, Dec 09 '24 17:12