DeepSpeed
When I fine-tune the model with DeepSpeed on two 4*A800 machines, the log only contains worker1
When I fine-tune the model with DeepSpeed on two A800 machines, the log only contains worker1 and nothing from worker2. Is there any way to print the loss of worker2? The GPUs on both machines are running normally, and GPU memory usage fluctuates normally.
The script I use
The zero2.json I use
The log
@bill4689, do you know which code is generating those outputs? I don't believe it is DeepSpeed because DeepSpeed is unaware of epochs. Can you please try to locate the source of the outputs?
Regardless, you can pass --enable_each_rank_log <folder> to your deepspeed launch command to enable logs for each rank. You can invoke deepspeed -h to see all the launcher options.
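If the loss lines come from the training script rather than from DeepSpeed, another option is to log from every process yourself. Below is a minimal sketch, assuming the launcher exports the `RANK` and `LOCAL_RANK` environment variables to each process. The helper names `get_rank_logger` and `format_loss` are hypothetical and not part of DeepSpeed.

```python
import logging
import os


def get_rank_logger(name: str = "train") -> logging.Logger:
    """Return a logger whose messages are tagged with this process's rank."""
    # Assumption: the launcher sets RANK and LOCAL_RANK; fall back to 0 if not.
    rank = int(os.environ.get("RANK", "0"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    logger = logging.getLogger(f"{name}.rank{rank}")
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(f"[rank {rank} | local_rank {local_rank}] %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False  # keep messages out of the root logger
    return logger


def format_loss(step: int, loss: float) -> str:
    """Format one training-step loss line (hypothetical helper)."""
    return f"step={step} loss={loss:.4f}"
```

Inside the training loop, each process would then call something like `get_rank_logger().info(format_loss(step, loss))`. Note that many training loops deliberately log only from rank 0, so if the existing output is guarded by such a check, that guard would also need to be relaxed for other ranks to appear.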
@bill4689 - following up on this. Do you have any updates?
Closing for lack of response. Please feel free to re-open as needed.