I have a question about training

Open kuokay opened this issue 1 year ago • 3 comments

Screenshot 2023-04-19 163948 Screenshot 2023-04-19 164025

kuokay avatar Apr 19 '23 08:04 kuokay

@kuokay can you please monitor the system memory and GPU memory usage when you launch the script? I'm curious if this is caused by an OOM problem. You can monitor them with htop and nvidia-smi respectively.
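
For a logged record rather than watching htop and nvidia-smi by hand, something like the following rough sketch could be run in a second terminal alongside the training script. It assumes a Linux environment (including WSL2) where /proc/meminfo is readable and nvidia-smi is on the PATH; the function names are illustrative only and are not part of DeepSpeed.

```python
import shutil
import subprocess
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def system_memory_used_fraction(meminfo):
    """Fraction of RAM in use, from MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    return (total - meminfo["MemAvailable"]) / total


def gpu_memory():
    """Return [(used_mib, total_mib), ...] per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    rows = []
    for line in result.stdout.strip().splitlines():
        try:
            used, total = (int(x) for x in line.split(","))
        except ValueError:
            continue  # skip rows the driver reports as unavailable
        rows.append((used, total))
    return rows


def monitor(samples=5, interval=2.0):
    """Print RAM and GPU memory usage a fixed number of times."""
    for _ in range(samples):
        with open("/proc/meminfo") as f:
            ram = system_memory_used_fraction(parse_meminfo(f.read()))
        print(f"RAM used: {ram:.1%}  GPU (used MiB, total MiB): {gpu_memory()}")
        time.sleep(interval)
```

Calling `monitor(samples=300, interval=1.0)` just before launching the script should show whether either figure climbs toward its limit right before the failure.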

mrwyattii avatar Apr 19 '23 17:04 mrwyattii

While monitoring, I found that neither system memory nor GPU memory was exhausted. I am training under WSL2, and I don't know whether the failure is related to WSL.

kuokay avatar Apr 22 '23 00:04 kuokay

Hi @kuokay, can you please provide more information about your setup?

ds_report output: please run ds_report to give us details about your setup.

System info (please complete the following information):

  • OS: [e.g. Ubuntu 18.04]
  • GPU count and types [e.g. two machines with x8 A100s each]
  • (if applicable) Hugging Face Transformers/Accelerate/etc. versions
  • Python version
  • Any other relevant info about your setup
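
The details requested above could be gathered in one go with a small script along these lines. This is only a sketch: the helper names are made up for illustration, the WSL check is a heuristic based on the kernel release string, and the package list is an assumption about what might be installed.

```python
import platform
import shutil
import subprocess
import sys
from importlib import metadata


def is_wsl(release=None):
    """Heuristic: WSL kernels report 'microsoft' in their release string."""
    release = platform.release() if release is None else release
    return "microsoft" in release.lower()


def run_if_available(cmd):
    """Run a command and return its output, or None if the executable is missing."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True)
    return (result.stdout + result.stderr).strip()


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def collect_system_info():
    """Gather OS, Python, GPU, package, and ds_report details into a dict."""
    packages = ["deepspeed", "torch", "transformers", "accelerate"]
    return {
        "os": platform.platform(),
        "wsl": is_wsl(),
        "python": sys.version.split()[0],
        "gpus": run_if_available(["nvidia-smi", "-L"]),
        "packages": {name: package_version(name) for name in packages},
        "ds_report": run_if_available(["ds_report"]),
    }
```

Printing the result of `collect_system_info()` and pasting it into a reply would cover most of the checklist, including whether the run is under WSL.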

molly-smith avatar May 12 '23 18:05 molly-smith

Closing. Please reopen if the issue persists.

molly-smith avatar May 26 '23 18:05 molly-smith