DeepSpeed
I ran into a problem during training
[Screenshot 2023-04-19 163948]
[Screenshot 2023-04-19 164025]
@kuokay, can you please monitor the system memory and GPU memory usage when you launch the script? I'm curious whether this is caused by an OOM problem. You can monitor them with htop and nvidia-smi, respectively.
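If watching htop and nvidia-smi by hand is inconvenient, the same check can be scripted. The following is only a minimal sketch, assuming a Linux or WSL2 environment where /proc/meminfo is readable and nvidia-smi is on the PATH; the function names (parse_meminfo, parse_gpu_csv, snapshot, watch) are hypothetical helpers and not part of DeepSpeed.

```python
import subprocess
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kiB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def parse_gpu_csv(text):
    """Parse 'used, total' CSV rows (MiB) into one tuple per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(v) for v in line.split(","))
        rows.append((used, total))
    return rows


def snapshot():
    """Return (system memory dict, list of per-GPU (used, total) in MiB)."""
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    # Assumption: the driver reports numeric values for both fields.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return mem, parse_gpu_csv(out)


def watch(samples=30, interval=2.0):
    """Print a few snapshots while the training script starts up."""
    for _ in range(samples):
        mem, gpus = snapshot()
        avail_gib = mem.get("MemAvailable", 0) / 1024 ** 2
        gpu_text = ", ".join(f"{u}/{t} MiB" for u, t in gpus)
        print(f"RAM available: {avail_gib:.1f} GiB | GPU used: {gpu_text}")
        time.sleep(interval)
```

Calling watch() in a second terminal while the training script launches shows whether available RAM or free GPU memory approaches zero just before the failure, which would point toward an OOM condition.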
Through monitoring, I found that neither system memory nor GPU memory was exhausted. I am training under WSL2, and I don't know whether the failure is related to WSL.
(Sent as an email reply to Michael's comment of Thu, Apr 20, 2023, 01:01 AM on microsoft/DeepSpeed Issue #3307.)
Hi @kuokay, can you please provide more information about your setup?
ds_report output
Please run ds_report to give us details about your setup.
System info (please complete the following information):
- OS: [e.g. Ubuntu 18.04]
- GPU count and types [e.g. two machines with x8 A100s each]
- (if applicable) Hugging Face Transformers/Accelerate/etc. versions
- Python version
- Any other relevant info about your setup
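The items requested above can be gathered in one go. This is a rough sketch rather than an official DeepSpeed tool: the helper names (is_wsl, package_version, collect_info, run_ds_report) are hypothetical, the package list is an assumption about what is relevant here, and the WSL check relies on the common heuristic that the kernel release string contains "microsoft".

```python
import platform
import shutil
import subprocess
import sys
from importlib import metadata


def is_wsl(release=None):
    """Heuristic: WSL kernels carry 'microsoft' in their release string."""
    release = platform.release() if release is None else release
    return "microsoft" in release.lower()


def package_version(name):
    """Return the installed version of a package, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_info(packages=("deepspeed", "torch", "transformers", "accelerate")):
    """Collect OS, WSL status, Python version, and package versions."""
    info = {
        "os": platform.platform(),
        "wsl": is_wsl(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        info[name] = package_version(name)
    return info


def run_ds_report():
    """Run the ds_report command if it is on PATH and return its output."""
    exe = shutil.which("ds_report")
    if exe is None:
        return "ds_report not found on PATH"
    return subprocess.run([exe], capture_output=True, text=True).stdout
```

Printing collect_info() followed by run_ds_report() and pasting the result into the issue covers most of the list; GPU count and types still need to be added by hand or taken from nvidia-smi.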
Closing. Please reopen if the issue persists.