[BUG] Error with nn.transformers layer size with ZeRO stage 3

Open q737645224 opened this issue 1 year ago • 1 comments

Describe the bug A clear and concise description of what the bug is. With ZeRO stage 3 enabled, the parameter sizes of the nn.Transformers layer do not match and the parameters cannot be loaded. With stage 2, the same parameters load normally.
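For context, the difference between the two stages may explain the symptom. ZeRO stage 2 partitions optimizer states and gradients but keeps every parameter at full size on each rank. Stage 3 also partitions the parameters themselves, so outside a gathering context a module's weight tensor is an empty placeholder and a plain state-dict load can report a size mismatch. The sketch below only builds the two configurations for comparison. `make_ds_config` is a hypothetical helper and the batch size is an arbitrary assumption; it is not the reporter's actual setup.

```python
import json


def make_ds_config(zero_stage: int) -> dict:
    """Build a minimal DeepSpeed-style config dict (hypothetical helper)."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError(f"unsupported ZeRO stage: {zero_stage}")

    config = {
        # Assumed value, chosen only to make the config complete.
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {"stage": zero_stage},
    }

    if zero_stage == 3:
        # Stage 3 shards the parameters across ranks. This option asks
        # DeepSpeed to gather the 16-bit weights when saving the model,
        # so the saved tensors have their full shapes.
        config["zero_optimization"][
            "stage3_gather_16bit_weights_on_model_save"
        ] = True

    return config


stage2 = make_ds_config(2)  # parameters stay full-size on every rank
stage3 = make_ds_config(3)  # parameters are partitioned across ranks

print(json.dumps(stage3, indent=2))
```

If partitioning is the cause here, loading the weights through the engine's `load_checkpoint`, or copying them inside a `deepspeed.zero.GatheredParameters` context, is the usual direction to try. That is a guess based on the description above, not a confirmed diagnosis.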

To Reproduce Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior A clear and concise description of what you expected to happen. image

ds_report output Please run ds_report to give us details about your setup. image

Screenshots If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

  • OS: [e.g. Ubuntu 18.04]
  • GPU count and types [e.g. two machines with x8 A100s each]
  • Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
  • Python version
  • Any other relevant info about your setup

Launcher context Are you launching your experiment with the deepspeed launcher, MPI, or something else?

Docker context Are you using a specific docker image that you can share?

Additional context Add any other context about the problem here.

q737645224, May 17 '24 07:05

Can you please add a title?

loadams, May 17 '24 17:05

Closing due to lack of response. Please reopen if needed.

tjruwase, Aug 03 '24 17:08