FSDP2 example: unsharded_param.grad is zero

Open Nju-Ben opened this issue 5 months ago • 0 comments

Context

  • PyTorch version: 2.6.0
  • Operating System and version: Linux

Your Environment

  • Installed using source? [yes/no]: yes
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using: fsdp2/examples/distributed/FSDP2/train.py
  • Link to code or data to repro [if any]:no

Expected Behavior

(1) A normal loss, e.g. 1.95, 1.86, 1.73. (2) The module's unsharded_param.grad is nonzero.

Current Behavior

(1) Abnormal loss: -13857836160.0, -15615669120.0, -17379222400.0. (2) The module's unsharded_param.grad is zero in every layer when I use a logger to debug.
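
For anyone trying to reproduce the check in (2), here is a minimal sketch of a per-parameter gradient logger. It is not taken from train.py: `log_grad_norms` and the `grad_check` logger name are hypothetical, and the helper reads `param.grad` from `named_parameters()` rather than FSDP2's internal `unsharded_param`. It assumes sharded gradients are DTensors that expose `full_tensor()`, and it accepts any object with a `named_parameters()` method, such as a `torch.nn.Module`.

```python
import logging
import math

logger = logging.getLogger("grad_check")  # hypothetical logger name


def log_grad_norms(model, log=logger):
    """Log the L2 norm of every parameter gradient and return them by name.

    `model` is anything exposing `named_parameters()`, e.g. a torch.nn.Module.
    """
    norms = {}
    for name, param in model.named_parameters():
        grad = param.grad
        if grad is None:
            # No gradient has been populated for this parameter.
            norms[name] = None
            log.warning("%s: grad is None", name)
            continue
        # Assumption: sharded gradients are DTensors with full_tensor().
        # That call is a collective, so every rank must reach it.
        if hasattr(grad, "full_tensor"):
            grad = grad.full_tensor()
        norm = float(grad.float().norm())
        norms[name] = norm
        suspicious = norm == 0.0 or not math.isfinite(norm)
        log.log(logging.WARNING if suspicious else logging.INFO,
                "%s: grad norm = %.6e", name, norm)
    return norms
```

Call it on every rank after `loss.backward()` and before the gradients are zeroed; zero, `None`, or non-finite entries are logged at warning level.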

Possible Solution

Steps to Reproduce

1. Run fsdp2/examples/distributed/FSDP2/train.py
2. ...

Failure Logs [if any]

Nju-Ben, Jul 26 '25 10:07