FSDP2 example: unsharded_param.grad is zero
Context
- PyTorch version: 2.6.0
- Operating System and version: Linux
Your Environment
- Installed using source? [yes/no]: yes
- Are you planning to deploy it using docker container? [yes/no]: no
- Is it a CPU or GPU environment?: GPU
- Which example are you using: fsdp2/examples/distributed/FSDP2/train.py
- Link to code or data to repro [if any]: none
Expected Behavior
(1) The loss decreases normally (e.g. 1.95, 1.86, 1.73, ...). (2) unsharded_param.grad of each module is non-zero after backward.
Current Behavior
(1) The loss is abnormal: -13857836160.0, -15615669120.0, -17379222400.0. (2) unsharded_param.grad is zero in every layer when I use a logger to debug.
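For reference, the all-zero-gradient check was done roughly as below. This is a minimal single-process sketch, not the actual FSDP2 setup in fsdp2/examples/distributed/FSDP2/train.py (the `fully_shard` wrapping and the real logger calls are omitted); the model and `zero_grad_report` helper are made up for illustration.

```python
import torch
import torch.nn as nn

def zero_grad_report(model):
    """Map each parameter name to True if its grad is missing or all zeros."""
    return {
        name: (p.grad is None or p.grad.abs().sum().item() == 0.0)
        for name, p in model.named_parameters()
    }

torch.manual_seed(0)
# Toy stand-in for the sharded model (assumption: not the example's real model).
model = nn.Sequential(nn.Linear(8, 16), nn.Linear(16, 4))
x = torch.randn(2, 8)
loss = model(x).sum()
loss.backward()

report = zero_grad_report(model)
for name, is_zero in report.items():
    print(f"{name}: grad all-zero = {is_zero}")
```

In a healthy run every entry of the report should be False after `backward()`; in the failing run described above, every layer reports True.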
Possible Solution
Steps to Reproduce
1. Run fsdp2/examples/distributed/FSDP2/train.py
2. ...