DeepSpeed Zero3 is Incompatible with Freeze Range Code
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Set up a config like:
```yaml
unfrozen_parameters:
  - ^model.embed_tokens.weight$[128256:]  # only train the new tokens
deepspeed: deepspeed_configs/zero3.json
```
Train.
Expect something like:

```
Unfrozen model.embed_tokens.weight with ranges [(128256, 130304)]
```

Got:

```
Unfrozen model.embed_tokens.weight with ranges [(128256, 0)]
```
This leads to things...not working as intended.
https://github.com/OpenAccess-AI-Collective/axolotl/pull/1686 will make diagnosis/recognition of this easier. But it doesn't fix the root problem.
AFAICT, the root problem is that deepspeed_configs/zero3.json changes model loading such that the parameters no longer have their original shapes:
```python
>>> print(model.state_dict()["model.embed_tokens.weight"].shape)
torch.Size([0])
```
As a result, when a range's end is None, it falls back to the parameter's first dimension, which under ZeRO-3 is 0.
(It also appears that this may mess with model saving as well. My saved models with deepspeed_configs/zero3.json are way too small, possibly because they have shape torch.Size([0]) for almost all layers.)
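To illustrate the failure mode, here is a minimal sketch (a hypothetical helper, not axolotl's actual code) of how an open-ended range spec like `[128256:]` might resolve against a parameter's first dimension, and why ZeRO-3's zero-size local shape collapses the range to `(128256, 0)`:

```python
def resolve_range(spec_start, spec_end, param_rows):
    """Resolve a freeze-range spec against a parameter's row count.

    An open-ended spec like "[128256:]" has spec_end=None, so the code
    falls back to the parameter's first dimension.
    """
    end = spec_end if spec_end is not None else param_rows
    return (spec_start, end)

# Normal loading: embed_tokens has its full shape, 130304 rows.
print(resolve_range(128256, None, 130304))  # (128256, 130304)

# Under ZeRO-3, the parameter is partitioned and its local shape is
# torch.Size([0]), so the open end collapses to 0.
print(resolve_range(128256, None, 0))       # (128256, 0)
```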
Current behaviour
see above
Steps to reproduce
see above
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
- [X] Linux
- [X] macOS
- [ ] Windows
Python Version
3.11
axolotl branch-commit
Whatever the Docker image has (how do I get this from the Docker image?)
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Zero3 should handle frozen modules, per https://github.com/microsoft/DeepSpeed/pull/2653/files. Are we perhaps freezing/unfreezing too late, after DeepSpeed has wrapped the model?
> Zero3 should handle frozen modules.
I think the trouble is that range freezing relies on having shape information available, and once deepspeed has wrapped the model, that shape information is unavailable.
> Are we perhaps freezing/unfreezing too late after deepspeed has wrapped the model?
That sounds plausible. (Might that also mean that deepspeed isn't as effective as it could be at memory usage?)
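If the ordering can't be changed, one possible direction is to read the full shape that DeepSpeed records on each partitioned parameter (ZeRO-3 attaches a `ds_shape` attribute, if I understand its internals correctly) rather than the local `.shape`. A minimal sketch, using a stand-in parameter class so it runs without DeepSpeed (`FakeParam` and `full_shape` are hypothetical names):

```python
class FakeParam:
    """Stand-in for a torch.nn.Parameter, optionally ZeRO-3 partitioned."""
    def __init__(self, local_shape, ds_shape=None):
        self.shape = local_shape
        if ds_shape is not None:
            # ZeRO-3 records the original full shape here on the
            # partitioned (locally zero-size) parameter.
            self.ds_shape = ds_shape

def full_shape(param):
    # Prefer the DeepSpeed-recorded full shape when present, so range
    # resolution sees 130304 rows instead of the local 0.
    return getattr(param, "ds_shape", param.shape)

plain = FakeParam((130304, 4096))
sharded = FakeParam((0,), ds_shape=(130304, 4096))
print(full_shape(plain))    # (130304, 4096)
print(full_shape(sharded))  # (130304, 4096)
```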
@winglian Got the same problem with a stage 1 config. Unfreezing an entire layer doesn't work (no gradient).
No problem with FSDP.
Closing as stale.