ParlAI
FSDP Issues Tracker
Description: Tracking known issues during training with FSDP.
- Issue with resizing embedding dimensions in distributed training
  - Behavior: Throws an exception about embedding sizes being out of bounds
  - Repro: Train models with --ddp-backend zero2 and set --special-tok-lst
- T5 model parallel is incompatible with the zero2 ddp-backend (possibly this affects other HuggingFace agents?)
  - Behavior: The thread seems to hang indefinitely
  - Repro: Train models with --t5-model-parallel and --ddp-backend zero2
- FiD does not work with FSDP and batchsize > 1 (see #4531)
Related fix: #4505
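For anyone trying to reproduce the first two items, the flag combinations above could be assembled roughly as follows. This is only a sketch: the flags `--ddp-backend zero2`, `--special-tok-lst`, and `--t5-model-parallel` come from this issue, while the `parlai multiprocessing_train` entry point, the model and task names, the token list format, and the `true` value for the boolean flag are assumptions chosen for illustration. The script only builds and prints the command strings; it does not launch training.

```python
import shlex

# Assumed multi-GPU entry point; substitute whatever launcher you actually use.
BASE = ["parlai", "multiprocessing_train"]


def repro_command(model, task, extra_flags):
    """Build a shell-quoted command string for one repro case with zero2."""
    argv = BASE + ["--model", model, "--task", task, "--ddp-backend", "zero2"]
    return shlex.join(argv + extra_flags)


# Model names, task, and token values below are placeholders, not taken from the issue.
REPROS = {
    # Embedding resize case: zero2 combined with extra special tokens.
    "embedding_resize": repro_command(
        "transformer/generator", "convai2", ["--special-tok-lst", "TOKEN1,TOKEN2"]
    ),
    # T5 case: zero2 combined with T5 model parallelism.
    "t5_model_parallel": repro_command(
        "hugging_face/t5", "convai2", ["--t5-model-parallel", "true"]
    ),
}

if __name__ == "__main__":
    for name, cmd in REPROS.items():
        print(f"{name}: {cmd}")
```

The FiD case is left out because the issue gives no repro flags for it beyond batchsize > 1.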
This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.