ParlAI icon indicating copy to clipboard operation
ParlAI copied to clipboard

FSDP Issues Tracker

Open Rebecca-Qian opened this issue 3 years ago • 2 comments

Description Tracking known issues during training with FSDP.

  • Issue with resizing embedding dimensions in distributed train
    • Behavior: This throws an exception with embedding sizes out of bound
    • Repro: Train models with --ddp-backend zero2 and setting --special-tok-lst
  • T5 model parallel incompatible with zero2 ddp-backend (possible this affects other HuggingFace agents?)
    • Behavior: thread seems to hang indefinitely
    • Repro: Train models with --t5-model-parallel and --ddp-backend zero2
  • FiD does not work with FSDP and batchsize > 1 (see #4531)

Rebecca-Qian avatar Apr 27 '22 22:04 Rebecca-Qian

Related fix: #4505

Rebecca-Qian avatar May 02 '22 18:05 Rebecca-Qian

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.

github-actions[bot] avatar Jun 17 '22 00:06 github-actions[bot]