
[FSDP] support activation offloading with FSDP

shijie-wu opened this issue 2 years ago · 4 comments

Support whole-model activation offloading with FSDP, working in conjunction with activation checkpointing, via

https://github.com/pytorch/pytorch/blob/e9ebda29d87ce0916ab08c06ab26fd3766a870e5/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py#L171-L191

Since apply_activation_checkpointing does not wrap the overall root module, wrapping the root module with this could offload activations between layers and thus free more GPU memory. The diff should be small, and I am happy to work on it. A minimal sketch of the idea is below.
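For reference, a minimal sketch of the proposed combination, assuming the current PyTorch `checkpoint_wrapper` utilities (not an existing accelerate API; the `Block` module, sizes, and layer count are placeholders, and `torch.distributed` is assumed to be initialized, e.g. via torchrun):

```python
import functools

import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
    offload_wrapper,
)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class Block(nn.Module):  # placeholder transformer block, for illustration only
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1024, 1024)

    def forward(self, x):
        return torch.relu(self.linear(x))


# Assumes the process group is already initialized (e.g. torchrun).
model = FSDP(nn.Sequential(*[Block() for _ in range(12)]))

# Step 1: checkpoint the individual blocks; activations inside each block
# are recomputed during backward instead of being stored.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    ),
    check_fn=lambda m: isinstance(m, Block),
)

# Step 2: apply_activation_checkpointing never wraps the root module, so
# wrap it with offload_wrapper to move the remaining inter-block
# activations to pinned CPU memory (via torch.autograd.graph.save_on_cpu).
model = offload_wrapper(model)
```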

shijie-wu · Oct 06 '23

CC @pacman100

muellerzr · Oct 06 '23

> The diff should be small, and I am happy to work on it.

Hello @shijie-wu, thank you for bringing this to our notice. Do you have any measurements of how much GPU memory this saves, how much CPU memory usage goes up, and the throughput hit from the CPU <-> GPU data movement? Also, since you mentioned you're interested in adding this, looking forward to your PR 🤗.

pacman100 · Nov 08 '23

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Dec 02 '23

@muellerzr any chance this could be revived? It probably needs to go into transformers, though?

winglian · Sep 22 '24