FSDP example

Open HamidShojanazeri opened this issue 3 years ago • 4 comments

This example shows how to train a Hugging Face (HF) T5 model with FSDP. It is meant to be used alongside the FSDP tutorial.

HamidShojanazeri avatar Jul 07 '22 20:07 HamidShojanazeri

Deploy Preview for pytorch-examples-preview canceled.

Name Link
Latest commit c15c6897066d5a42fff122baac9a49f8a1b87aad
Latest deploy log https://app.netlify.com/sites/pytorch-examples-preview/deploys/62c741fbf9c2cc00089990df

netlify[bot] avatar Jul 07 '22 20:07 netlify[bot]

@rohan-varma @lessw2020 @HamidShojanazeri once you'd like to merge the PR, please let @hudeven and me know. This has been open for a while. Feel free to close any feedback you don't believe is relevant.

msaroufim avatar Sep 22 '22 15:09 msaroufim

Let me review - I was not even aware this PR existed until today, so thanks for the direct link.

lessw2020 avatar Sep 22 '22 21:09 lessw2020

General comment - this example does not use activation checkpointing due to the timing of this PR (it wasn't added in FSDP until after this PR).
But I think it would be good to update this example with it, to make sure it's present as activation checkpointing is one of our biggest throughput boosters.

lessw2020 avatar Sep 22 '22 22:09 lessw2020

Deploy Preview for pytorch-examples-preview canceled.

Name Link
Latest commit f62b4aec7bff832fc65b59aa62d60e94ddd6b39e
Latest deploy log https://app.netlify.com/sites/pytorch-examples-preview/deploys/646e47182d49400008c6a694

netlify[bot] avatar May 24 '23 06:05 netlify[bot]

@msaroufim, @hudeven sorry for the delay. I addressed the comments and made the code more modular; it would be great if we could merge this.

HamidShojanazeri avatar May 24 '23 06:05 HamidShojanazeri

> General comment - this example does not use activation checkpointing due to the timing of this PR (it wasn't added in FSDP until after this PR). But I think it would be good to update this example with it, to make sure it's present as activation checkpointing is one of our biggest throughput boosters.

Added activation checkpointing (AC) and the rate_limiter, as well as model checkpointing.

HamidShojanazeri avatar May 24 '23 06:05 HamidShojanazeri

@svekars any idea if the doc build is flaking for any reason?

@HamidShojanazeri do you mind rebasing on main to see if the error goes away?

msaroufim avatar May 24 '23 16:05 msaroufim