torchtune v0.3 regression, full_finetune

The recipe full_finetune_distributed appears to be much slower in v0.3 than in v0.2.1.

Everything seems to work as usual, but my job that used to complete in v0.2.1 times out in v0.3.0.

I don't have much detail yet, but since you are more familiar with the code base, you may already have an idea based on what changed recently!

Sep 30 '24 14:09 Delaunay

Can you share a few more details around which models you're using, size of dataset, machine type?

Off the very top of my head, not sure what would be going on.

Sep 30 '24 15:09 joecummings

I tried on 8xA100 and 8xH100.
The model is torchtune.models.llama3_1.llama3_1_70b
The dataset is torchtune.datasets.alpaca_dataset

Sep 30 '24 18:09 Delaunay

Hey @Delaunay - I looked into this and was able to repro! Unfortunately, still digging into the root cause, but a quick fix is to upgrade your PyTorch version to the nightlies.

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/

After doing this, training should be fast again (screenshot attached).
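To confirm which builds are actually active in the environment after an upgrade or downgrade, something like the following can help. It is a minimal sketch, not part of torchtune: the helper names `installed_version` and `looks_like_nightly` are made up for illustration, and the ".dev" check is an assumption about how nightly wheels are versioned.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def looks_like_nightly(version: Optional[str]) -> bool:
    # Assumption: nightly wheels carry a ".dev" segment in their version string.
    return version is not None and ".dev" in version


# Report what is installed, without importing the (heavy) packages themselves.
for name in ("torch", "torchtune"):
    v = installed_version(name)
    print(f"{name}: {v or 'not installed'} (nightly: {looks_like_nightly(v)})")
```

Posting this output alongside the slowdown report makes it easier to tell which PyTorch and torchtune combination is in play.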

Oct 02 '24 12:10 joecummings

I think @felipemello1 added some warnings about this in #1733, seems like the fix is to run on PyTorch nightlies here. Given that we have a resolution for this I am gonna close the issue, but @Delaunay if you are not unblocked please feel free to reopen.

Oct 08 '24 04:10 ebsmothers

I downgraded to 0.2.1 while I wait for PyTorch to release a new version; in my case I cannot use nightlies.

But if it OOMs, why did I not see the error being raised? Is it linked to CPU offloading, where things get moved out to the CPU to avoid an OOM, so it gets super slow but eventually OOMs anyway?

Oct 08 '24 15:10 Delaunay

@Delaunay since you can't use the nightlies I'll reopen this. The main change is that between 0.2.1 and 0.3.0 we moved onto FSDP2. Since this is a relatively new feature, it's likely that some optimizations have been made since PyTorch 2.4 (I can't pinpoint anything offhand but can dig in further here). Out of curiosity, where is it slow? Is it during training, checkpoint load, or somewhere else? And do you see similar slowdowns when running on smaller models (i.e. ones where we aren't doing CPU offload)?
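One rough way to answer the "where is it slow?" question is to wrap each phase in a wall-clock timer and compare the totals between versions. The sketch below is generic Python, not torchtune code: `timed` and `timings` are hypothetical names, and the `sum(range(...))` calls are stand-ins for the real checkpoint load and training step. On GPU, passing `torch.cuda.synchronize` as `sync` is advisable, since CUDA kernels run asynchronously and the clock would otherwise be read before the work finishes.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Optional

# Accumulated wall-clock seconds per phase name.
timings: Dict[str, float] = {}


@contextmanager
def timed(phase: str, sync: Optional[Callable[[], None]] = None):
    """Add the wall-clock time spent inside the block to timings[phase]."""
    if sync is not None:
        sync()  # flush pending async work before starting the clock
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()  # wait for the block's async work before stopping the clock
        timings[phase] = timings.get(phase, 0.0) + (time.perf_counter() - start)


# Stand-ins for the real work; replace with checkpoint loading and a train step.
with timed("checkpoint_load"):
    sum(range(1000))

for _ in range(3):
    with timed("train_step"):
        sum(range(1000))

for phase, seconds in timings.items():
    print(f"{phase}: {seconds:.4f}s")
```

Running the same instrumentation on both versions, and on a smaller model without CPU offload, should show whether the regression sits in checkpoint load or in the training steps.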

Oct 09 '24 15:10 ebsmothers

This can be closed b/c we have a new stable release that should fix this issue.

Dec 10 '24 11:12 joecummings

v0.3 regression, full_finetune_distributed slower ?