FSDP Overlap Investigation
The problem is that on LUMI, FSDP doesn't overlap computation and communication like it should. Evidence comes from this profiler trace:
It may be noteworthy that the NCCL GPU stream does not have an ID:
The problem persists on LUMI with torch 2.0.1, torch 2.1, and torch-nightly (to be torch 2.2) on 24 nodes.
There is no problem on a single 4x A100 node in Cirrascale.
The problem is also visible on a single AMD node in AAC.
I have created a script that appears to repro the issue. The script sets up 2 GPUs and 2 streams on each GPU. Each GPU performs a matrix multiplication on one stream and an all_gather on the other, and then everything is synchronized. This is done for 6 iterations, and the last iteration is recorded. There is still scope to simplify the script, but overall it is fairly basic.
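A minimal sketch of what such a repro can look like (the tensor sizes, stream handling, and launch details below are illustrative assumptions, not the exact script):

```python
# Hypothetical sketch of the repro (not the exact script): a matmul on one stream and an
# all_gather on another, synchronized at the end of each iteration, with only the last
# iteration profiled. Launch with e.g.: torchrun --nproc_per_node=2 repro.py
import contextlib
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group("nccl")  # RCCL on AMD, exposed under the "nccl" backend name
    rank = dist.get_rank()
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
    torch.cuda.set_device(device)

    a = torch.randn(8192, 8192, device=device)
    b = torch.randn(8192, 8192, device=device)
    shard = torch.randn(32 * 1024 * 1024, device=device)
    gathered = [torch.empty_like(shard) for _ in range(dist.get_world_size())]

    for step in range(6):
        # A fresh pair of streams every iteration, as in the original repro
        # (whether the streams are re-used turns out to matter -- see the discussion below).
        comp_stream = torch.cuda.Stream(device=device)
        comm_stream = torch.cuda.Stream(device=device)

        # Profile only the last iteration.
        profiler = torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU,
                        torch.profiler.ProfilerActivity.CUDA],
        ) if step == 5 else contextlib.nullcontext()

        with profiler as prof:
            with torch.cuda.stream(comp_stream):
                c = a @ b                          # computation
            with torch.cuda.stream(comm_stream):
                dist.all_gather(gathered, shard)   # communication
            torch.cuda.synchronize(device)

        if prof is not None:
            prof.export_chrome_trace(f"repro_rank{rank}.json")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```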
On beaker (NVIDIA GPUs), computation and communication appear to overlap correctly.
On LUMI (AMD GPUs), it looks like computation and communication are not overlapping.
As noted earlier, there is an NCCL GPU stream without an ID in the AMD case. I am not sure what the "Marker" blocks are, but I am inclined to believe that they are not handling the communication.
Can you run this again with torch 2.1, so we get a cleaner profiler trace? With torch 2.0, it inserts all those "marker" blocks.
Also, can you change the script in two ways:
- Create two streams up front, and then keep using the same streams (instead of creating a new stream every batch)
- At the end of each batch, print out the first element of each tensor. This forces the code to wait for and synchronize both streams at the end of each batch. What you have here is already quite good, and probably enough. These two extra things are just a bonus, for safety.
I want to send this off to AMD ASAP. This repro is gold.
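A sketch of what those two changes could look like, using the illustrative names from the sketch above:

```python
# Variant with the two suggested changes: streams created once up front and re-used across
# batches, plus a print of the first element of each tensor to force a host-side sync.
comp_stream = torch.cuda.Stream(device=device)
comm_stream = torch.cuda.Stream(device=device)

for step in range(6):
    with torch.cuda.stream(comp_stream):
        c = a @ b
    with torch.cuda.stream(comm_stream):
        dist.all_gather(gathered, shard)
    torch.cuda.synchronize(device)
    # Reading a value on the host forces both streams to have finished their work.
    print(rank, c.flatten()[0].item(), gathered[0].flatten()[0].item())
```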
It seems that implementing those 2 suggestions breaks the repro somehow. Torch 2.1 LUMI:
Torch 2.0 LUMI:
It turns out that re-using the streams causes the overlap issue to go away. Torch 2.1 LUMI with no stream re-use:
I'll see if I can figure out more about this.
That suggests a fix then?
I have made some further discoveries. The important ones:
- In my repro, if computation and communication each have dedicated non-default streams then overlap happens correctly. If the computation stream context is active when the communication stream context is activated, the overlap still happens correctly. If computation instead uses the default stream (and communication still has its own dedicated non-default stream), then overlap does not happen on LUMI. This is a bug as far as I know.
- I added a 'computation stream' at the start of our training code and observed the above phenomenon. Unfortunately, using a dedicated computation stream caused throughput to drop from 8500 tokens/dev/sec to 2000 tokens/dev/sec. Thus, the benefit of overlap is being negated by some other issue with using a dedicated computation stream.
- From what I can glean from perf profiles, I think that the communication is trivial compared to the computation for the 1B. I suspect that getting the desired overlap will have a negligible benefit (≤ 5%) to our performance.
At this point I believe that the cause of the overlap issue is neither in our code nor in FSDP. It could be due to AMD-specific logic in torch or related to AMD GPUs at some other level. We can send my repro script to AMD; it is currently set up for the case where communication has a dedicated non-default stream and computation uses the default stream.
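For reference, a minimal sketch of the two configurations from these findings (illustrative names again, not the actual script). Case A is what the repro script sent to AMD exercises; Case B is the one where overlap works:

```python
# Illustrative contrast of the two configurations described above.
comm_stream = torch.cuda.Stream(device=device)

# Case A: computation on the default stream, communication on a dedicated stream.
# This is the configuration that does NOT overlap on LUMI in these tests.
c = a @ b                                   # runs on the default stream
with torch.cuda.stream(comm_stream):
    dist.all_gather(gathered, shard)
torch.cuda.synchronize(device)

# Case B: both computation and communication on dedicated non-default streams.
# This is the configuration where overlap happens correctly.
comp_stream = torch.cuda.Stream(device=device)
with torch.cuda.stream(comp_stream):
    c = a @ b
with torch.cuda.stream(comm_stream):
    dist.all_gather(gathered, shard)
torch.cuda.synchronize(device)
```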
@2015aroras @dirkgr great thread! Is there any update you could provide on this issue? Did you trace it back to something in the AMD stack?
Marking the items prior to Feb 29th as "closed".