Place static llama variables for multigpu

xloem opened this pull request 1 year ago · 9 comments

What does this PR do?

When using accelerate, attention_mask and position_ids were being re-transferred across devices for every layer past the first device. This change transfers them to each device once, in advance.
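
Roughly, the idea is the following. This is a minimal sketch with hypothetical helper names (`precompute_per_device`, `run_decoder`), not the actual diff:

```python
# Sketch: place attention_mask and position_ids once on every device that
# hosts at least one decoder layer, then reuse those copies inside the loop
# instead of letting them be moved again on every layer call.
import torch

def precompute_per_device(tensor, layers):
    """Return {device: tensor_on_device} for every device used by `layers`."""
    per_device = {}
    for layer in layers:
        device = next(layer.parameters()).device
        if device not in per_device:
            per_device[device] = tensor.to(device)
    return per_device

def run_decoder(layers, hidden_states, attention_mask, position_ids):
    masks = precompute_per_device(attention_mask, layers)
    pos_ids = precompute_per_device(position_ids, layers)
    for layer in layers:
        device = next(layer.parameters()).device
        # hidden_states is returned by each layer, so it already ends up on the
        # right device; the static inputs are looked up from the per-device
        # cache instead of being transferred again on every call.
        # (Assumes each layer returns a tuple whose first element is the new
        # hidden_states, as the llama decoder layers do.)
        hidden_states = layer(
            hidden_states,
            attention_mask=masks[device],
            position_ids=pos_ids[device],
        )[0]
    return hidden_states
```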

Who can review?

@ArthurZucker @pacman100

xloem avatar Apr 21 '23 11:04 xloem

Not really in favor of this; if the problem is with the way accelerate handles a for loop, it should be solved in accelerate. cc @sgugger

ArthurZucker avatar Apr 21 '23 11:04 ArthurZucker

I am imagining accelerate might be able to store a cache of variables by object id so as to not repeatedly transfer the same one. When to empty such a cache is unclear to me.
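
Something along these lines, keyed by object id. This is only a rough sketch with made-up names, and the invalidation question remains open:

```python
# Rough sketch of a per-device transfer cache keyed by object id. Note that
# id() values can be reused once the original tensor is garbage collected,
# which is part of why deciding when to empty the cache is tricky.
class TransferCache:
    def __init__(self):
        self._cache = {}

    def to_device(self, tensor, device):
        key = (id(tensor), device)
        if key not in self._cache:
            self._cache[key] = tensor.to(device)
        return self._cache[key]

    def clear(self):
        self._cache.clear()

# usage: mask_on_gpu1 = cache.to_device(attention_mask, some_layer_device)
```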

xloem avatar Apr 21 '23 11:04 xloem

This is already done by Accelerate behind the scenes, so there is no need for this PR.

sgugger avatar Apr 21 '23 13:04 sgugger

@sgugger Accelerate moves the weights prior to the model forward function. Since attention_mask and position_ids (unlike hidden_states) are never returned from a layer's forward function, it moves them again and again for every layer.
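
To illustrate what I mean (simplified, not accelerate's real hook code), the dispatch behaves roughly like a pre-forward hook that moves every tensor input to the layer's device on each call:

```python
import torch

def pre_forward_move(module, args, kwargs):
    # Move every tensor argument to the device holding this module's weights.
    device = next(module.parameters()).device
    args = tuple(a.to(device) if torch.is_tensor(a) else a for a in args)
    kwargs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in kwargs.items()}
    return args, kwargs

# hidden_states comes back from the previous layer already on (or next to) the
# right device, but attention_mask and position_ids always start from the first
# device, so they cross the bus again for every layer on the other devices.
# for layer in model.model.layers:
#     layer.register_forward_pre_hook(pre_forward_move, with_kwargs=True)
```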

xloem avatar Apr 21 '23 13:04 xloem

I'm not sure the actual time lost to that is worth changing the code in Transformers, however.

sgugger avatar Apr 21 '23 13:04 sgugger

I’m on a system with iommu=soft where data transfer is very slow. Before this is closed out, I wanted to provide numbers for the speed change on my system, which was significant enough that I opened this PR. However, I am busy for a day or two.

Regardless, it is clear that you would prefer a solution be found in accelerate rather than in transformers. I opened this after seeing the various PP commits adding similar code, although they are addressing a more serious issue.

I’ll come back to this to add my numbers, or feel free to close it out for now.

xloem avatar Apr 22 '23 11:04 xloem

Running llama 65b with a software iommu, this change drops my inference time from 25.11 s/token to 19.45 s/token, a 22.5% reduction in per-token inference delay. Thoughts?
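
For clarity, the percentage is the relative reduction in per-token time:

```python
old, new = 25.11, 19.45            # s/token before and after the change
print(f"{(old - new) / old:.1%}")  # 22.5% reduction
```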

xloem avatar Apr 23 '23 12:04 xloem

I discussed it more internally with other core maintainers and we decided this falls into the hardware-specific optimizations that we don't accept, to avoid bloating the models' code. You can still use your changes locally and share them with others via our code-on-the-Hub API, though.
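
For reference, the code-on-the-Hub route means publishing the patched modeling file alongside a checkpoint and loading it with `trust_remote_code=True`. A minimal sketch, where the repository name is hypothetical:

```python
from transformers import AutoModelForCausalLM

# The repo name below is hypothetical; it would ship the patched modeling code
# with the config, and trust_remote_code=True tells transformers to execute
# that code instead of the built-in llama implementation.
model = AutoModelForCausalLM.from_pretrained(
    "username/llama-65b-static-placement",
    trust_remote_code=True,
    device_map="auto",
)
```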

sgugger avatar Apr 25 '23 13:04 sgugger

Thanks for your consideration and for the ideas about other approaches.

xloem avatar Apr 25 '23 19:04 xloem