Place static llama variables for multigpu
What does this PR do?
When using accelerate, `attention_mask` and `position_ids` were being re-transferred for every layer that lives beyond the first device. This change transfers them to each device once in advance.
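For illustration, here is a minimal sketch of the idea rather than the actual diff; the `run_decoder_layers` helper, the `device_of` callable, and the per-device dict are made up, assuming a LlamaModel-style stack of decoder layers dispatched across GPUs with accelerate's `device_map`:

```python
def run_decoder_layers(layers, hidden_states, attention_mask, position_ids, device_of):
    """Sketch only: run a stack of decoder layers split across devices."""
    # Transfer the static tensors once per device up front, instead of letting
    # the per-layer hooks copy them again for every layer on that device.
    static_per_device = {}
    for layer in layers:
        device = device_of(layer)
        if device not in static_per_device:
            static_per_device[device] = (
                attention_mask.to(device),
                position_ids.to(device),
            )

    for layer in layers:
        device = device_of(layer)
        mask, pos = static_per_device[device]
        # hidden_states flows through layer outputs, so it only needs to hop
        # devices at layer boundaries.
        hidden_states = layer(
            hidden_states.to(device), attention_mask=mask, position_ids=pos
        )[0]
    return hidden_states
```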
Who can review?
@ArthurZucker @pacman100
Not really in favor of this; if the problem is with the way accelerate handles a for loop, it should be solved in accelerate. cc @sgugger
I am imagining accelerate might be able to store a cache of variables by object id so as to not repeatedly transfer the same one. When to empty such a cache is unclear to me.
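Something like this hypothetical sketch (the `TransferCache` class and its methods are invented here, not anything in accelerate):

```python
import torch

class TransferCache:
    """Hypothetical sketch of an id-keyed transfer cache; not accelerate's code.

    Keys on (object id, target device) so the same tensor is copied at most
    once per device. Object ids can be reused after a tensor is freed, which
    is part of why deciding when to invalidate is tricky.
    """

    def __init__(self):
        self._cache = {}

    def to_device(self, tensor, device):
        key = (id(tensor), torch.device(device))
        if key not in self._cache:
            self._cache[key] = tensor.to(device)
        return self._cache[key]

    def clear(self):
        # When to call this is the open question, e.g. after each model forward.
        self._cache.clear()
```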
This is already done by Accelerate behind the scenes, so there is no need for this PR.
@sgugger Accelerate moves the weights prior to the model forward function. Since `attention_mask` and `position_ids` (unlike `hidden_states`) are never returned from a forward function, it moves them again and again for every layer.
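Roughly, as a simplified illustration of an accelerate-style pre-forward hook (this is not accelerate's actual `AlignDevicesHook` code):

```python
import torch

def pre_forward(module_device, args, kwargs):
    """Simplified sketch of a pre-forward hook that aligns inputs to a device."""
    # All inputs are moved to the layer's device on every call. hidden_states
    # comes back from each layer's output already on the right device, but
    # attention_mask and position_ids are passed in still living on their
    # original device for every layer, so they get copied again each time.
    args = tuple(a.to(module_device) if torch.is_tensor(a) else a for a in args)
    kwargs = {
        k: v.to(module_device) if torch.is_tensor(v) else v
        for k, v in kwargs.items()
    }
    return args, kwargs
```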
I'm not sure the time actually lost to that is worth changing the code in Transformers, however.
I’m on a system with iommu=soft where data transfer is very slow. Before this is closed out, I wanted to provide numbers for the speed change on my system, which was significant enough to prompt this PR. However, I am busy for a day or two.
Regardless, it is clear that you would prefer a solution in accelerate rather than in transformers. I opened this after seeing the various PP commits adding similar code, although those address a more serious issue.
I’ll come back to this to add my numbers or feel free to close it out for now.
Running llama 65b with software iommu, this change drops my inference time from 25.11 s/token to 19.45 s/token, a 22.5% reduction in per-token inference time. Thoughts?
I discussed it more internally with the other core maintainers and we decided this falls into hardware-specific optimizations that we don't accept, to avoid bloating the models' code. You can still use your changes locally and share them with others via the code on the Hub API, though.
Thanks for your consideration and ideas of other approaches.