Place static llama variables for multigpu

xloem opened this pull request 1 year ago · 9 comments

What does this PR do?

When using accelerate, attention_mask and position_ids were being re-transferred across devices for every layer past the first device. This change transfers them to each device once, in advance.
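
Roughly, the idea is the following. This is a minimal sketch with hypothetical helper names (`precompute_per_device`, `run_decoder`), not the actual diff:

```python
# Sketch: place attention_mask and position_ids once on every device that
# hosts at least one decoder layer, then reuse those copies inside the loop
# instead of letting them be moved again on every layer call.
import torch

def precompute_per_device(tensor, layers):
    """Return {device: tensor_on_device} for every device used by `layers`."""
    per_device = {}
    for layer in layers:
        device = next(layer.parameters()).device
        if device not in per_device:
            per_device[device] = tensor.to(device)
    return per_device

def run_decoder(layers, hidden_states, attention_mask, position_ids):
    masks = precompute_per_device(attention_mask, layers)
    pos_ids = precompute_per_device(position_ids, layers)
    for layer in layers:
        device = next(layer.parameters()).device
        # hidden_states is returned by each layer, so it already ends up on the
        # right device; the static inputs are looked up from the per-device
        # cache instead of being transferred again on every call.
        # (Assumes each layer returns a tuple whose first element is the new
        # hidden_states, as the llama decoder layers do.)
        hidden_states = layer(
            hidden_states,
            attention_mask=masks[device],
            position_ids=pos_ids[device],
        )[0]
    return hidden_states
```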

Who can review?

@ArthurZucker @pacman100

xloem avatar Apr 21 '23 11:04 xloem

Not really in favor of this; if the problem is with the way accelerate handles a for loop, it should be solved in accelerate. cc @sgugger

ArthurZucker avatar Apr 21 '23 11:04 ArthurZucker

I am imagining accelerate might be able to store a cache of variables by object id so as to not repeatedly transfer the same one. When to empty such a cache is unclear to me.
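
Something along these lines, keyed by object id. This is only a rough sketch with made-up names, and the invalidation question remains open:

```python
# Rough sketch of a per-device transfer cache keyed by object id. Note that
# id() values can be reused once the original tensor is garbage collected,
# which is part of why deciding when to empty the cache is tricky.
class TransferCache:
    def __init__(self):
        self._cache = {}

    def to_device(self, tensor, device):
        key = (id(tensor), device)
        if key not in self._cache:
            self._cache[key] = tensor.to(device)
        return self._cache[key]

    def clear(self):
        self._cache.clear()

# usage: mask_on_gpu1 = cache.to_device(attention_mask, some_layer_device)
```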

xloem avatar Apr 21 '23 11:04 xloem

This is already done by Accelerate behind the scenes, so there is no need for this PR.

sgugger avatar Apr 21 '23 13:04 sgugger

@sgugger Accelerate moves the weights prior to the model forward function. Since attention_mask and position_ids (unlike hidden_states) are never returned from a layer's forward function, it moves them again and again for every layer.
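
To illustrate what I mean (simplified, not accelerate's real hook code), the dispatch behaves roughly like a pre-forward hook that moves every tensor input to the layer's device on each call:

```python
import torch

def pre_forward_move(module, args, kwargs):
    # Move every tensor argument to the device holding this module's weights.
    device = next(module.parameters()).device
    args = tuple(a.to(device) if torch.is_tensor(a) else a for a in args)
    kwargs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in kwargs.items()}
    return args, kwargs

# hidden_states comes back from the previous layer already on (or next to) the
# right device, but attention_mask and position_ids always start from the first
# device, so they cross the bus again for every layer on the other devices.
# for layer in model.model.layers:
#     layer.register_forward_pre_hook(pre_forward_move, with_kwargs=True)
```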

xloem avatar Apr 21 '23 13:04 xloem

I'm not sure the actual time lost to that is worth changing the code in Transformers, however.

sgugger avatar Apr 21 '23 13:04 sgugger

I’m on a system with iommu=soft where data transfer is very slow. Before this is closed out, I wanted to provide numbers for the speed change on my system, which was significant enough that I opened this PR. However, I am busy for a day or two.

Regardless, it is clear that you would prefer a solution be found in accelerate rather than in transformers. I opened this after seeing the various PP commits adding similar code, although they are addressing a more serious issue.

I’ll come back to this to add my numbers, or feel free to close it out for now.

xloem avatar Apr 22 '23 11:04 xloem

Running llama 65b with a software iommu, this change drops my inference time from 25.11 s/token to 19.45 s/token, a 22.5% reduction in per-token inference delay. Thoughts?
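
For clarity, the percentage is the relative reduction in per-token time:

```python
old, new = 25.11, 19.45            # s/token before and after the change
print(f"{(old - new) / old:.1%}")  # 22.5% reduction
```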

xloem avatar Apr 23 '23 12:04 xloem

I discussed it more internally with other core maintainers and we decided this falls into the hardware-specific optimizations that we don't accept, to avoid bloating the models' code. You can still use your changes locally and share them with others via our code-on-the-Hub API, though.
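
For reference, the code-on-the-Hub route means publishing the patched modeling file alongside a checkpoint and loading it with `trust_remote_code=True`. A minimal sketch, where the repository name is hypothetical:

```python
from transformers import AutoModelForCausalLM

# The repo name below is hypothetical; it would ship the patched modeling code
# with the config, and trust_remote_code=True tells transformers to execute
# that code instead of the built-in llama implementation.
model = AutoModelForCausalLM.from_pretrained(
    "username/llama-65b-static-placement",
    trust_remote_code=True,
    device_map="auto",
)
```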

sgugger avatar Apr 25 '23 13:04 sgugger

Thanks for your consideration and for the ideas about other approaches.

xloem avatar Apr 25 '23 19:04 xloem