Feature Request: Support MS-AMP

Open winglian opened this issue 2 years ago • 8 comments

Docs

MS-AMP would also allow us to store the weights in FP8, letting larger models be trained on smaller hardware; right now the weights are still kept on device as fp16/bf16.

The implementation example they provide seems similar to accelerate.prepare(...):

```python
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")
```
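For context, here is a self-contained sketch of what that standalone MS-AMP call looks like in practice (the tiny model and optimizer below are placeholders for illustration, not taken from any Accelerate example):

```python
# Standalone MS-AMP usage following the snippet above.
# Assumes the msamp package is installed and a CUDA device is available.
import torch
import msamp

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Per the MS-AMP docs, "O2" roughly means FP8 weights/gradients plus
# FP8-aware optimizer state, while "O1" limits FP8 to weights, gradients
# and communication (check their docs for the exact semantics).
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")
```

Presumably an Accelerate integration could hide this behind `accelerator.prepare(model, optimizer)` the same way other mixed-precision backends are handled.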

winglian avatar Nov 10 '23 14:11 winglian

Might be good to have this as an alternative choice, from their docs:

MS-AMP has the following benefits compared with Transformer Engine:

- Speed up memory-limited operations by accessing one byte compared to half or single precision.
- Reduce memory requirements for training models, enabling larger models.
- Speed up communication for distributed models by transmitting lower-precision gradients.
- Reduce training time for large language models with larger minibatches.

Will work on this next week :)

muellerzr avatar Nov 10 '23 15:11 muellerzr

+++ would love to see MS-AMP supported. Currently, H100s are on par with A100s cost-wise even with the current FP8 implementation, but if MS-AMP FP8 can be implemented, it is likely anywhere between a 50-100% boost in training speed. We still need Flash Attention with FP8, but MS-AMP is a great first step towards faster training.

casper-hansen avatar Nov 11 '23 21:11 casper-hansen

@muellerzr is this branch in a state to be tested? https://github.com/huggingface/accelerate/tree/ms-amp thanks!

winglian avatar Nov 29 '23 15:11 winglian

@winglian not quite yet! But I'll let you know when it's ready for you to test :) (should be by end of this week!)

muellerzr avatar Nov 29 '23 16:11 muellerzr

@winglian go ahead and try the branch out :) Note that it only works on single GPU for now (I'll look at DeepSpeed tomorrow), and I don't think you'll see a time decrease yet. What you should see, though, is a memory decrease for NLP-based models.

For example, I ran bert-base-cased (NLP example) and saw:

FP8:
Before: 610.92 MB
After: 2.14 GB
BF16:
Before: 413.69 MB
After: 2.72 GB

But training time increased by almost 2x 😱
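(For reference, here is a sketch of how before/after memory numbers like these are usually collected; this is just my guess at the measurement, not the exact script used:)

```python
import torch

def to_mb(num_bytes: int) -> float:
    # Convert bytes to megabytes for readable logging.
    return num_bytes / 2**20

torch.cuda.reset_peak_memory_stats()
before = torch.cuda.memory_allocated()      # memory once model/optimizer are on device

# ... run a few training steps here ...

after = torch.cuda.max_memory_allocated()   # peak memory observed during training
print(f"Before: {to_mb(before):.2f} MB, After: {to_mb(after):.2f} MB")
```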

muellerzr avatar Nov 29 '23 19:11 muellerzr

Shouldn't the FLOPs increase and thereby reduce training time? The difference may not show up on small models, but if you take a 30B model, I would be surprised if you don't see one.

casper-hansen avatar Nov 29 '23 20:11 casper-hansen

Correct. I only tested on a tiny model just to get the API stable 😉

muellerzr avatar Nov 29 '23 20:11 muellerzr

Now that it's a bit more stable, I saw both memory decreases and speed increases when combining MS-AMP and TransformerEngine. More details are in the PR (so overall purely positive).
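For anyone who wants to try it once the PR lands, my understanding is that the wiring looks roughly like the sketch below; the exact argument names for the `FP8RecipeKwargs` handler are an assumption on my part, so check the PR/docs:

```python
# Rough sketch of enabling the MS-AMP backend through Accelerate's FP8 support.
# Assumes an FP8RecipeKwargs handler as added in the PR; argument names may differ.
import torch
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

fp8_kwargs = FP8RecipeKwargs(backend="msamp", opt_level="O2")  # "te" would pick Transformer Engine
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[fp8_kwargs])

model, optimizer = accelerator.prepare(model, optimizer)
```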

muellerzr avatar Dec 07 '23 02:12 muellerzr

@muellerzr accelerate fp8 with the ms-amp backend doesn't seem to work with DeepSpeed. However, MS-AMP itself supports DeepSpeed (ZeRO): https://azure.github.io/MS-AMP/docs/user-tutorial/usage/#usage-in-deepspeed
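For reference, the linked MS-AMP docs enable it in standalone DeepSpeed through the DeepSpeed config. The sketch below is a paraphrase from memory of that page, not anything Accelerate does today; the config keys and setup should be verified against the linked docs:

```python
# Sketch of standalone MS-AMP + DeepSpeed (ZeRO) usage, paraphrased from the linked docs.
# The "msamp" config section and opt_level "O3" are my recollection of those docs;
# the model/optimizer settings are placeholders.
import msamp  # noqa: F401  (per my reading, MS-AMP must be importable alongside DeepSpeed)
import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_batch_size": 32,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},
    "msamp": {"enabled": True, "opt_level": "O3"},  # O3 is the level aimed at ZeRO, per MS-AMP docs
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```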

LSC527 avatar Jul 25 '24 12:07 LSC527

Correct, I'm looking into that this week

muellerzr avatar Aug 15 '24 16:08 muellerzr