Speed up model init on CPU (by 30x+ for llama-3-70B as one example)
What does this PR do?
After a thorough investigation, I found that loading a model with `AutoModel` was far too slow given a super fast M.2 drive.
Two years ago, when Sylvain introduced the sharding mechanism, PyTorch's `_load_from_state_dict` did not copy parameters in a module's descendants. That is no longer true today, and the docs explicitly mention this!
On top of that, if we set `assign=True` when loading the state dict (and stick to the PyTorch factory path), we see a tremendous speedup with less memory usage overall.
Overall, this PR introduces a "skipped" weight init when loading weights on CPU: instead of materializing the weights immediately, we load them in when the first input is passed through the model on CPU (lazy loading). This can cut model init time by a large factor, essentially "borrowing" time from the first call (which will be slower); subsequent calls are then fast as usual.
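As a rough illustration of the loading pattern this relies on (a minimal sketch under the assumptions noted in the comments, not the actual transformers code path):

```python
import torch
import torch.nn as nn
from safetensors.torch import load_file

# Minimal sketch of the idea (not the exact code in this PR). Assumes a
# single-file safetensors checkpoint whose keys cover every parameter and
# buffer of the model; `build_model` and `ckpt_path` are placeholders.
def fast_cpu_load(build_model, ckpt_path: str) -> nn.Module:
    # 1) Build the module tree without allocating real storage for the weights.
    with torch.device("meta"):
        model = build_model()

    # 2) Read the checkpoint; safetensors keeps this cheap on CPU.
    state_dict = load_file(ckpt_path, device="cpu")

    # 3) assign=True attaches the checkpoint tensors directly instead of
    #    copying them into freshly initialized parameters.
    model.load_state_dict(state_dict, strict=True, assign=True)
    return model
```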
Example model init time:
| Model | Before fix (s) | After fix (s) |
|---|---|---|
| llama-3-8B | 1.858 | 0.183 |
| llama-3-70B | 30.36 | 1.238 |
Example model throughput:
First pass
Note: this is special as the model weights get loaded in fully here, so it will be slower
| Model | Before fix (tok/s) | After fix (tok/s) |
|---|---|---|
| llama-3-8B | 2.416 | 2.382 |
| llama-3-70B | 0.286 | 0.256 |
Second pass
| Model | Before fix (tok/s) | After fix (tok/s) |
|---|---|---|
| llama-3-8B | 2.470 | 2.445 |
| llama-3-70B | 0.290 | 0.288 |
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@LysandreJik @amyeroberts @SunMarc
Note: we're seeing some failures with encoder/decoder models that don't have tied weights. Not fully sure what's up there but @SunMarc is investigating
This can allegedly also increase throughput from model.generate()... or at least that's what I'm seeing.
Setup:
import time
from accelerate.utils import set_seed
from transformers import LlamaForCausalLM, AutoTokenizer
set_seed(42)
file_size = 132 # Size in GB of the weights
factory_model = LlamaForCausalLM.from_pretrained("/mnt/superfast/llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("/mnt/superfast/llama-3-8B")
inputs = tokenizer("Blue is my favorite color. What is my favorite color?", return_tensors="pt")
start_time = time.time()
output = factory_model.generate(**inputs, max_new_tokens=20, num_return_sequences=1)
end_time = time.time()
time_taken = end_time - start_time
print(f"inference time={time_taken:.3f} seconds")
print(f"speed={file_size/time_taken:.3f} GB/second")
new_tokens = len(output[0]) - inputs.input_ids.shape[1]
print(f'tok/s={new_tokens/time_taken:.3f}')
Current setup in HF:
inference time=24.841 seconds
speed=5.314 GB/second
tok/s=0.805
New version:
inference time=9.205 seconds
speed=14.341 GB/second
tok/s=2.173
Did some tests with optimum-benchmark, new throughput results on CPU:
- transformers main: 0.24 tokens/s
- transformers (my branch): 2.46 tokens/s
@msaroufim if you have a moment, could you give this a look to check that everything makes sense here per my understanding of how we should be loading in model weights, etc? Would be very appreciative of your eyes/take on this
For transparency, here is the script I'm using: https://gist.github.com/muellerzr/7239668f61baff5726f556d30d2af5f5
💛 💛 💛 (local-gemma is very happy with this)
Got confirmation from Mark S (thanks Mark for looking this over) and this is indeed correct 🔥
A very important caveat @SunMarc and I discovered today: set_module_tensor_to_device is very slow, so if users do device_map="auto" they will not see this speedup and it will still be slow.
For now this works as a small fix for users who load everything on CPU first / don't use device_map="auto", but there is more work we need to do.
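To make the caveat concrete, a hedged sketch of the two call patterns (the path reuses the example path from earlier; both calls are plain `from_pretrained` usage):

```python
import torch
from transformers import LlamaForCausalLM

# Fast path after this PR: plain CPU load (no device_map), weights are
# attached lazily instead of being copied at init time.
model = LlamaForCausalLM.from_pretrained(
    "/mnt/superfast/llama-3-8B", torch_dtype=torch.bfloat16
)

# Slow path (unchanged by this PR): device_map="auto" dispatches weights via
# accelerate's set_module_tensor_to_device, which does not see this speedup.
model = LlamaForCausalLM.from_pretrained(
    "/mnt/superfast/llama-3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
```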
About the failing tests, we had the following ones:
FAILED tests/models/seamless_m4t_v2/test_modeling_seamless_m4t_v2.py::SeamlessM4Tv2ModelWithSpeechInputTest::test_load_save_without_tied_weights - AssertionError: SeamlessM4Tv2Model: Tensor text_encoder.embed_tokens.weight: Tensor-likes are not close!
Mismatched elements: 114 / 120 (95.0%)
Greatest absolute difference: 0.10039517283439636 at index (12, 5) (up to 1e-05 allowed)
Greatest relative difference: 63.95556640625 at index (16, 0) (up to 1.3e-06 allowed)
FAILED tests/models/switch_transformers/test_modeling_switch_transformers.py::SwitchTransformersModelTest::test_load_save_without_tied_weights - AssertionError: SwitchTransformersModel: Tensor encoder.embed_tokens.weight: Tensor-likes are not close!
Mismatched elements: 3163 / 3168 (99.8%)
Greatest absolute difference: 0.010853035375475883 at index (47, 11) (up to 1e-05 allowed)
Greatest relative difference: 8950.5673828125 at index (4, 15) (up to 1.3e-06 allowed)
FAILED tests/models/m2m_100/test_modeling_m2m_100.py::M2M100ModelTest::test_load_save_without_tied_weights - AssertionError: M2M100Model: Tensor encoder.embed_tokens.weight: Tensor-likes are not close!
Mismatched elements: 1566 / 1584 (98.9%)
Greatest absolute difference: 0.0897781103849411 at index (57, 3) (up to 1e-05 allowed)
Greatest relative difference: 2123.130126953125 at index (23, 3) (up to 1.3e-06 allowed)
FAILED tests/models/switch_transformers/test_modeling_switch_transformers.py::SwitchTransformersEncoderOnlyModelTest::test_load_save_without_tied_weights - AssertionError: SwitchTransformersEncoderModel: Tensor encoder.embed_tokens.weight: Tensor-likes are not close!
Mismatched elements: 3159 / 3168 (99.7%)
Greatest absolute difference: 0.010908672586083412 at index (1, 3) (up to 1e-05 allowed)
Greatest relative difference: 4020.068115234375 at index (39, 5) (up to 1.3e-06 allowed)
FAILED tests/models/seamless_m4t_v2/test_modeling_seamless_m4t_v2.py::SeamlessM4Tv2ModelWithTextInputTest::test_load_save_without_tied_weights - AssertionError: SeamlessM4Tv2Model: Tensor text_encoder.embed_tokens.weight: Tensor-likes are not close!
Mismatched elements: 114 / 120 (95.0%)
Greatest absolute difference: 0.08684197068214417 at index (9, 0) (up to 1e-05 allowed)
Greatest relative difference: 1055.23974609375 at index (1, 0) (up to 1.3e-06 allowed)
FAILED tests/models/speech_encoder_decoder/test_modeling_speech_encoder_decoder.py::Wav2Vec2BertModelTest::test_save_and_load_from_pretrained - AssertionError: 0.6309062 not less than or equal to 1e-05
=========== 6 failed, 2815 passed, 3759 skipped in 68.00s (0:01:08) ============
It is a bit complicated, but basically these tests were not supposed to pass initially. They ended up passing because the weights were tied by default (even when config.tie_word_embeddings=False), and modifying the shared layer caused the other layers to be modified too, without needing to re-tie the weights.
For example, this is the architecture of SeamlessM4Tv2Model. We see that we have a shared layer by default without any config attribute to make it optional.
def __init__(self, config, current_modality="text"):
    super().__init__(config)

    self.shared = nn.Embedding(config.vocab_size, config.hidden_size, config.pad_token_id)

    self.text_encoder = SeamlessM4Tv2Encoder(config, self.shared)
    self.speech_encoder = SeamlessM4Tv2SpeechEncoder(config)
    self.text_decoder = SeamlessM4Tv2Decoder(config, self.shared)
    self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
With this PR, we set assign=True, and by doing this we recreated the shared layer, breaking the tied weights. Since tie_weights does nothing (config.tie_word_embeddings=False), we get different values in the end.
assign (bool, optional): When ``False``, the properties of the tensors
in the current module are preserved while when ``True``, the
properties of the Tensors in the state dict are preserved. The only
exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s
for which the value from the module is preserved.
Default: ``False``
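A toy reproduction of the breakage (a minimal sketch, not the actual SeamlessM4T code or test): once each key in the saved state dict is its own tensor, assign=True gives every module its own Parameter and the tie is lost.

```python
import torch
import torch.nn as nn

class TinyTied(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Embedding(10, 4)
        self.lm_head = nn.Linear(4, 10, bias=False)
        self.lm_head.weight = self.shared.weight  # tie at the Parameter level

model = TinyTied()
assert model.lm_head.weight is model.shared.weight  # tied: one Parameter, two names

# Simulate a checkpoint where each key was saved as its own tensor.
state_dict = {k: v.clone() for k, v in model.state_dict().items()}

# Default (assign=False) would copy values into the existing shared Parameter,
# so the tie survives. With assign=True each name gets its own tensor object:
model.load_state_dict(state_dict, assign=True)
print(model.lm_head.weight is model.shared.weight)  # False -> weights are no longer tied
```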
So, the conclusion is that there should be no problem with the modification you did; I just need to skip/modify the tests. cc @muellerzr
In the future, we just need to make sure that when we have shared weights by default, we skip the tests or add the possibility to remove these shared weights.
Okay ran some traces and I think this makes sense to me now.
Compare the following first calls to .generate():
Baseline:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | CPU Mem | Self CPU Mem | # of Calls |
|---|---|---|---|---|---|---|---|---|
| aten::clone | 0.03% | 6.590ms | 0.20% | 45.418ms | 33.844us | 359.36 Mb | -54.77 Mb | 1342 |
| aten::reshape | 0.02% | 4.874ms | 0.22% | 49.277ms | 6.960us | 330.34 Mb | 344.00 Kb | 7080 |
| aten::empty_like | 0.01% | 2.366ms | 0.02% | 5.526ms | 4.118us | 325.24 Mb | 125.18 Mb | 1342 |
| aten::empty | 0.02% | 3.810ms | 0.02% | 3.810ms | 0.932us | 266.56 Mb | 266.56 Mb | 4087 |
| aten::matmul | 0.11% | 25.907ms | 96.87% | 22.083s | 4.296ms | 149.42 Mb | -344.00 Kb | 5140 |
| aten::linear | 2.79% | 635.090ms | 96.92% | 22.094s | 4.910ms | 149.22 Mb | 3.75 Mb | 4500 |
Fix:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | CPU Mem | Self CPU Mem | # of Calls |
|---|---|---|---|---|---|---|---|---|
| aten::clone | 0.07% | 5.631ms | 0.25% | 20.631ms | 15.373us | 194.36 Mb | -31.73 Mb | 1342 |
| aten::reshape | 0.14% | 11.557ms | 0.41% | 34.781ms | 4.913us | 165.00 Mb | 0 b | 7080 |
| aten::empty_like | 0.04% | 3.059ms | 0.06% | 4.894ms | 3.647us | 182.18 Mb | 57.70 Mb | 1342 |
| aten::empty | 0.03% | 2.357ms | 0.03% | 2.357ms | 0.577us | 160.70 Mb | 160.70 Mb | 4083 |
| aten::matmul | 0.53% | 44.745ms | 88.49% | 7.438s | 1.447ms | 74.81 Mb | 0 b | 5140 |
| aten::linear | 0.70% | 58.470ms | 88.82% | 7.466s | 1.659ms | 74.61 Mb | 550.00 Kb | 4500 |
What this hints at, I believe, is that because we are using mmap, the time it takes to read the weights from the mmap'd file plus perform the operation itself is lower than the current path. Consistent with that, the first run was ~0.2s longer than later runs, which I think comes from reading from disk / allocating memory; that would make sense given the first pass has ~8s in which to read everything in.
And because it's mmap'd, only one layer's weights need to be resident at a time as we replace them all, which is why we appear to use less memory.
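For context, a hedged sketch of what mmap-backed loading looks like in plain PyTorch (illustrative only, not the exact transformers internals; the file and module here are throwaway placeholders):

```python
import torch
import torch.nn as nn

# Save a small checkpoint just so the example is self-contained.
model = nn.Linear(4096, 4096)
torch.save(model.state_dict(), "linear.pt")

# mmap=True (PyTorch >= 2.1) maps the file instead of reading it eagerly:
# tensor data is only paged in from disk when it is first touched,
# e.g. by the first forward pass.
state_dict = torch.load("linear.pt", map_location="cpu", mmap=True)

# assign=True makes the parameters point directly at the mmap-backed storage,
# so nothing is copied up front and init stays cheap.
model.load_state_dict(state_dict, assign=True)
```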
@muellerzr, I am curious about the I/O speeds in your OP. Can you please confirm that you are transferring weights from NVMe to HBM at 75-90GB/sec? Are you able to share PCIe and m.2 specs? Thanks
@tjruwase the answer I've come to (as in my last post) is that mmap is covering the transfer from the M.2 when we first pass an input through the model (I think). At init, the weights are only allocated as mmap'd views, not fully loaded in. Hence why I'm seeing numbers far above what my actual M.2 drive can deliver; ~8s to bring the weights in during the first pass is reasonable though!
(Because yes, I'd love to know what planet has a 75-90GB/s non-RAID M.2 as well!)
My setup:
- NVME: Crucial T705 2TB
- Memory: 192GB DDR5 running at 3600 MT/s
- CPU: AMD Ryzen 9 7950X
- MOBO is an Asus ProArt X670E-CREATOR; my 2x 4090s are running at x8/x8
Let me know how much more specific I can get with this for you!
> (Because yes, I'd love to know what planet has a 75-90GB/s non-RAID M.2 as well!)
@muellerzr, thanks for the clarification. As you may have guessed fast I/O is a passion, and I am also awaiting the above :).
@tjruwase do let me know if you see anything else odd about what I’ve done here etc too/if you have insights. I’ll look into the DeepSpeed stuff in a few days!
@muellerzr, your NVMe is blazingly fast, ~14GB/sec reads. May I request your contribution to the following? https://github.com/microsoft/DeepSpeed/issues/998
> @tjruwase do let me know if you see anything else odd about what I've done here etc too/if you have insights. I'll look into the DeepSpeed stuff in a few days!
@muellerzr, nothing looks odd to me. This is truly amazing work that you have done here, kudos!
Do let me know if my suggestion for updating sharded DeepSpeed weights above is insufficient or problematic.
Okay! After a ton of thorough testing I've proven that:
- When loading via `device_map="auto"`, it's the same speed as though loading the model in half precision and doing `.cuda()`
- When loading via `device_map="auto"` and on CPU, it's the same speed again (the faster speed)
- Part of my issue was not specifying `torch_dtype=torch.bfloat16` when doing llama tests, so a ton of time was wasted upcasting to `float32`, something others may do by accident too since it's not done by default, I was finding. (Not sure what we can do about that, just something I noticed.)
- I did notice a slight speedup when doing `model = LlamaForCausalLM.from_pretrained(llama_path, torch_dtype=torch.bfloat16).cuda()` during model loading when compared to our current implementation, so it still can help.
- When doing `factory_model = LlamaForCausalLM.from_pretrained("/mnt/superfast/llama-3-8B")`, the model weights are loaded in `bf16`; Marc mentioned this might be a bad bug cc @ArthurZucker. Only if we do `device_map="auto"` are they loaded in `fp32`. (This will also cause slowdowns during model loading I found, which makes sense I think considering more parameters.)
- Given that those are the only changes, and OOTB this should just work, this PR from a non-DeepSpeed standpoint is good to merge.
Users will not see much speedup if they do device_map="auto" for the aforementioned reasons, but this still helps other folks out too!
When I eventually ripped everything out to test, here's my full code:
from transformers import LlamaForCausalLM, AutoConfig, AutoTokenizer
from accelerate.utils import set_seed
from accelerate.big_modeling import init_empty_weights
from safetensors.torch import load_file
from pathlib import Path
import json
from safetensors import safe_open
from accelerate.utils import retie_parameters
from transformers import GenerationConfig
from transformers.utils.hub import get_checkpoint_shard_files
import time
set_seed(42)
llama_path = Path("/mnt/superfast/llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained(llama_path)
inputs = tokenizer("Tell me about a girl that", return_tensors="pt")
config = AutoConfig.from_pretrained(llama_path)
use_keep_in_fp32_modules = False
resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
    llama_path,
    llama_path / "model.safetensors.index.json"
)
loaded_state_dict_keys = sharded_metadata["all_checkpoint_keys"]
config = LlamaForCausalLM._autoset_attn_implementation(
    config, use_flash_attention_2=False, torch_dtype=None, device_map=None
)
with init_empty_weights():
    factory_model = LlamaForCausalLM(config)
index_filename = llama_path / "model.safetensors.index.json"
with open(index_filename, "r") as f:
    index = json.load(f)
if "weight_map" in index:
    index = index["weight_map"]
checkpoint_files = sorted(list(set(index.values())))
checkpoint_files = [llama_path / f for f in checkpoint_files]
model_keys = set(factory_model.state_dict().keys())
new_state_dict = {}
for checkpoint_file in checkpoint_files:
    with safe_open(checkpoint_file, framework="pt") as f:
        metadata = f.metadata()
        weight_names = f.keys()
    file_state = load_file(checkpoint_file)
    new_state_dict.update(file_state)
factory_model.load_state_dict(new_state_dict, strict=True, assign=True)
retie_parameters(factory_model, [["lm_head.weight"]])
factory_model.eval()
factory_model.generation_config = GenerationConfig.from_pretrained(llama_path)
start_time = time.time()
output = factory_model.generate(**inputs, max_new_tokens=20, num_return_sequences=1)
end_time = time.time()
time_taken = end_time - start_time
new_tokens = len(output[0]) - inputs.input_ids.shape[1]
print(f"{time_taken:.3f}s | {new_tokens/time_taken:.3f} tokens/second | {tokenizer.batch_decode(output, skip_special_tokens=True)} | ")
@SunMarc @LysandreJik @ArthurZucker I've adjusted the title to reflect what is really happening here. See the updated table: basically we "borrow" a little time during the first pass to load the weights in, rather than doing so immediately. This lets models load much faster, and calls after the first pass are still quick.
On CUDA I saw nearly no time changes either, aside from loading the model in 0.185s rather than 2s for llama-3-8B, so that's safe too :)
So that we can merge this, for now I've kept the old DeepSpeed behavior in place.
But LGTM otherwise, super congrats on this! 🔥
@ArthurZucker summary of failing tests, not sure how they're supposed to work:
1. `tests/test_pipeline_mixin.py::SummarizationPipelineTests::test_small_model_pt` is failing. This is because `lm_head.weight` was never stored in the checkpoint and it's random. This PR now has it output just `""` rather than the random text. Is that... expected? I feel like that's a mistake, no? Both my version and the prior version show that `lm_head.weight` was never loaded in. If I had to guess, it's because the weight there right now hasn't had a random init since we're loading in from a checkpoint. Advice on moving forward there is appreciated (e.g. should we init/load in the weights not in the checkpoint as random? Not sure how they're currently allocated.)
2. The failing test `tests/models/bart/test_modeling_bart.py::BartModelTest::test_load_save_without_tied_weights` I think is fine, right @SunMarc? (More tied weights failures.)
Regarding 1., this is a problem for BC (backward compatibility), as it means we no longer call _init_weights on the weights that were not loaded.
It's not really a mistake; it covers the case where a user wants to load only the backbone model for training and the LM head is not tied, for example -> you follow the init scheme from the paper / the one implemented in the _init_weights methods of the PreTrainedModel.
I think that is what's happening, no?
Great, makes sense. Yeah, so basically the issue here is that we need to make sure we keep that init in.
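A rough sketch of what "keeping that init" could look like (assumed helper logic, not the actual implementation in this PR; `model` and `loaded_keys` are placeholders):

```python
# `model` is a PreTrainedModel subclass, `loaded_keys` are the keys that actually
# came from the checkpoint. Anything else still needs its usual random init.
missing_keys = set(model.state_dict().keys()) - set(loaded_keys)

# Map each missing key to its owning module (assumes dotted parameter names).
modules_needing_init = {key.rsplit(".", 1)[0] for key in missing_keys}

for name, module in model.named_modules():
    if name in modules_needing_init:
        # _init_weights is the per-module init scheme each PreTrainedModel defines,
        # i.e. the paper's init scheme mentioned above.
        model._init_weights(module)
```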
> The failing test tests/models/bart/test_modeling_bart.py::BartModelTest::test_load_save_without_tied_weights I think is fine, right @SunMarc? (more tied weights failures).
Fixed all failing tests about test_load_save_without_tied_weights in the above commit. These can be safely skipped.
Related: feature request (FR) to torch ao to potentially have lazy upcasting/downcasting. When/if we can solve this, then we should be able to get super fast speeds going in either direction (loading dtype == weight precision, or not): https://github.com/pytorch/pytorch/issues/130480
The latest solution uses a pre-hook to convert the layer to the right dtype. This way we can keep everything mmap'd and only convert each weight on the fly, dynamically (i.e., model init times stay the same). The hook is self-destructive: after the first call it deletes itself.
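A minimal sketch of that pattern in plain PyTorch (names and the toy layer are illustrative, not the actual hook in this PR):

```python
import torch
import torch.nn as nn

def add_one_shot_cast_hook(module: nn.Module, dtype: torch.dtype):
    """Cast `module`'s weights to `dtype` the first time it runs, then remove the hook."""
    def cast_on_first_call(mod, args):
        mod.to(dtype)     # convert the weights only when they are first needed
        handle.remove()   # self-destruct: the hook only ever fires once

    handle = module.register_forward_pre_hook(cast_on_first_call)
    return handle

layer = nn.Linear(16, 16)                      # weights start in float32
add_one_shot_cast_hook(layer, torch.bfloat16)
x = torch.randn(1, 16, dtype=torch.bfloat16)
_ = layer(x)                                   # first call triggers the cast
_ = layer(x)                                   # later calls run with no hook overhead
```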
@amyeroberts @ArthurZucker what I've done here for now: there are a few models that don't support param/buffer assignment, since for one reason or another some of their weights can't map 1:1 / aren't in the state dict / etc., and as a result they can't support this method.
These models now contain a supports_param_buffer_assignment attribute which is set to False. If any new models fail tests such as test_save_and_load_from_pretrained, they need to set supports_param_buffer_assignment=False in their PreTrainedModel definition.
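For illustration (the model class here is hypothetical; the attribute name is the one described above), opting a model out looks like setting the class attribute on its PreTrainedModel subclass:

```python
from transformers import PreTrainedModel

class MyNewModelPreTrainedModel(PreTrainedModel):  # hypothetical model
    # Opt out of the fast assign-based loading path: this model's weights
    # can't be assigned 1:1 from the state dict (see the note above).
    supports_param_buffer_assignment = False
```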