transformers
Extend save_pretrained to offloaded models
What does this PR do?
Fixes #20072 and addresses the second part of https://github.com/huggingface/peft/issues/868
Models with offloaded weights are currently incompatible with save_pretrained. This PR allows large models that are loaded onto the gpu and cpu to be saved, which is particularly useful for big models that have undergone merging and unloading via https://github.com/huggingface/peft/pull/1063.
The implementation iterates through the model's modules and onloads each module's parameters to its execution device (typically the gpu), then moves the corresponding state-dict entries to the cpu in place, where the final state dictionary is assembled and saved.
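Roughly, the per-module flow looks like the sketch below (illustrative only, not the exact code in this PR; it assumes the model was dispatched with accelerate so that offloaded modules carry an `AlignDevicesHook` in `_hf_hook`):

```python
# Sketch of the onload-then-copy loop described above; assumes an
# accelerate-dispatched model whose offloaded modules hold an AlignDevicesHook.
import torch
from accelerate.hooks import AlignDevicesHook


def collect_offloaded_state_dict(model):
    state_dict = {}
    for name, module in model.named_modules():
        hook = getattr(module, "_hf_hook", None)
        is_offloaded = isinstance(hook, AlignDevicesHook) and hook.offload
        if is_offloaded:
            # Onload this module's weights to its execution device (typically the gpu).
            hook.pre_forward(module)
        for param_name, param in module.named_parameters(recurse=False):
            # Copy each tensor to cpu so the assembled state dict never contains "meta" entries.
            key = f"{name}.{param_name}" if name else param_name
            state_dict[key] = param.detach().to("cpu")
        if is_offloaded:
            # Offload again so only one module is materialized at a time.
            hook.post_forward(module, None)
    return state_dict
```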
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Still working on the tests (some small models are not compatible with offloading due to architectural considerations), but I'm happy to submit a Colab version with a large model in the meantime :)
Who can review?
Anyone! @pacman100
cc @LysandreJik
Curious what you think @muellerzr @pacman100
Overall I can see this being a pretty nice idea. Made some nits to improve. cc @SunMarc for your thoughts as well :)
Would it be possible to eventually offload this to accelerate?
Yes, we can definitely put it in a get_offloaded_model_state_dict() function in accelerate and use it here.
Thanks very much for reviewing everyone! @muellerzr @SunMarc your suggestions are much appreciated, I have gone ahead and implemented pretty much all of them.
A test for save_pretrained applied to a model with offloaded parameters has been added to test_modeling_utils.py. I previously used an equality test on the state dictionaries themselves and can add that too if that would be desirable.
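For reference, the shape of the test is roughly as follows (a sketch rather than the exact test; the tiny checkpoint, memory limits, and comparison are illustrative, and a gpu at index 0 plus disk offload are assumed):

```python
# Sketch of the added test's structure; the model id, max_memory values and the
# comparison are illustrative, and a gpu at index 0 plus disk offload are assumed.
import tempfile

import torch
from transformers import AutoModelForCausalLM

model_id = "hf-internal-testing/tiny-random-gpt2"  # illustrative tiny checkpoint

with tempfile.TemporaryDirectory() as offload_dir, tempfile.TemporaryDirectory() as save_dir:
    # Reference copy fully on cpu, plus a copy with part of the model offloaded to disk.
    reference = AutoModelForCausalLM.from_pretrained(model_id)
    offloaded = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        max_memory={0: "1MB", "cpu": "1MB"},
        offload_folder=offload_dir,
    )

    # save_pretrained should now onload/offload module by module under the hood.
    offloaded.save_pretrained(save_dir)
    reloaded = AutoModelForCausalLM.from_pretrained(save_dir)

    # The reloaded weights should match the reference checkpoint.
    for key, ref_tensor in reference.state_dict().items():
        assert torch.allclose(ref_tensor, reloaded.state_dict()[key])
```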
@SunMarc sounds good to me! Thanks for the review and for giving pointers on where the method would live as well. I will open a PR in accelerate for adding the function get_state_dict_offloaded_model() with the proper integrations.
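Once that lands, the transformers-side integration could be as small as something like this (sketch; the final call site may differ):

```python
# Hypothetical call site once the accelerate helper exists; the exact wiring
# inside save_pretrained may differ.
from accelerate.utils import get_state_dict_offloaded_model

state_dict = get_state_dict_offloaded_model(model)  # every tensor materialized on cpu
model.save_pretrained("path/to/save", state_dict=state_dict)
```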
@ArthurZucker thanks much for the review!
Just to make sure I understand correctly, are you referring to adding compatibility for an offloaded model to be saved when it does not fit in cpu memory, with a possible method being onloading the state dictionary in parts and saving these parts as shards?
I certainly agree that it would be best to be able to save models that don't fit in cpu memory, rather than just gpu memory as is currently the case.
As a general question to guide this, should these changes be added here or in accelerate? Given that most of this PR's additions were sent to https://github.com/huggingface/accelerate/pull/2156, we could implement this compatibility feature either in transformers' modeling_utils or in accelerate itself.
It's more that it does not make sense to me to support save_pretrained for offloaded models if we don't support the most common use case for offloading, which is that the RAM is not big enough to hold the model (you can't save from the GPU anyway)! This could come in a follow-up PR, but it clearly makes more sense to support both instead of overloading the RAM even when it is big. It will also be faster.
We should be good to go for saving models too large to fit in cpu memory (although I am still cleaning up the code and adding to the test). The approach is to map shard keys to modules and onload each shard's keys just before saving that shard, then offload once again.
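Roughly, the per-shard flow looks like this (a simplified sketch rather than the PR's code; shard construction and hook handling are condensed):

```python
# Simplified sketch of saving one shard at a time; `shards` maps shard filenames
# to the state-dict keys they contain, and the model is assumed to carry
# accelerate AlignDevicesHook hooks on its offloaded modules.
import os

import torch
from accelerate.hooks import AlignDevicesHook


def save_shards(model, shards, save_directory):
    for shard_file, keys in shards.items():
        shard_state_dict, onloaded = {}, {}
        for key in keys:
            module_name, _, param_name = key.rpartition(".")
            module = model.get_submodule(module_name) if module_name else model
            hook = getattr(module, "_hf_hook", None)
            if isinstance(hook, AlignDevicesHook) and hook.offload and module not in onloaded:
                hook.pre_forward(module)  # onload just before this shard is written
                onloaded[module] = hook
            param = dict(module.named_parameters(recurse=False))[param_name]
            shard_state_dict[key] = param.detach().to("cpu")
        torch.save(shard_state_dict, os.path.join(save_directory, shard_file))
        for module, hook in onloaded.items():
            hook.post_forward(module, None)  # offload again before the next shard
```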
Safe serialization presents a wrinkle in that meta tensors are by default recognized as shared tensors. I am not sure there is a good way around this without onloading all tensors, which is a no-go if we are trying to preserve memory. Open to suggestions on this, however.
Asking @muellerzr to do a first pass before me!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
The code has now been refactored and the test updated; everything is passing now.
I have run the test externally with a small max_shard_size to confirm that the model only loads the specified amount into memory at any one time (checked via linux tools). A memory trace (e.g. via tracemalloc or memory_profiler) can be added to the test if that is considered desirable too.
Went ahead and added cpu memory checks via psutil to the test. The idea is that model.save_pretrained will onload no more than max_shard_size of memory from disk to the cpu at any given time during the saving process, accounting for a margin of error that is somewhat model-dependent.
This is perhaps easier to check with a large model, but I think this test (for a very small model) is still valid as-is: if we increase the shard size substantially (or use the default shard size which is larger than the model size) the memory check fails.
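The check boils down to something like the sketch below (illustrative; `offloaded_model`, the shard size, and the margin are placeholders rather than the values used in the actual test):

```python
# Illustrative form of the psutil check; the shard size, margin factor and the
# before/after measurement points are placeholders, not the test's exact values.
import tempfile

import psutil

process = psutil.Process()
max_shard_size_bytes = 200 * 1024  # matches the "200KB" passed below

# `offloaded_model` is assumed to be a model loaded with a gpu/cpu/disk device_map.
with tempfile.TemporaryDirectory() as tmp_dir:
    rss_before = process.memory_info().rss
    offloaded_model.save_pretrained(tmp_dir, max_shard_size="200KB")
    rss_after = process.memory_info().rss

# Saving should only ever onload about one shard's worth of weights at a time,
# plus a model-dependent margin.
assert rss_after - rss_before < max_shard_size_bytes * 1.5
```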
Thanks a lot, I'm just waiting for @muellerzr to run a first pass and will review as well 🤗
No problem, sounds good to me! I have also performed tests on disk storage used and there does not appear to be any change in storage usage compared to the head branch. I am not sure if we need a disk storage check added to the current test, however.
After some re-examination this approach seems fairly robust and straightforward with the exception of the sharded memory allocation. It appears to me that the current block allocation using id_tensor_storage() will not suffice for saving offloaded modules (hence the extra conditions added to avoid classifying 'meta' device tensors as shared tensors), but I may be overlooking something there @muellerzr
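Concretely, the kind of condition I mean looks like this (illustrative sketch of the grouping, not the code in modeling_utils):

```python
# Sketch of why the extra condition is needed: tensors still on the "meta"
# device have no real storage, so grouping by id_tensor_storage() would lump
# them all together and misreport them as shared/tied weights.
import collections

import torch
from transformers.pytorch_utils import id_tensor_storage


def group_possibly_shared_tensors(state_dict):
    ptrs = collections.defaultdict(list)
    for name, tensor in state_dict.items():
        if tensor.device == torch.device("meta"):
            # Offloaded tensor: skip it here and resolve it only after onloading.
            continue
        ptrs[id_tensor_storage(tensor)].append(name)
    # Names that share a storage id are candidates for genuinely tied weights.
    return [names for names in ptrs.values() if len(names) > 1]
```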
Added a warning for the user to check that the free memory exceeds the shard size, which may not be the case with `device_map="auto"` and would otherwise cause saving to fail without a clear reason. I have tested on some recent models with more exotic architectures (Mixtral 8x7B etc.) and everything looks good there too.
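The warning is roughly along these lines (illustrative sketch; the exact wording and threshold in the PR may differ, and `max_shard_size` is assumed to be the string passed to save_pretrained):

```python
# Illustrative version of the free-memory warning; wording and threshold are
# placeholders, and `max_shard_size` is the value passed to save_pretrained.
import logging

import psutil
from accelerate.utils import convert_file_size_to_int

logger = logging.getLogger(__name__)

shard_size_bytes = convert_file_size_to_int(max_shard_size)  # e.g. "5GB" -> bytes
if psutil.virtual_memory().available < shard_size_bytes:
    logger.warning(
        "Available cpu memory is smaller than max_shard_size; onloading a shard "
        "while saving an offloaded model may fail. Consider lowering max_shard_size."
    )
```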
cc @ArthurZucker
cc @SunMarc could you review first, since Zach is off?
BTW great work and thanks so much for your patience!
@ArthurZucker no problem, thanks for continuing with this!
Thanks very much for your review @SunMarc ! I would be happy with moving some of this to accelerate, that seems like a natural place especially for the code you pointed to. Nice one thinking of using find_tied_parameters too, that should solve the problem of getting tied weights on the meta device. I should have time to go through your suggestions in the coming days and will add a PR in accelerate for get_state_dict_from_offload as well.
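For reference, the find_tied_parameters idea looks roughly like this (sketch; the actual integration in the PR may differ):

```python
# Sketch of using accelerate's find_tied_parameters to spot tied weights even
# when some of them currently live on the "meta" device; usage is illustrative.
from accelerate.utils import find_tied_parameters

tied_groups = find_tied_parameters(model)  # e.g. [["lm_head.weight", "transformer.wte.weight"]]

# Keep one name per tied group and drop the rest from what gets serialized;
# the dropped weights are re-tied when the checkpoint is loaded.
drop = {name for group in tied_groups for name in group[1:]}
keys_to_save = [key for key in model.state_dict() if key not in drop]
```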
I have finished sending the accelerate-specific code to that library and the necessary integrations here, and after testing with some large models everything looks good.
Just for everyone's awareness, it looks like the psutil memory check in the test_save_offloaded_model test is not entirely reliable (it fails around one in five times on my local machine). This appears to be due to the inherent noise in background process memory allocation relative to the small test model used, as adapting the test to much larger models removes the issue. I am not sure that we would want to use a large model, however, as that makes the test quite slow.
@SunMarc @ArthurZucker
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@SunMarc thanks very much for taking a look! No worries, I have been very busy too and would not have had much time to work on this before now anyway. I plan to make time to go through your suggestions tomorrow and will let you know if I can't make the finishing touches myself, in which case you would be more than welcome to do so.
@SunMarc went ahead and implemented your suggestions, thanks! I was not sure, however, which accelerate version we should check for (the current version, if we merge PR #2619 asap, or the next one?).
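For illustration, a plain version comparison would look like this (sketch; the minimum version string is a placeholder until we know which accelerate release ships the helper):

```python
# Sketch of a version gate; the minimum version below is a placeholder until we
# know which accelerate release actually ships get_state_dict_from_offload.
import importlib.metadata

from packaging import version

MIN_ACCELERATE_VERSION = "0.31.0"  # placeholder, not the confirmed minimum


def accelerate_supports_offloaded_save() -> bool:
    try:
        installed = importlib.metadata.version("accelerate")
    except importlib.metadata.PackageNotFoundError:
        return False
    return version.parse(installed) >= version.parse(MIN_ACCELERATE_VERSION)
```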
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thank you again @blbadger for your patience and your work ! I really appreciate your contribution 🔥 Congrats on merging this amazing feature !
Happy to contribute! Thanks very much @SunMarc for shepherding this through and @amyeroberts @muellerzr @ArthurZucker for your reviews.