transformers
Adding _tie_weights() to more models
What does this PR do?
This is a follow-up to #28947. It turns out that the CI only runs a small subset of tests, so there are quite a few sneaky failures throughout the tests.
I had to explicitly run the following command (it will take quite a long time): pytest -k "test_save_load_low_cpu_mem_usage" tests/
Going forward, we should probably ask devs to run this command when they modify a common test.
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@amyeroberts
Still working through another 10 failing cases (where the fix is not so obvious), so I'm only marking this as a draft for now.
tests/models/flava/test_modeling_flava.py .....F
tests/models/encodec/test_modeling_encodec.py F
tests/models/fsmt/test_modeling_fsmt.py F
tests/models/lxmert/test_modeling_lxmert.py F
tests/models/marian/test_modeling_marian.py F.
tests/models/musicgen/test_modeling_musicgen.py .F
tests/models/sew/test_modeling_sew.py F
tests/models/sew_d/test_modeling_sew_d.py F
tests/models/timm_backbone/test_modeling_timm_backbone.py F
Cool, I fixed the remaining missing _tie_weights(), and also added a few more test skips for some special audio/vision models (many of them failing due to their use of nn.utils.weight_norm).
I ran the following and made sure all the tests pass:
pytest -k test_save_load_low_cpu_mem_usage tests/
Also, what do you think about documenting the need to run the above command when modifying common tests? Perhaps it could go on the Contributions page? I can follow up on this if you think it makes sense.
The test failure in tests_tf looks unrelated. Any chance you can kick off a re-run of the CI checks? 🙏
Also, I've verified again that pytest -k test_save_load_low_cpu_mem_usage tests/ passes.
cc @SunMarc :)
Thanks for the review!
I added the explanation of tie_weights() from my research, but it'd be great to get some feedback from someone who's more knowledgeable on this.
cc @SunMarc @muellerzr
Don't mean to be pushy, but the tests for the models in this change are currently broken in main/HEAD, so I'd be grateful if you could give this a look in the next couple of days. Thanks!
I added PR #29118 to skip the failing tests, so we'd have more time to discuss this change. Feel free to comment/review whenever you get a chance. Thanks and greatly appreciate your input on this!
For more context, the tie_weights() I'm adding here should enable loading those models with low_cpu_mem_usage=True (it's currently unsupported for those models).
The tie_weights() should also be helpful in getting accelerate to work on more models, since we need it to properly infer the device map.
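As a concrete illustration (the checkpoint name below is just a placeholder, not one of the models in this PR), this is roughly the loading path that the added tie_weights() unblocks:

```python
# Placeholder checkpoint name; any of the models touched here would be loaded the same way.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "some-org/some-checkpoint",
    low_cpu_mem_usage=True,  # build the model skeleton on the meta device, then stream the
                             # weights in without keeping a second full copy in RAM
)
```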
Cheers.
Sorry, I added the wrong link in the PR description; this PR is a follow-up of #28948. There's more context in that link (tl;dr: adding the tie_weights() enables those models to be loaded with low_cpu_mem_usage=True).
We're adding new functionality with these tie_weights(): we're basically adding support for low_cpu_mem_usage=True for all of these models.
This kind of snowballed out of another, unrelated change for SDPA (#28802), since the common test for SDPA uses low_cpu_mem_usage. I looked into it and figured it would be a good idea to add support for low_cpu_mem_usage to a bunch of other models while I was at it.
- this feels like a big regression: having to write and add def _tie_weights(self) for all of these models is a bit strange to me
- are we not changing the API by supporting bias tying? If yes, I am against it. That should only be done manually by the user!
The weights were already tied in __init__ (with self.decoder.bias = self.bias), but they get untied when loading through low_cpu_mem_usage and need to be re-tied.
If we are not loading with low_cpu_mem_usage, those biases will already be tied. If we save with save_pretrained, only one copy of those biases will be saved.
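As a minimal sketch of the pattern (the module and argument names here are generic, not taken from any specific model in this PR):

```python
import torch
from torch import nn


class SomeLMPredictionHead(nn.Module):  # generic sketch of the prediction-head pattern
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(vocab_size))
        # Tied at construction time: decoder.bias and self.bias are the same Parameter.
        self.decoder.bias = self.bias

    def _tie_weights(self):
        # Loading with low_cpu_mem_usage=True materializes parameters off the meta device
        # and can leave the two attributes pointing at different tensors, so re-tie them here.
        self.decoder.bias = self.bias

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.decoder(hidden_states)
```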
- could you sum up in 1 line what was wrong with the previous behaviour and why we need to tie biases for these models?
Those models fail to load with low_cpu_mem_usage=True.
- copied from should be used
Good point. A lot of these prediction heads were more-or-less copied from other existing heads, but I'm not sure why they weren't marked as copied-from. I'll see if I can add back some fix-copies.
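For reference, the marker looks roughly like this (the source path and rename below are illustrative of the convention, not taken from this PR):

```python
from torch import nn


# Copied from transformers.models.bert.modeling_bert.BertLMPredictionHead with Bert->Other
class OtherLMPredictionHead(nn.Module):
    # `make fix-copies` keeps the body in sync with the referenced source (applying the rename),
    # so a fix like _tie_weights() only needs to be written once.
    ...
```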
Hi @hackyon, I haven't followed your work (btw, thank you for the contribution!), but I'm just jumping in to let you know: if you put the prefix [test all] in a commit message, that commit will trigger a full CI run. For example, a commit message like [test all] check my commit is perfect.
This way, you have an easy way to check that the changes are all good.
(Our test fetcher is a trade-off between coverage and speed, but we will try to improve it.)
Thanks for the details! Alright, let's make sure we take everything into account: safe serialization and unsafe (bin) as well. You can also use copied-from for single functions; the idea is to make sure we have the fix in a single place and the rest is just a copy of the fix! Same for the tests: copied-from can be used for the new test.
The latest commit should be based off of main/HEAD and has all the necessary changes to tie_weights().
It includes additional tests for safetensors and also checkpoint bins, plus more checks to ensure that the models are equivalent after loading. I added @slow to them since there are a lot more checks now and the whole thing takes longer to run.
I checked, and many of the LM heads already have copied-from; the ones that don't tend to have a good reason not to.
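For context, the gating mentioned above is just the standard decorator from the test utilities; a rough sketch (the class placement and test body here are illustrative only):

```python
from transformers.testing_utils import slow


class SaveLoadLowCpuMemUsageTestSketch:  # illustrative placement, not the actual test mixin
    @slow
    def test_save_load_low_cpu_mem_usage_checkpoints(self):
        # save_pretrained -> from_pretrained(low_cpu_mem_usage=True) for both safetensors and
        # .bin checkpoints, then compare the reloaded model against the original
        ...
```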
@SunMarc @younesbelkada
Any other insights or concerns over the use of tie_weights() here? Thanks!
@SunMarc @younesbelkada
I just sync'd latest HEAD. Would you mind taking a quick look? Thanks!
Cool, thanks for your review @SunMarc!
@ArthurZucker - would you mind taking a quick look again? Thank you 🙏
The modification on the models looks good to me! The only thing I'm concerned about is the tests. Do we really want to add tests for low_cpu_mem_usage=True, since:
- we would have to skip tests if they are not passing for all models
- they take a lot of time to run
- these tests are kind of already covered by the accelerate_tests (test_disk_offload_bin, test_disk_offload_safetensors, test_cpu_offload, test_model_parallelism) if we make the model compatible with device_map
Good point. I believe that just test_save_load_low_cpu_mem_usage would probably be sufficient, but I added the other tests per a suggestion in one of the review comments. The tests only run under RUN_SLOW, but if they are redundant, then maybe they're still not worth adding.
I'll leave it up to @ArthurZucker to decide.
If we manage to define the no_split_modules (should be pretty easy to make it work), low_cpu_mem_usage=True is the same as doing device_map = {"": "cpu"} or device_map = "cpu", and it would also work for multi-GPU, CPU, and disk offload. Contrary to device_map, where the model needs _no_split_modules to be defined, the user can freely use the low_cpu_mem_usage argument. So we should definitely test this somehow, or warn the users that it might not work properly. LMK what you think @hackyon @ArthurZucker!
I believe low_cpu_mem_usage should work even without no_split_modules, no? AFAICT the low_cpu_mem_usage flow doesn't do any model splitting; it just tries to load the model with the meta device and keep only one copy of the weights in memory at a time.
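To make the two paths being compared concrete, here's a rough sketch (the checkpoint name is a placeholder):

```python
from transformers import AutoModel

# No _no_split_modules needed on the model class: weights are streamed in via the meta device.
model_a = AutoModel.from_pretrained("some-org/some-checkpoint", low_cpu_mem_usage=True)

# Roughly the same end state (everything on CPU), but it goes through accelerate's dispatch
# path, which per the discussion above requires the model class to define _no_split_modules.
model_b = AutoModel.from_pretrained("some-org/some-checkpoint", device_map={"": "cpu"})
```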
Yea, I won't mind helping to add more no_split_modules if that is something you think is relatively high priority. From what I can tell, it's just a matter of checking for residual connections and making sure those are not split? I get the sense that there is a bit of a backlog of PRs recently (and I understand some of you might be overworked!), so I wouldn't want to push on new initiatives unnecessarily.
Hi @hackyon, thanks for still working on this PR !
> Yea, I won't mind helping to add more no_split_modules if that is something you think is relatively high priority. From what I can tell, it's just a matter of checking for residual connections and making sure those are not split? I get the sense that there is a bit of a backlog of PRs recently (and I understand some of you might be overworked!), so I wouldn't want to push on new initiatives unnecessarily.
I think it would be great to add no_split_modules to the most used models on the Hub! Whenever a new model comes out, we try to make it compatible with device_map, but for older models we might have missed some. I can definitely review your PR if you do so! Related issue.
@SunMarc Nice! Sounds like there is some activity in #29786.
Do you have permission to merge open source code into main? If not, do you know who else might? I do have some bandwidth to work on it over the next 2-3 weeks, but I'd like to get this current PR in first before I start anything else (so far I'm not receiving responses on my PRs, so there's no point in contributing more at this point).
Hi @hackyon, I can indeed merge, but the code needs to be validated by a core maintainer. Could you have a look, @ArthurZucker? Thanks again @hackyon for your patience.
Let me have a look!
Thank you for taking a look and merging! 🙏