
Adding _tie_weights() to more models

Open · hackyon opened this pull request 1 year ago · 6 comments

What does this PR do?

This is a follow-up to #28947. It turns out that the CI only runs a small subset of tests, so there are quite a few sneaky failures throughout the test suite.

I had to explicitly run the following command (it will take quite a long time): pytest -k "test_save_load_low_cpu_mem_usage" tests/

Going forward, we should probably ask devs to run this command when they modify a common test.

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [x] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@amyeroberts

hackyon avatar Feb 14 '24 22:02 hackyon

Still working through another 10 failing cases (where the fix isn't so obvious), so I'm only marking this as a draft for now.

tests/models/flava/test_modeling_flava.py .....F
tests/models/encodec/test_modeling_encodec.py F
tests/models/fsmt/test_modeling_fsmt.py F
tests/models/lxmert/test_modeling_lxmert.py F
tests/models/marian/test_modeling_marian.py F.
tests/models/musicgen/test_modeling_musicgen.py .F
tests/models/sew/test_modeling_sew.py F
tests/models/sew_d/test_modeling_sew_d.py F
tests/models/timm_backbone/test_modeling_timm_backbone.py F

hackyon avatar Feb 14 '24 22:02 hackyon

Cool, I fixed the remaining missing _tie_weights(), and also added some more test skips for some special audio/vision models (many of which fail due to their use of nn.utils.weight_norm).
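For reference, the skips follow the usual unittest pattern; here's a minimal sketch (the test class and reason string are illustrative, not the exact ones from this PR):

```python
import unittest

class SEWModelTest(unittest.TestCase):
    # Hypothetical example: models that wrap layers in nn.utils.weight_norm
    # re-parametrize their weights, which breaks the assumptions of the
    # low-CPU-memory save/load round-trip test, so the test is skipped.
    @unittest.skip(reason="SEW uses nn.utils.weight_norm, which is incompatible with this test")
    def test_save_load_low_cpu_mem_usage(self):
        pass
```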

I ran the following and made sure all the tests pass: pytest -k test_save_load_low_cpu_mem_usage tests/

Also, what do you think about documenting the need to run the above command when modifying common tests? Perhaps it could go on the Contributions page? I can follow up on this if you think it makes sense.

hackyon avatar Feb 15 '24 15:02 hackyon

The test failure in tests_tf looks unrelated. Any chance you can kick off a re-run of the CI checks? 🙏

Also, I've verified again that pytest -k test_save_load_low_cpu_mem_usage tests/ passes.

hackyon avatar Feb 15 '24 19:02 hackyon

cc @SunMarc :)

muellerzr avatar Feb 15 '24 20:02 muellerzr

Thanks for the review!

I added an explanation of tie_weights() based on my research, but it'd be great to get feedback from someone who's more knowledgeable about this.

hackyon avatar Feb 15 '24 20:02 hackyon

cc @SunMarc @muellerzr

I don't mean to be pushy, but the tests for the models in this change are currently broken in main/HEAD, so I'd be grateful if you could give this a look in the next couple of days. Thanks!

hackyon avatar Feb 16 '24 17:02 hackyon

cc @SunMarc @muellerzr

I added PR #29118 to skip the failing tests, so we'd have more time to discuss this change. Feel free to comment/review whenever you get a chance. Thanks, and I greatly appreciate your input on this!

For more context, the tie_weights() I'm adding here should enable loading those models with low_cpu_mem_usage=True (it's currently unsupported for those models).

The tie_weights() should also help get accelerate working on more models, since accelerate needs the tie information to infer the device map properly.
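To make that concrete, here's a hedged sketch of the accelerate side (the checkpoint and memory limits are illustrative): device-map inference has to know which parameters are tied so it counts their memory once and places them on the same device.

```python
from accelerate import infer_auto_device_map
from accelerate.utils import find_tied_parameters
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Groups of parameter names that share the same storage, e.g. a decoder
# weight tied to the input embeddings.
print(find_tied_parameters(model))

# Tied parameters must be counted once and kept together when splitting
# the model across a GPU and CPU RAM.
device_map = infer_auto_device_map(model, max_memory={0: "2GiB", "cpu": "8GiB"})
```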

Cheers.

hackyon avatar Feb 19 '24 20:02 hackyon

Sorry, I added the wrong link in the PR description; this PR is a follow-up to #28948. There's context in that link (tl;dr: adding the tie_weights() enables those models to be loaded with low_cpu_mem_usage=True).

We're adding new functionality with these tie_weights(). We're basically adding support for low_cpu_mem_usage=True for all these models.

The functionality kind of snowballed out of another unrelated change for SDPA, #28802, since the common test for SDPA uses low_cpu_mem_usage. I looked into it and figured it would be a good idea to add support for low_cpu_mem_usage to a bunch of models while I was at it.

  • this feels like a big regression: having to write and add the def _tie_weights(self): for all of these models is a bit strange to me
  • are we not changing the API by supporting bias tying? If yes, I am against it. That should only be done manually by the user!

The weights were already tied in __init__ (with self.decoder.bias = self.bias), but they get untied when loading through low_cpu_mem_usage and need to be re-tied.

If we are not loading with low_cpu_mem_usage, those biases will already be tied. If we save with save_pretrained, only one copy of those biases will be saved.
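For concreteness, this is roughly the shape of the change on a typical LM prediction head; a sketch following the common BERT-style head pattern rather than the exact diff:

```python
import torch
from torch import nn

class LMPredictionHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(vocab_size))
        # Tied at construction: decoder.bias and self.bias are the same tensor.
        self.decoder.bias = self.bias

    def _tie_weights(self):
        # low_cpu_mem_usage=True materializes parameters from the meta device
        # one at a time, which breaks the aliasing set up in __init__, so the
        # bias has to be re-tied after loading.
        self.decoder.bias = self.bias
```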

  • could you sum up in 1 line what was wrong with the previous behaviour and why we need to tie biases for these models?

Those models fail to load with low_cpu_mem_usage=True.
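A minimal repro of what that looks like from the user's side (the checkpoint name is a hypothetical placeholder for one of the affected models):

```python
from transformers import AutoModelForMaskedLM

# Hypothetical placeholder checkpoint: without _tie_weights(), this call
# left the tied bias on the meta device and failed, while the default
# (low_cpu_mem_usage=False) path worked.
model = AutoModelForMaskedLM.from_pretrained(
    "some-org/affected-model", low_cpu_mem_usage=True
)
```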

  • copied from should be used

Good point. A lot of these prediction heads were more or less copied from other existing heads, but I'm not sure why they weren't marked copied-from. I'll see if I can add back some fix-copies.

hackyon avatar Feb 20 '24 12:02 hackyon

Hi @hackyon, I haven't followed your work (btw, thank you for the contribution!), but I'm just jumping in to let you know:

If you put the prefix [test all] in a commit message, that commit will trigger a full CI run.

For example, a commit message like "[test all] check my commit" is perfect.

This way, you have an easy way to check if the changes are all good.

(Our test fetcher is a trade-off between coverage and speed, but we will try to improve it.)

ydshieh avatar Feb 20 '24 12:02 ydshieh

Thanks for the details! Alright, let's make sure we take everything into account: safe serialization and unsafe (bin) as well. You can also just use copied from for single functions; the idea is to make sure we have the fix in a single place and the rest is just a copy of the fix! Same for the test: copied from can be used for the new test.

ArthurZucker avatar Feb 20 '24 13:02 ArthurZucker

The latest commit is based off of main/HEAD and has all the necessary changes to tie_weights().

It includes additional tests for safetensors as well as checkpoint bins, plus more checks to ensure that the models are equivalent after loading. I added @slow to them since there are a lot more checks now and the whole thing takes longer to run.

I checked, and many of the LM heads already have copied-from; the ones that don't tend to have a good reason not to.

hackyon avatar Feb 23 '24 14:02 hackyon

@SunMarc @younesbelkada

Any other insights or concerns over the use of tie_weights() here? Thanks!

hackyon avatar Mar 06 '24 03:03 hackyon

@SunMarc @younesbelkada

I just synced with the latest HEAD. Would you mind taking a quick look? Thanks!

hackyon avatar Mar 19 '24 15:03 hackyon

Cool, thanks for your review @SunMarc!

@ArthurZucker - would you mind taking a quick look again? Thank you 🙏

The modifications to the models look good to me! The only thing I'm concerned about is the tests. Do we really want to add tests for low_cpu_mem_usage=True, given that:

  • we would have to skip tests if they are not passing for all models
  • they take a lot of time to run
  • these tests are kind of already covered by the accelerate tests (test_disk_offload_bin, test_disk_offload_safetensors, test_cpu_offload, test_model_parallelism) if we make the model compatible with device_map

Good point. I believe that just test_save_load_low_cpu_mem_usage would probably be sufficient, but I added the other tests per the suggestion in one of the review comments. The tests only run under RUN_SLOW, but if they're redundant then maybe they're still not worth adding.

I'll leave it up to @ArthurZucker to decide.

If we manage to define the _no_split_modules (it should be pretty easy to make it work), low_cpu_mem_usage=True is the same as doing device_map = {"": "cpu"} or device_map = "cpu", and it would also work for multi-GPU, CPU, and disk offload. Contrary to device_map, where the model needs _no_split_modules to be defined, the user can freely use the low_cpu_mem_usage argument. So we should definitely test this somehow, or warn users that it might not work properly. LMK what you think @hackyon @ArthurZucker!

I believe low_cpu_mem_usage should work even without no_split_modules, no? AFAICT the low_cpu_mem_usage flow doesn't do any model splitting; it just loads the model via the meta device and tries to keep only one copy of the weights in memory at a time.
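Roughly, as a conceptual sketch of that flow (not the actual from_pretrained internals):

```python
import torch
from torch import nn

# Build the module on the meta device: parameters have shapes but no storage.
with torch.device("meta"):
    head = nn.Linear(4, 4)

# Materialize real tensors from a checkpoint one at a time.
checkpoint = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}
for name, tensor in checkpoint.items():
    # Replacing a meta parameter with a real one drops any Python-level
    # aliasing set up in __init__, which is why _tie_weights() has to
    # re-establish ties afterwards.
    setattr(head, name, nn.Parameter(tensor))
```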

Yeah, I wouldn't mind helping to add more no_split_modules if that's something you think is relatively high priority. From what I can tell, it's just a matter of checking for residual connections and making sure those aren't split? I get the sense that there's a bit of a backlog of PRs recently (and I understand some of you might be overworked!), so I wouldn't want to push new initiatives unnecessarily.

hackyon avatar Apr 16 '24 17:04 hackyon

Hi @hackyon, thanks for still working on this PR!

Yeah, I wouldn't mind helping to add more no_split_modules if that's something you think is relatively high priority. From what I can tell, it's just a matter of checking for residual connections and making sure those aren't split? I get the sense that there's a bit of a backlog of PRs recently (and I understand some of you might be overworked!), so I wouldn't want to push new initiatives unnecessarily.

I think it would be great to add no_split_modules to the most-used models on the Hub! Whenever a new model comes out we try to make it compatible with device_map, but for older models we might have missed some. I can definitely review your PR if you do so! Related issue.
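For reference, opting a model into device_map support is mostly a one-line declaration; a minimal sketch with hypothetical class names (the real list usually names the model's transformer block, so residual connections stay on one device):

```python
from transformers import PretrainedConfig, PreTrainedModel

class MyConfig(PretrainedConfig):
    model_type = "my-model"

class MyPreTrainedModel(PreTrainedModel):
    config_class = MyConfig
    # accelerate keeps every instance of the listed module classes whole when
    # computing a device_map, so a block is never split across devices.
    _no_split_modules = ["MyDecoderLayer"]
```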

SunMarc avatar Apr 17 '24 09:04 SunMarc

@SunMarc Nice! Sounds like there is some activity in #29786.

Do you have permission to merge code into main? If not, do you know who does? I have some bandwidth to work on it over the next 2-3 weeks, but I'd like to get this current PR in first before I start anything else (so far I'm not receiving responses on my PRs, so there's no point in contributing more at this point).

hackyon avatar Apr 22 '24 13:04 hackyon

Hi @hackyon, I can indeed merge, but the code needs to be validated by a core maintainer. Could you have a look, @ArthurZucker? Thanks again @hackyon for your patience.

SunMarc avatar Apr 22 '24 13:04 SunMarc

Let me have a look!

ArthurZucker avatar Apr 23 '24 12:04 ArthurZucker

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Thank you for taking a look and merging! 🙏

hackyon avatar May 07 '24 15:05 hackyon