Add `accelerate` support for ViLT
Motivation
Add bnb support for the ViLT model, as requested by a user in https://github.com/TimDettmers/bitsandbytes/issues/14. This requires adding accelerate support for the model.
What does this PR do?
Adds the `_no_split_modules` attribute to the `ViltModel` class to support loading the model with `device_map="auto"`. This also required adding a `.to` operation inside `ViltLayer`.
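For readers less familiar with the mechanism, here is a minimal sketch of the idea; the module name in `_no_split_modules` and the checkpoint are illustrative, not necessarily what was merged in this PR:

```python
from transformers import ViltModel

# Inside modeling_vilt.py, the model class gains an attribute that tells
# accelerate which blocks must stay whole on a single device when the
# weights are dispatched across devices:
#
#     class ViltModel(ViltPreTrainedModel):
#         _no_split_modules = ["ViltLayer"]  # illustrative value
#
# With the attribute defined, the model can be loaded and sharded automatically:
model = ViltModel.from_pretrained("dandelin/vilt-b32-mlm", device_map="auto")
print(model.hf_device_map)  # shows where each block was placed
```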
I also redefined the accelerate tests since, for this model, the hidden states are not deterministic. However, it is possible to check the correctness of the operation through other output attributes such as `logits` or `pooler_output`.
Questions
The test ViltModelIntegrationTest::test_inference_natural_language_visual_reasoning never seems to pass on my machine (i.e. even without `_no_split_modules`); is there something I am missing? It also seems that those tests were failing on the nightly run: https://github.com/huggingface/transformers/runs/7882898294?check_suite_focus=true
cc @NielsRogge @ArthurZucker @ydshieh
> I also redefined the accelerate tests since, for this model, the hidden states are not deterministic.

Could you elaborate on this part in more depth, please?
Sure,
The accelerate tests defined in `test_modeling_common.py`, such as `test_cpu_offload` or `test_model_parallelism`, compare the first element of the output dictionary returned by the model (i.e. `base_output[0]` and `new_output[0]`). This is not correct in the case of ViLT since, in most cases, the first element of the output is a hidden state which appears to be stochastic according to this line. This can also be confirmed by running several inferences and seeing that the hidden states are not the same. This is because ViLT samples image tokens from a multinomial distribution according to the linked line above.
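The stochastic behaviour is easy to reproduce in isolation; the snippet below only illustrates `torch.multinomial` and is not the actual ViLT code:

```python
import torch

# ViLT samples which image-patch positions to keep from a multinomial
# distribution, so without a fixed seed two identical forward passes pick
# different patches and produce different hidden states.
probs = torch.ones(1, 10)  # uniform scores over 10 candidate patch positions
print(torch.multinomial(probs, num_samples=5))  # e.g. tensor([[3, 7, 0, 9, 2]])
print(torch.multinomial(probs, num_samples=5))  # very likely a different set of indices
```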
However, you can still check output correctness using `logits` and `pooler_output` instead of the hidden states, so I added those checks in this PR.
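Sketched out, the redefined check looks roughly like this (the helper name is illustrative; the actual overrides live in the ViLT test file):

```python
import torch

def check_deterministic_outputs(base_output, new_output, atol=1e-5):
    # Compare only the deterministic parts of the output instead of output[0]
    # (the stochastic hidden states): logits for models with a head,
    # pooler_output for the base ViltModel.
    for attr in ("logits", "pooler_output"):
        base, new = getattr(base_output, attr, None), getattr(new_output, attr, None)
        if base is not None and new is not None:
            torch.testing.assert_close(base, new, atol=atol, rtol=1e-5)
```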
Thank you, @younesbelkada
Let me bother you a bit more: why are `logits` and `pooler_output` deterministic while `hidden_states` is not? This is strange to me, but I must be missing some details of this model.
No worries!
After looking at the code in depth, it appears that the pooler and the other heads used by the model take as input the first element of the `hidden_states` variable (the latter having shape `batch_size x nb_tokens x hidden_dim`). The hidden states contain an embedding for each image patch (stochastic) concatenated with the text embeddings (non-stochastic). Therefore, since the heads use the first text embedding (the CLS token) to perform their downstream task, there is no stochasticity involved there; the stochasticity only affects the image-patch embedding side.
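A rough sketch of why that first token keeps the head outputs deterministic (a generic pooler pattern, not copied from modeling_vilt.py):

```python
import torch
from torch import nn

class SimplePooler(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.dense = nn.Linear(hidden_dim, hidden_dim)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # hidden_states: (batch_size, nb_tokens, hidden_dim), with the text
        # embeddings (CLS token at position 0) followed by the stochastically
        # sampled image-patch embeddings.
        first_token = hidden_states[:, 0]  # text side only, hence deterministic
        return self.activation(self.dense(first_token))
```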
This can also be confirmed from the model architecture figure in the ViLT paper (https://arxiv.org/pdf/2102.03334.pdf).
Although one can always take the first hidden states to perform the accelerate tests, I do think the proposed fix is slightly better since it checks the output of modules (e.g. the pooler) that would not be covered if we only compared the first hidden states.
It should still be deterministic according to my intuition; let me have a look.
After looking into it with @ArthurZucker, setting a manual seed before each forward pass in each accelerate test seems to fix the failing tests. We can either keep the changes as they are, or redefine the tests with a seed that is set before each forward pass.
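Concretely, the idea is that re-seeding right before each forward pass makes the stochastic sampling repeatable. A self-contained illustration with a stand-in module (not the actual ViLT code or test diff):

```python
import torch
from torch import nn

class StochasticBlock(nn.Module):
    # Stand-in for ViLT's patch sampling: the output depends on the RNG state.
    def forward(self, x):
        idx = torch.multinomial(torch.ones(x.shape[1]), num_samples=4)
        return x[:, idx]

model = StochasticBlock()
x = torch.randn(2, 16, 8)

torch.manual_seed(0)
out_a = model(x)
torch.manual_seed(0)               # same seed before the second forward pass
out_b = model(x)
assert torch.equal(out_a, out_b)   # identical outputs despite the sampling
```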
I am open to the fix of using a seed; it makes the change smaller. But I am confused by the fact that the daily scheduled CI never reports this issue (I should check the Slack channels too).
https://github.com/huggingface/transformers/runs/7891383348?check_suite_focus=true
The nightly CI you mentioned above uses the nightly torch version (i.e. daily-built torch), for which more test failures are expected.
I think it's better for us to figure out why the relevant tests haven't failed on the scheduled CI for a long period but fail here. This seems contradictory.
Could you mention which hardware you tested on, @younesbelkada?
For the accelerate tests, it is expected that they appeared to pass: since the `_no_split_modules` attribute was never defined on the model class, these tests were skipped and never actually ran.
Regarding the second test, it might indeed be a hardware issue! Here are the details of my setup:
```
Python version: 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
transformers version: 4.22.0.dev0
Torch version: 1.12.1
Cuda available: True
Cuda version: 11.3
CuDNN version: 8302
Number of GPUs available: 2
NCCL version: (2, 10, 3)
DeepSpeed version: None
TensorFlow version: 2.9.1
TF GPUs available: True
Number of TF GPUs available: 2
```
The failing test passes on my VM with atol=4e-2, by the way.
Thanks a lot for your comments @ydshieh & @NielsRogge
@NielsRogge: I can confirm this decorator works fine for ViTMAE too; I replaced the function you pointed me to with:
```python
@set_reproducible(2)
def test_save_load(self):
    super().test_save_load()
```
And the test passes, so I guess we can safely fix stochastic tests with this kind of trick.
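For context, here is a very rough sketch of what such a decorator could look like, under the assumption that it works by re-seeding before every module call (the actual utility discussed in this PR may be implemented differently and is later reverted):

```python
import functools
import torch

def set_reproducible(seed):
    """Hypothetical sketch: re-seed torch before every nn.Module call inside the
    decorated test, so stochastic ops (e.g. multinomial patch sampling) return
    the same values across the repeated forward passes the test performs."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            original_call = torch.nn.Module.__call__

            def seeded_call(module, *call_args, **call_kwargs):
                torch.manual_seed(seed)
                return original_call(module, *call_args, **call_kwargs)

            torch.nn.Module.__call__ = seeded_call  # affects every forward pass
            try:
                return test_fn(*args, **kwargs)
            finally:
                torch.nn.Module.__call__ = original_call
        return wrapper
    return decorator
```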
I would love a quick review from @sgugger or @stas00 if possible 🙏, as I think this decorator can be useful for future models.
Thanks 💪
Just a small comment: in terms of performance, I think the decorator could be improved a bit to only run on the model's main forward rather than on every single forward pass (if I understand correctly, all functions named forward are affected, instead of just the model's main forward pass).
I think @ArthurZucker wanted to enhance the util to make it more memory efficient; I'll let him decide when he feels it is ready to merge 💪
I can confirm that it now fixes the tests for vit_mae as suggested by @NielsRogge. I just added a quick fix to set the seed only on the model's forward. LGTM
Thank you all for your comments!
Reverted the changes related to the context manager and kept the tests as simple as possible!
Can confirm all the slow tests pass now (except for this test, but that is expected as stated in the comment).
I propose to potentially address the stochasticity issue in a follow-up PR and keep this PR focused on its main goal: adding accelerate support for ViLT.
@sgugger thanks for your comment! And sorry for my late reply
I have addressed the proposed suggestion in commit 43b087a and can confirm the slow tests for this model are passing! I hesitated between this solution and having an attribute on ModelTesterMixin, but I thought this solution is simpler to understand for future users (ModelTesterMixin already has a lot of attributes).
I can also take care of opening a follow-up PR to update the ViTMAE tests, to make sure these changes are applied consistently to stochastic-model tests inside transformers.
Perfect, thanks @sgugger! I have addressed the changes now. Let me know if you still need any modifications.