transformers
Fix position embeddings for GPT-J and CodeGen
What does this PR do?
Identical inputs to GPT-J and CodeGen models will currently generate different outputs if they are padded differently (for example in a batch of variable sequence lengths).
This PR reverts the recent change #21869 that removes GPT-J position_ids, and then applies similar changes as were done for GPT-J XLA in #17986.
~One copy of the precomputed position embeddings is shared between all of the layers.~
Related issue: #21080
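For background, a minimal sketch (illustrative only, not code from this PR) of why the padding matters: when no position_ids are passed, positions are assigned 0..seq_len-1, so left-padding shifts the real tokens' positions, whereas deriving position_ids from the attention mask keeps them stable:
```python
import torch

# hypothetical left-padded batch: 0 marks padding in the attention mask
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

# naive positions ignore the padding, so the real tokens of row 0 are shifted by 2
naive_position_ids = torch.arange(attention_mask.shape[-1]).expand_as(attention_mask)

# positions derived from the mask start at 0 at the first real token
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)

print(naive_position_ids)  # tensor([[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]])
print(position_ids)        # tensor([[1, 1, 0, 1, 2], [0, 1, 2, 3, 4]])
```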
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@gante
I have tested this with my own code/use case but wanted to check that there is interest in the contribution before also updating any applicable unit tests.
I also wonder whether there should be a universal test applied to all models that just tests the same input with different amounts of padding and makes sure that the output is identical?
@njhill and yes, the contribution is deeply appreciated! 🙏
Be mindful that this will not result in making the outputs left-padding agnostic. As in all models, the padding is a numerical mask. In FP32, it is almost left-padding agnostic, but in FP16/BF16/INT8 the left-padding may introduce changes :)
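To illustrate what "the padding is a numerical mask" means in practice, a toy sketch (not the model's actual attention code): padded key positions get a very large negative bias before the softmax, so they receive effectively zero weight, and only the rounding behaviour of the chosen dtype decides how close to padding-agnostic the output is:
```python
import torch

# toy attention scores for one query over 4 key positions; the last one is padding
scores = torch.tensor([2.0, 1.0, 0.5, -3.0])
padding = torch.tensor([False, False, False, True])

# masking is numerical: add a very large negative value, then softmax
masked = scores.masked_fill(padding, torch.finfo(scores.dtype).min)
print(torch.softmax(masked, dim=-1))  # the padded slot gets ~0 attention weight
```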
@gante I didn't review this PR, but I see this is related to issue #21080, and therefore indirectly to PR #21853, which was reverted in #22093 due to some unexpected test failures (PT/TF, PT/Flax).
So before merging this PR, it's better to verify the cross tests, as well as the slow tests (always better).
The PR CI for #21853 was green (and also green when merged to main), but some tests started to fail in subsequent PRs. It's unclear to us why we didn't catch these in the PR CI though.
Thanks for the heads up @ydshieh! 🙏
I'll make sure all related slow tests (and the tests that failed after merging #21853 ) are passing before merging.
Thanks @gante ... I'm kind of new to this but will figure out how to verify/update the tests per your request.
The main problem I've run into though is newly-failing torch.fx tracing tests:
FAILED tests/models/gptj/test_modeling_gptj.py::GPTJModelTest::test_torch_fx - AssertionError: Couldn't trace module: Proxy object cannot be iterated. This can be attempted when the Proxy is used in a loop or as a *args or **kwargs function argument. See the torch.fx docs on pytorch.org for a more detailed explanation of what types of control flow can be traced, and check out the Proxy docstring for help troubleshooting Proxy iteration errors
FAILED tests/models/gptj/test_modeling_gptj.py::GPTJModelTest::test_torch_fx_output_loss - AssertionError: Couldn't trace module: Proxy object cannot be iterated. This can be attempted when the Proxy is used in a loop or as a *args or **kwargs function argument. See the torch.fx docs on pytorch.org for a more detailed explanation of what types of control flow can be traced, and check out the Proxy docstring for help troubleshooting Proxy iteration errors
I've tried some different variations of the logic but always end up with similar kinds of errors. I think it may stem from the index_select operation. Any pointers/ideas would be appreciated!
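For reference, here is a minimal standalone repro of this class of failure with plain torch.fx (a toy module, not the actual GPT-J code) -- iterating over a traced value is what triggers the Proxy-iteration error above:
```python
import torch
import torch.fx


class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # stand-in for the precomputed sincos table
        self.register_buffer("table", torch.arange(12.0).view(6, 2))

    def forward(self, position_ids):
        sincos = self.table[position_ids]  # data-dependent gather -> Proxy while tracing
        return [t.contiguous() for t in sincos]  # Python iteration over a Proxy


try:
    torch.fx.symbolic_trace(Toy())
except Exception as e:
    print(e)  # "Proxy object cannot be iterated. ..."
```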
Hey @njhill 👋
I've tried to fix the issue you mentioned with no success. It seems like we are between a rock and a hard place -- the changes you made, by design, make sincos dependent on the values of position_ids. In other words, sincos becomes a tensor impossible to predict at compile time with torch.fx, i.e. dynamic tensor. Ultimately, no matter how we rewrite the code (AFAIK), we will hit this barrier, causing the test to fail.
@sgugger @fxmarty is there a way we can make torch.fx ignore a function? (or do you have suggestions?) The change in this PR makes GPT-J correct in the presence of left-padding, but breaks compatibility with torch.fx 🙈
(Pastebin containing the code with modifications, working through the exceptions until I got stuck: https://pastebin.com/T0HpD07C)
Also cc @michaelbenayoun for torch fx.
Thanks @gante, it sounds like you followed a similar path to me w.r.t. trying different arrangements of the logic to get around this. I was guessing this couldn't be the only occurrence of this dynamic tensor issue in the library - is dynamic slicing done elsewhere and if so how does it work with torch.fx?
Hi @njhill,
The issue here (from what I could understand) seems to be that during tracing we do not have regular tensors but rather symbolic "proxies".
In the following code we are trying to call __iter__ on sincos, which is symbolic, so we do not know its length (again, not 100% sure, just guessing).
sincos = [t.contiguous() for t in sincos]
But the previous line is:
sincos = torch.split(sincos, sincos.shape[-1] // 2, dim=-1)
Meaning that the list has:
- 2 elements if sincos.shape[-1] is an even number
- 3 elements if sincos.shape[-1] is an odd number.
So could you try this:
# compute the number of chunks before splitting (sincos becomes a tuple after the split)
len_sincos = 2 + torch.remainder(torch.tensor(sincos.shape[-1]), 2)
sincos = torch.split(sincos, sincos.shape[-1] // 2, dim=-1)
sincos = [sincos[idx].contiguous() for idx in torch.arange(len_sincos)]
Tell me if this works!
Thanks @michaelbenayoun. You are right that this seems to be the fact that a symbolic proxy tensor is introduced somewhere, however I think that this stems from the tensor-based indexing here:
sincos = embed_positions[position_ids]
The proxy iterator errors are easy to circumvent but just move the problem until later where (inevitably?) the size of the proxy tensor is used for flow control. I've pushed a couple of small updates to the PR to demonstrate this... you can see the latest error in the tests here. As @gante pointed out above:
> Ultimately, no matter how we rewrite the code (AFAIK), we will hit this barrier, causing the test to fail.
Could we at least make this path conditional such that it isn't followed in the torch.fx case, i.e. declare that variable padding is unsupported in that case?
Hey @njhill -- I think the conditional path is a sensible idea, at least for now (we can always revisit it later). #22161 reports a similar problem on another in-demand model, so I would like to merge the fix as soon as possible 🤗
For context, other places in transformers use this sort of conditional path for torch.fx. Check here for an example.
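A rough, hypothetical sketch of what such a conditional path could look like (illustrative only, not the code this PR ends up with; the transformers.utils.is_torch_fx_proxy helper and the function name are the assumptions here) -- under tracing it falls back to contiguous positions, i.e. declares variable padding unsupported on that path:
```python
import torch
from transformers.utils import is_torch_fx_proxy


def select_position_embeddings(embed_positions, position_ids):
    # embed_positions: [max_positions, rotary_dim]; position_ids: [batch, seq_len]
    if is_torch_fx_proxy(position_ids) or torch.jit.is_tracing():
        # trace-friendly fallback: contiguous positions, so variable padding is
        # simply not supported on this path
        seq_len = position_ids.shape[-1]
        return embed_positions[None, :seq_len, :].expand(position_ids.shape[0], -1, -1)
    # eager path: pick the rows selected by the (possibly padding-aware) position_ids
    return embed_positions[position_ids]
```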
@njhill The HF tracer is supposed to keep track of "concrete" metadata during tracing to allow for that.
In this case, either this does not work with len, which is possible (I do not remember, tbh), or it means that an op does not support the meta device, hence breaking the concrete metadata accumulation.
Since in this case you are trying to check the rank of the tensor, could you try replacing len(tensor.shape) by tensor.ndim?
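(In eager mode the two are equivalent; .ndim just avoids routing the rank check through len(). A toy check on a hypothetical tensor:)
```python
import torch

x = torch.zeros(2, 3, 4, 5, 6)
assert len(x.shape) == x.ndim == 5  # same rank, but .ndim needs no len() call
```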
Thanks @michaelbenayoun... the len problem can be avoided by adding torch.fx.wrap('len'), which I'd done in the prior commit but removed in this latest commit since it seemed futile (just moving the error slightly later). So I was instead attempting to bypass the position_ids fix in the torch.fx case per this comment (so far unsuccessfully).
The problem encountered after working around the len problem can be seen here:
> if len(tensor.shape) == 5:
AssertionError: Couldn't trace module: symbolically traced variables cannot be used as inputs to control flow
Basically, this traced length value is then used in a control-flow condition.
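For completeness, a tiny standalone repro of this second flavour of error with plain torch.fx (again a toy module, not the GPT-J code):
```python
import torch
import torch.fx


class Toy(torch.nn.Module):
    def forward(self, x):
        # the comparison result is symbolic during tracing, so `if` cannot evaluate it
        if x.sum() > 0:
            return x * 2
        return x


try:
    torch.fx.symbolic_trace(Toy())
except Exception as e:
    print(e)  # "symbolically traced variables cannot be used as inputs to control flow"
```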
@gante @michaelbenayoun I've got torch.fx to work with the changes now by using torch.gather instead of tensor based indexing and adding a couple of new tensor methods to the metadata tracking in fx.py.
Also rebased on the latest main branch since some other CI tests started to fail, I think related to a recently merged, unrelated change.
I will look into the requested additional tests next when I get a chance.
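For anyone curious about the torch.gather part, a minimal sketch of the equivalence (shapes and names are illustrative, not the exact PR code) -- the gather expresses the same row selection as advanced indexing, but as a single traceable op:
```python
import torch

# toy sincos table: [max_positions, rotary_dim]
embed_positions = torch.arange(20.0).view(10, 2)
position_ids = torch.tensor([[3, 4, 5]])  # [batch, seq_len]

# advanced indexing (the form used before this change)
sincos_indexed = embed_positions[position_ids]  # [batch, seq_len, rotary_dim]

# the same selection expressed with torch.gather
repeated = embed_positions.unsqueeze(0).repeat(position_ids.shape[0], 1, 1)
index = position_ids.unsqueeze(-1).repeat(1, 1, embed_positions.shape[-1])
sincos_gathered = torch.gather(repeated, 1, index)

assert torch.equal(sincos_indexed, sincos_gathered)
```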
For our future reference, here's a snippet that shows that left-padding is fixed with these changes:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.bfloat16).to(0)
tok.pad_token = tok.eos_token
model.generation_config.pad_token_id = model.generation_config.eos_token_id
inputs_1 = tok(["The brown fox"], return_tensors="pt", padding=True).to(0)
out_1 = model(**inputs_1)
out_2 = model(**inputs_1)
position_ids = torch.cumsum(inputs_1.attention_mask, dim=-1) - 1
out_3 = model(**inputs_1, position_ids=position_ids + 8)
inputs_2 = tok(["The brown fox"], return_tensors="pt", padding="max_length", max_length=10).to(0)
out_4 = model(**inputs_2)
position_ids = torch.cumsum(inputs_2.attention_mask, dim=-1) - 1
position_ids.masked_fill_(inputs_2.attention_mask == 0, 1)
out_5 = model(**inputs_2, position_ids=position_ids)
# calls with the same inputs get the same logits
print(torch.max(torch.abs(out_1.logits[:, -1, :] - out_2.logits[:, -1, :]))) # tensor(0., device='cuda:0', grad_fn=<MaxBackward1>)
# changing the position_ids changes the logits
print(torch.max(torch.abs(out_1.logits[:, -1, :] - out_3.logits[:, -1, :]))) # tensor(0.0625, device='cuda:0', grad_fn=<MaxBackward1>)
# padding and not passing position ids -> incorrect position ids -> output differences
print(torch.max(torch.abs(out_1.logits[:, -1, :] - out_4.logits[:, -1, :]))) # tensor(0.0625, device='cuda:0', grad_fn=<MaxBackward1>)
# left-padding has a much smaller impact (NOTE: setting e.g. `max_length=20` will cause the next diff to be non-zero.
# Numerical masking is not perfect :) )
print(torch.max(torch.abs(out_1.logits[:, -1, :] - out_5.logits[:, -1, :]))) # tensor(0., device='cuda:0', grad_fn=<MaxBackward1>)
The failing CI was fixed in this merged PR, merging.
@njhill fantastic work with the torch.fx, I really appreciated your effort 🤗
Thanks @gante, glad I was able to contribute. Thank you for your fast responses and for all the great work you and team do.
This PR isn't backward compatible. It breaks with pytorch-1.8:
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/gptj/modeling_gptj.py", line 63, in <module>
E @torch.fx.wrap
E AttributeError: module 'torch' has no attribute 'fx'
not sure if you want to revert this or have an idea how to overcome this quickly.
@stas00
FYI, see #22291, although that PR and this one were not directly related when they were first opened.
ok, the deepspeed CI is running pt-1.8 - how do we solve that then?
I just saw https://github.com/microsoft/DeepSpeed/pull/3082, opened 2 hours ago. I am not sure how it will go, but I will try to follow up tomorrow morning.
oh, ok, I guess everything is fine then. thank you for the heads up, @ydshieh
it still fails with pt-1.9.1:
- you need import torch.fx (thanks @mrwyattii)
- it then fails with:
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/gptj/modeling_gptj.py", line 61, in create_sinusoidal_positions
E return torch.concat((torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)), dim=1)
E AttributeError: module 'torch' has no attribute 'concat'
Oops, I guess we should use torch.cat() instead
and it fails w/o import torch.fx
E File "/mnt/nvme0/code/huggingface/transformers-master/examples/pytorch/language-modeling/run_clm.py", line 412, in main
E model = AutoModelForCausalLM.from_pretrained(
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/auto/auto_factory.py", line 470, in from_pretrained
E model_class = _get_model_class(config, cls._model_mapping)
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/auto/auto_factory.py", line 360, in _get_model_class
E supported_models = model_mapping[type(config)]
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/auto/auto_factory.py", line 602, in __getitem__
E return self._load_attr_from_module(model_type, model_name)
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/auto/auto_factory.py", line 616, in _load_attr_from_module
E return getattribute_from_module(self._modules[module_name], attr)
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/auto/auto_factory.py", line 561, in getattribute_from_module
E if hasattr(module, attr):
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/utils/import_utils.py", line 1109, in __getattr__
E module = self._get_module(self._class_to_module[name])
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/utils/import_utils.py", line 1121, in _get_module
E raise RuntimeError(
E RuntimeError: Failed to import transformers.models.gptj.modeling_gptj because of the following error (look up to see its traceback):
E module 'torch' has no attribute 'fx'
so 2 fixes at least. thank you!
I confirm that it works with torch.cat
perhaps use torch.concat but add an alias:
# bc for pt<1.10
if not hasattr(torch, "concat"):
    torch.concat = torch.cat
stashed somewhere in utils?
import torch.fx is a must - even with pt-1.10 it won't work w/o it.
@njhill, are you on top of fixing this?
This is a bit urgent since the DeepSpeed CI uses our bleeding edge to test DeepSpeed's bleeding edge on live CI, and currently their CI is broken because of this.