transformers
Fix position embeddings for GPT-J and CodeGen
What does this PR do?
Identical inputs to GPT-J and CodeGen models will currently generate different outputs if they are padded differently (for example in a batch of variable sequence lengths).
This PR reverts the recent change #21869 that removes GPT-J position_ids, and then applies similar changes as were done for GPT-J XLA in #17986.
~One copy of the precomputed position embeddings is shared between all of the layers.~
Related issue: #21080
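For background, a minimal sketch (illustrative only, not code from this PR) of why the padding matters: when no position_ids are passed, positions are assigned 0..seq_len-1, so left-padding shifts the real tokens' positions, whereas deriving position_ids from the attention mask keeps them stable:
```python
import torch

# hypothetical left-padded batch: 0 marks padding in the attention mask
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

# naive positions ignore the padding, so the real tokens of row 0 are shifted by 2
naive_position_ids = torch.arange(attention_mask.shape[-1]).expand_as(attention_mask)

# positions derived from the mask start at 0 at the first real token
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)

print(naive_position_ids)  # tensor([[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]])
print(position_ids)        # tensor([[1, 1, 0, 1, 2], [0, 1, 2, 3, 4]])
```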
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@gante
I have tested this with my own code/use case but wanted to check that there is interest in the contribution before also updating any applicable unit tests.
I also wonder whether there should be a universal test applied to all models that just tests the same input with different amounts of padding and makes sure that the output is identical?
@njhill and yes, the contribution is deeply appreciated! 🙏
Be mindful that this will not result in making the outputs left-padding agnostic. As in all models, the padding is a numerical mask. In FP32, it is almost left-padding agnostic, but in FP16/BF16/INT8 the left-padding may introduce changes :)
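To illustrate what "the padding is a numerical mask" means in practice, a toy sketch (not the model's actual attention code): padded key positions get a very large negative bias before the softmax, so they receive effectively zero weight, and only the rounding behaviour of the chosen dtype decides how close to padding-agnostic the output is:
```python
import torch

# toy attention scores for one query over 4 key positions; the last one is padding
scores = torch.tensor([2.0, 1.0, 0.5, -3.0])
padding = torch.tensor([False, False, False, True])

# masking is numerical: add a very large negative value, then softmax
masked = scores.masked_fill(padding, torch.finfo(scores.dtype).min)
print(torch.softmax(masked, dim=-1))  # the padded slot gets ~0 attention weight
```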
@gante I didn't review this PR, but I see this is related to issue #21080, and therefore indirectly to PR #21853, which was reverted in #22093 due to some unexpected test failures (PT/TF, PT/Flax).
So before merging this PR, it's better to verify the cross tests, as well as the slow tests (always better).
The PR CI for #21853 was green (and also green when merged to main), but some tests started to fail in subsequent PRs. It's unclear to us why we didn't catch these in the PR CI though.
Thanks for the heads up @ydshieh! 🙏
I'll make sure all related slow tests (and the tests that failed after merging #21853 ) are passing before merging.
Thanks @gante ... I'm kind of new to this but will figure out how to verify/update the tests per your request.
The main problem I've run into though is newly-failing torch.fx tracing tests:
FAILED tests/models/gptj/test_modeling_gptj.py::GPTJModelTest::test_torch_fx - AssertionError: Couldn't trace module: Proxy object cannot be iterated. This can be attempted when the Proxy is used in a loop or as a *args or **kwargs function argument. See the torch.fx docs on pytorch.org for a more detailed explanation of what types of control flow can be traced, and check out the Proxy docstring for help troubleshooting Proxy iteration errors
FAILED tests/models/gptj/test_modeling_gptj.py::GPTJModelTest::test_torch_fx_output_loss - AssertionError: Couldn't trace module: Proxy object cannot be iterated. This can be attempted when the Proxy is used in a loop or as a *args or **kwargs function argument. See the torch.fx docs on pytorch.org for a more detailed explanation of what types of control flow can be traced, and check out the Proxy docstring for help troubleshooting Proxy iteration errors
I've tried some different variations of the logic but always end up with similar kinds of errors. I think it may stem from the index_select operation. Any pointers/ideas would be appreciated!
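For reference, here is a minimal standalone repro of this class of failure with plain torch.fx (a toy module, not the actual GPT-J code) -- iterating over a traced value is what triggers the Proxy-iteration error above:
```python
import torch
import torch.fx


class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # stand-in for the precomputed sincos table
        self.register_buffer("table", torch.arange(12.0).view(6, 2))

    def forward(self, position_ids):
        sincos = self.table[position_ids]  # data-dependent gather -> Proxy while tracing
        return [t.contiguous() for t in sincos]  # Python iteration over a Proxy


try:
    torch.fx.symbolic_trace(Toy())
except Exception as e:
    print(e)  # "Proxy object cannot be iterated. ..."
```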
Hey @njhill 👋
I've tried to fix the issue you mentioned with no success. It seems like we are between a rock and a hard place -- the changes you made, by design, make sincos dependent on the values of position_ids. In other words, sincos becomes a tensor impossible to predict at compile time with torch.fx, i.e. dynamic tensor. Ultimately, no matter how we rewrite the code (AFAIK), we will hit this barrier, causing the test to fail.
@sgugger @fxmarty is there a way we can make torch.fx ignore a function? (or do you have suggestions?) The change in this PR makes GPT-J correct in the presence of left-padding, but breaks compatibility with torch.fx 🙈
(Pastebin containing the code with modifications, working through the exceptions until I got stuck: https://pastebin.com/T0HpD07C)
Also cc @michaelbenayoun for torch fx.
Thanks @gante, it sounds like you followed a similar path to me w.r.t. trying different arrangements of the logic to get around this. I was guessing this couldn't be the only occurrence of this dynamic tensor issue in the library - is dynamic slicing done elsewhere and if so how does it work with torch.fx?
Hi @njhill,
The issue here (from what I could understand) seems to be that during tracing we do not have regular tensors but rather symbolic "proxies".
In the following code we are trying to call __iter__ on sincos, which is symbolic, so we do not know its length (again, not 100% sure, just guessing).
sincos = [t.contiguous() for t in sincos]
But the previous line is:
sincos = torch.split(sincos, sincos.shape[-1] // 2, dim=-1)
Meaning that the list has:
- 2 elements if sincos.shape[-1] is an even number
- 3 elements if sincos.shape[-1] is an odd number.
So could you try this:
# compute the number of chunks before splitting (sincos becomes a tuple after the split)
len_sincos = 2 + torch.remainder(torch.tensor(sincos.shape[-1]), 2)
sincos = torch.split(sincos, sincos.shape[-1] // 2, dim=-1)
sincos = [sincos[idx].contiguous() for idx in torch.arange(len_sincos)]
Tell me if this works!
Thanks @michaelbenayoun. You are right that this seems to be the fact that a symbolic proxy tensor is introduced somewhere, however I think that this stems from the tensor-based indexing here:
sincos = embed_positions[position_ids]
The proxy iterator errors are easy to circumvent but just move the problem until later where (inevitably?) the size of the proxy tensor is used for flow control. I've pushed a couple of small updates to the PR to demonstrate this... you can see the latest error in the tests here. As @gante pointed out above:
> Ultimately, no matter how we rewrite the code (AFAIK), we will hit this barrier, causing the test to fail.
Could we at least make this path conditional such that it isn't followed in the torch.fx case, i.e. declare that variable padding is unsupported in that case?
Hey @njhill -- I think the conditional path is a sensible idea, at least for now (we can always revisit it later). #22161 reports a similar problem on another in-demand model, so I would like to merge the fix as soon as possible 🤗
For context, other places in transformers use this sort of conditional path for torch.fx. Check here for an example.
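A rough, hypothetical sketch of what such a conditional path could look like (illustrative only, not the code this PR ends up with; the transformers.utils.is_torch_fx_proxy helper and the function name are the assumptions here) -- under tracing it falls back to contiguous positions, i.e. declares variable padding unsupported on that path:
```python
import torch
from transformers.utils import is_torch_fx_proxy


def select_position_embeddings(embed_positions, position_ids):
    # embed_positions: [max_positions, rotary_dim]; position_ids: [batch, seq_len]
    if is_torch_fx_proxy(position_ids) or torch.jit.is_tracing():
        # trace-friendly fallback: contiguous positions, so variable padding is
        # simply not supported on this path
        seq_len = position_ids.shape[-1]
        return embed_positions[None, :seq_len, :].expand(position_ids.shape[0], -1, -1)
    # eager path: pick the rows selected by the (possibly padding-aware) position_ids
    return embed_positions[position_ids]
```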
@njhill The HF tracer is supposed to keep track of "concrete" metadata during tracing to allow for that.
In this case, either this does not work with len, which is possible (I do not remember, tbh), or it means that an op does not support the meta device, hence breaking the concrete metadata accumulation.
Since in this case you are trying to check the rank of the tensor, could you try replacing len(tensor.shape) by tensor.ndim?
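(In eager mode the two are equivalent; .ndim just avoids routing the rank check through len(). A toy check on a hypothetical tensor:)
```python
import torch

x = torch.zeros(2, 3, 4, 5, 6)
assert len(x.shape) == x.ndim == 5  # same rank, but .ndim needs no len() call
```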
Thanks @michaelbenayoun... the len problem can be avoided by adding torch.fx.wrap('len'), which I'd done in the prior commit but removed in this latest commit since it seemed futile (just moving the error slightly later). So I was instead attempting to bypass the position_ids fix in the torch.fx case per this comment (so far unsuccessfully).
The problem encountered after working around the len problem can be seen here:
> if len(tensor.shape) == 5:
AssertionError: Couldn't trace module: symbolically traced variables cannot be used as inputs to control flow
Basically, this traced length value is then used in a control-flow condition.
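For completeness, a tiny standalone repro of this second flavour of error with plain torch.fx (again a toy module, not the GPT-J code):
```python
import torch
import torch.fx


class Toy(torch.nn.Module):
    def forward(self, x):
        # the comparison result is symbolic during tracing, so `if` cannot evaluate it
        if x.sum() > 0:
            return x * 2
        return x


try:
    torch.fx.symbolic_trace(Toy())
except Exception as e:
    print(e)  # "symbolically traced variables cannot be used as inputs to control flow"
```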
@gante @michaelbenayoun I've got torch.fx to work with the changes now by using torch.gather instead of tensor based indexing and adding a couple of new tensor methods to the metadata tracking in fx.py.
Also rebased on the latest main branch since some other CI tests started to fail, I think related to a recently merged, unrelated change.
I will look into the requested additional tests next when I get a chance.
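For anyone curious about the torch.gather part, a minimal sketch of the equivalence (shapes and names are illustrative, not the exact PR code) -- the gather expresses the same row selection as advanced indexing, but as a single traceable op:
```python
import torch

# toy sincos table: [max_positions, rotary_dim]
embed_positions = torch.arange(20.0).view(10, 2)
position_ids = torch.tensor([[3, 4, 5]])  # [batch, seq_len]

# advanced indexing (the form used before this change)
sincos_indexed = embed_positions[position_ids]  # [batch, seq_len, rotary_dim]

# the same selection expressed with torch.gather
repeated = embed_positions.unsqueeze(0).repeat(position_ids.shape[0], 1, 1)
index = position_ids.unsqueeze(-1).repeat(1, 1, embed_positions.shape[-1])
sincos_gathered = torch.gather(repeated, 1, index)

assert torch.equal(sincos_indexed, sincos_gathered)
```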
For our future reference, here's a snippet that shows that left-padding is fixed with these changes:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.bfloat16).to(0)
tok.pad_token = tok.eos_token
model.generation_config.pad_token_id = model.generation_config.eos_token_id
inputs_1 = tok(["The brown fox"], return_tensors="pt", padding=True).to(0)
out_1 = model(**inputs_1)
out_2 = model(**inputs_1)
position_ids = torch.cumsum(inputs_1.attention_mask, dim=-1) - 1
out_3 = model(**inputs_1, position_ids=position_ids + 8)
inputs_2 = tok(["The brown fox"], return_tensors="pt", padding="max_length", max_length=10).to(0)
out_4 = model(**inputs_2)
position_ids = torch.cumsum(inputs_2.attention_mask, dim=-1) - 1
position_ids.masked_fill_(inputs_2.attention_mask == 0, 1)
out_5 = model(**inputs_2, position_ids=position_ids)
# calls with the same inputs get the same logits
print(torch.max(torch.abs(out_1.logits[:, -1, :] - out_2.logits[:, -1, :]))) # tensor(0., device='cuda:0', grad_fn=<MaxBackward1>)
# changing the position_ids changes the logits
print(torch.max(torch.abs(out_1.logits[:, -1, :] - out_3.logits[:, -1, :]))) # tensor(0.0625, device='cuda:0', grad_fn=<MaxBackward1>)
# padding and not passing position ids -> incorrect position ids -> output differences
print(torch.max(torch.abs(out_1.logits[:, -1, :] - out_4.logits[:, -1, :]))) # tensor(0.0625, device='cuda:0', grad_fn=<MaxBackward1>)
# left-padding has a much smaller impact (NOTE: setting e.g. `max_length=20` will cause the next diff to be non-zero.
# Numerical masking is not perfect :) )
print(torch.max(torch.abs(out_1.logits[:, -1, :] - out_5.logits[:, -1, :]))) # tensor(0., device='cuda:0', grad_fn=<MaxBackward1>)
The failing CI was fixed in this merged PR, merging.
@njhill fantastic work with the torch.fx, I really appreciated your effort 🤗
Thanks @gante, glad I was able to contribute. Thank you for your fast responses and for all the great work you and team do.
This PR isn't backward compatible. It breaks with pytorch-1.8:
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/gptj/modeling_gptj.py", line 63, in <module>
E @torch.fx.wrap
E AttributeError: module 'torch' has no attribute 'fx'
not sure if you want to revert this or have an idea how to overcome this quickly.
@stas00
FYI, see #22291, although that PR and this one were not directly related when they were first opened.
ok, the deepspeed CI is running pt-1.8 - how do we solve that then?
I just saw https://github.com/microsoft/DeepSpeed/pull/3082, opened 2 hours ago. I am not sure how it will go, but I will try to follow up tomorrow morning.
oh, ok, I guess everything is fine then. thank you for the heads up, @ydshieh
it still fails with pt-1.9.1:
- you need import torch.fx (thanks @mrwyattii)
- it then fails with:
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/gptj/modeling_gptj.py", line 61, in create_sinusoidal_positions
E return torch.concat((torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)), dim=1)
E AttributeError: module 'torch' has no attribute 'concat'
Oops, I guess we should use torch.cat() instead
and it fails w/o import torch.fx
E File "/mnt/nvme0/code/huggingface/transformers-master/examples/pytorch/language-modeling/run_clm.py", line 412, in main
E model = AutoModelForCausalLM.from_pretrained(
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/auto/auto_factory.py", line 470, in from_pretrained
E model_class = _get_model_class(config, cls._model_mapping)
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/auto/auto_factory.py", line 360, in _get_model_class
E supported_models = model_mapping[type(config)]
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/auto/auto_factory.py", line 602, in __getitem__
E return self._load_attr_from_module(model_type, model_name)
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/auto/auto_factory.py", line 616, in _load_attr_from_module
E return getattribute_from_module(self._modules[module_name], attr)
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/auto/auto_factory.py", line 561, in getattribute_from_module
E if hasattr(module, attr):
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/utils/import_utils.py", line 1109, in __getattr__
E module = self._get_module(self._class_to_module[name])
E File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/utils/import_utils.py", line 1121, in _get_module
E raise RuntimeError(
E RuntimeError: Failed to import transformers.models.gptj.modeling_gptj because of the following error (look up to see its traceback):
E module 'torch' has no attribute 'fx'
so 2 fixes at least. thank you!
I confirm that it works with torch.cat
perhaps use torch.concat but add an alias:
# bc for pt<1.10
if not hasattr(torch, "concat"):
    torch.concat = torch.cat
stashed somewhere in utils?
import torch.fx is a must - even with pt-1.10 it won't work w/o it.
@njhill, are you on top of fixing this?
This is a bit urgent since the DeepSpeed CI uses our bleeding edge to test DeepSpeed's bleeding edge on live CI, and currently their CI is broken because of this.