NaN in XGLM Softmax with FP16
System Info
- `transformers` version: 4.21.0.dev0
- Platform: Linux-5.3.0-1017-x86_64-with-glibc2.27
- Python version: 3.9.13
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes, 4x V100
Who can help?
@ydshieh @patrickvonplaten
Reproduction
Most likely related to the fixes in #17437.
I am using an example similar to `test_batched_nan_fp16` in `test_modeling_opt.py`, but for an XGLM model. The only difference from that test is the `torch.cuda.amp.autocast` usage, which I found necessary to perform inference (otherwise I would get an error saying "expected scalar type Float but found Half" from the forward pass of XGLM).
```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Tested with xglm-564M and xglm-7.5B (the latter using `infer_auto_device_map` and
# `load_checkpoint_and_dispatch` from `accelerate`; see the sketch after this snippet).
tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("facebook/xglm-564M", torch_dtype=torch.float16, use_cache=True).cuda()

batch = tokenizer(["Who are you?", "Joe Biden is the president of"], padding=True, return_tensors="pt")
input_ids = batch["input_ids"].cuda()
attention_mask = batch["attention_mask"].cuda()

with torch.no_grad():
    with torch.cuda.amp.autocast():
        outputs = model(input_ids, attention_mask=attention_mask)

assert not torch.isnan(outputs.logits[0]).any().item()  # Raises an AssertionError
```
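For completeness, the 7.5B checkpoint was dispatched across the GPUs with `accelerate` roughly as sketched below; the checkpoint path, memory limits, and exact arguments are illustrative placeholders rather than the exact values used.

```python
import torch
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton without allocating weights.
config = AutoConfig.from_pretrained("facebook/xglm-7.5B")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Split the decoder layers across the 4 V100s (memory limits are illustrative).
device_map = infer_auto_device_map(
    model,
    max_memory={i: "14GiB" for i in range(4)},
    no_split_module_classes=["XGLMDecoderLayer"],
    dtype=torch.float16,
)

# "path/to/xglm-7.5B" is a placeholder for the locally downloaded checkpoint folder.
model = load_checkpoint_and_dispatch(
    model,
    "path/to/xglm-7.5B",
    device_map=device_map,
    dtype=torch.float16,
)
```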
Expected behavior
I would expect the model to produce normal (non-NaN) logits when using FP16. I spotted this bug while investigating garbage generations during batched inference, despite using left padding and a valid attention mask.
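Roughly, this is the kind of batched call where the garbage generations showed up (continuing from the snippet above; the exact generation arguments are illustrative):

```python
# Continues from the reproduction snippet (model, tokenizer, input_ids, attention_mask).
with torch.no_grad():
    with torch.cuda.amp.autocast():
        generated = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=20,
        )
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```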
cc @patil-suraj for XGLM
Judging from the left padding (`padding_side="left"`) in the code snippet, I would guess this is similar to issue #17433, and the fix in #17437 should resolve it. @gsarti Would you like to try it and maybe also open a PR?
P.S. I was actually wondering yesterday whether I should apply #17437 to all models, and tried a few, like GPT-2 and BART, which are fine. Maybe it is indeed better to do it for all models.
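The kind of change being discussed is, roughly, to keep masked attention scores at the smallest finite value of the computation dtype instead of letting them reach `-inf`, so that a fully padded row still yields a finite softmax under FP16. A minimal sketch of that pattern (illustrative only, not the literal diff in #17437 or the XGLM PR):

```python
import torch

def fp16_safe_masked_softmax(attn_weights: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # `attention_mask` is an additive mask: 0 for visible positions and a large
    # negative value for masked ones. Clamping to the dtype's minimum finite value
    # keeps fully masked rows away from -inf, so softmax returns a uniform row
    # instead of NaN in fp16.
    dtype_min = torch.finfo(attn_weights.dtype).min
    attn_weights = torch.clamp(attn_weights + attention_mask, min=dtype_min)
    return torch.nn.functional.softmax(attn_weights, dim=-1)


# Toy check on GPU (matching the setup above): a fully masked row stays finite in fp16.
scores = torch.randn(1, 4, dtype=torch.float16, device="cuda")
mask = torch.full((1, 4), torch.finfo(torch.float16).min, dtype=torch.float16, device="cuda")
probs = fp16_safe_masked_softmax(scores, mask)
assert not torch.isnan(probs).any()
```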
Opened PR #18057 with the suggested fix, @ydshieh; it should be good to go!
Can we close this one? cc @ydshieh
I re-opened that PR #18057. Let's see if @gsarti would like to continue the work; otherwise, I can take it. The necessary fix is minimal.