NaN in XGLM Softmax with FP16
System Info
- `transformers` version: 4.21.0.dev0
- Platform: Linux-5.3.0-1017-x86_64-with-glibc2.27
- Python version: 3.9.13
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes, 4x V100
Who can help?
@ydshieh @patrickvonplaten
Reproduction
Most likely related to the fixes in #17437.
I am using an example similar to `test_batched_nan_fp16` in `test_modeling_opt.py`, but for an XGLM model. The only difference from that test is the `torch.cuda.amp.autocast` usage, which I found necessary to perform inference (otherwise I would get an error saying "expected scalar type Float but found Half" from the forward pass of XGLM).
```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Tested with xglm-564M and xglm-7.5B (the latter using `infer_auto_device_map` and
# `load_checkpoint_and_dispatch` from `accelerate`; see the sketch after this snippet).
tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("facebook/xglm-564M", torch_dtype=torch.float16, use_cache=True).cuda()

batch = tokenizer(["Who are you?", "Joe Biden is the president of"], padding=True, return_tensors="pt")
input_ids = batch["input_ids"].cuda()
attention_mask = batch["attention_mask"].cuda()

with torch.no_grad():
    with torch.cuda.amp.autocast():
        outputs = model(input_ids, attention_mask=attention_mask)

assert not torch.isnan(outputs.logits[0]).any().item()  # Raises an AssertionError
```
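For completeness, the 7.5B checkpoint was dispatched across the GPUs with `accelerate` roughly as sketched below; the checkpoint path, memory limits, and exact arguments are illustrative placeholders rather than the exact values used.

```python
import torch
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton without allocating weights.
config = AutoConfig.from_pretrained("facebook/xglm-7.5B")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Split the decoder layers across the 4 V100s (memory limits are illustrative).
device_map = infer_auto_device_map(
    model,
    max_memory={i: "14GiB" for i in range(4)},
    no_split_module_classes=["XGLMDecoderLayer"],
    dtype=torch.float16,
)

# "path/to/xglm-7.5B" is a placeholder for the locally downloaded checkpoint folder.
model = load_checkpoint_and_dispatch(
    model,
    "path/to/xglm-7.5B",
    device_map=device_map,
    dtype=torch.float16,
)
```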
Expected behavior
I would expect the model to produce normal (non-NaN) logits when using FP16. I spotted this bug while investigating garbage generations during batched inference, despite using left padding and a valid attention mask.
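Roughly, this is the kind of batched call where the garbage generations showed up (continuing from the snippet above; the exact generation arguments are illustrative):

```python
# Continues from the reproduction snippet (model, tokenizer, input_ids, attention_mask).
with torch.no_grad():
    with torch.cuda.amp.autocast():
        generated = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=20,
        )
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```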
cc @patil-suraj for XGLM
Judging from the left padding (`padding_side="left"`) in the code snippet, I would guess this is similar to issue #17433, and the fix in #17437 should resolve it. @gsarti Would you like to try it and maybe also open a PR?
P.S. I was actually wondering yesterday whether I should apply #17437 to all models, and tried a few, like GPT-2 and BART, which are fine. Maybe it is indeed better to do it for all models.
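The kind of change being discussed is, roughly, to keep masked attention scores at the smallest finite value of the computation dtype instead of letting them reach `-inf`, so that a fully padded row still yields a finite softmax under FP16. A minimal sketch of that pattern (illustrative only, not the literal diff in #17437 or the XGLM PR):

```python
import torch

def fp16_safe_masked_softmax(attn_weights: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # `attention_mask` is an additive mask: 0 for visible positions and a large
    # negative value for masked ones. Clamping to the dtype's minimum finite value
    # keeps fully masked rows away from -inf, so softmax returns a uniform row
    # instead of NaN in fp16.
    dtype_min = torch.finfo(attn_weights.dtype).min
    attn_weights = torch.clamp(attn_weights + attention_mask, min=dtype_min)
    return torch.nn.functional.softmax(attn_weights, dim=-1)


# Toy check on GPU (matching the setup above): a fully masked row stays finite in fp16.
scores = torch.randn(1, 4, dtype=torch.float16, device="cuda")
mask = torch.full((1, 4), torch.finfo(torch.float16).min, dtype=torch.float16, device="cuda")
probs = fp16_safe_masked_softmax(scores, mask)
assert not torch.isnan(probs).any()
```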
Opened PR #18057 with the suggested fix, @ydshieh; it should be good to go!
Can we close this one? cc @ydshieh
I re-opened that PR #18057. Let's see if @gsarti would like to continue the work; otherwise, I can take it. The necessary fix is minimal.