NaN when training t5-large with bf16 on multiple GPUs
System Info
- `transformers` version: 4.20.1
- Platform: Linux-5.15.0-1017-gcp-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.12.0+cu113 (True)
- Tensorflow version (GPU?): 2.4.4 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
Who can help?
@patrickvonplaten
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I'm getting `nan` immediately when training t5-large using bfloat16 on multiple GPUs, but when I run the same script on a single GPU it's fine. I've made a small example below, which I'm running on a machine with 2 A100s. If I do `CUDA_VISIBLE_DEVICES=0 python script.py` the loss is fine, but if I just do `python script.py` I get `nan` from the first iteration.
```python
from typing import List, Tuple

import torch
from torch.utils.data import Dataset, DataLoader

import transformers


class MyDataset(Dataset):
    def __init__(
        self,
        data: List[List[str]],
        tokenizer: transformers.PreTrainedTokenizerFast,
    ) -> None:
        super().__init__()
        self._data = data
        self._tokenizer = tokenizer

    def __len__(
        self,
    ) -> int:
        return len(self._data)

    def __getitem__(
        self,
        index: int,
    ) -> List[str]:
        return self._data[index]

    def collate_fn(
        self,
        batch: List[List[str]],
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        prompts = [b[0] for b in batch]
        targets = [b[1] for b in batch]

        prompts_tokenized = self._tokenizer(
            text=prompts,
            padding=True,
            return_tensors="pt",
            return_attention_mask=True,
        )
        prompts_input_ids = prompts_tokenized["input_ids"]
        prompts_attention_mask = prompts_tokenized["attention_mask"]

        targets_tokenized = self._tokenizer(
            text=targets,
            padding=True,
            return_tensors="pt",
            return_attention_mask=True,
        )
        targets_input_ids = targets_tokenized["input_ids"]
        targets_attention_mask = targets_tokenized["attention_mask"]

        return (
            prompts_input_ids,
            prompts_attention_mask,
            targets_input_ids,
            targets_attention_mask,
        )


if __name__ == "__main__":
    model = transformers.T5ForConditionalGeneration.from_pretrained(
        "t5-large",
    )
    tokenizer = transformers.T5TokenizerFast.from_pretrained(
        "t5-large",
    )

    device = (
        torch.device("cuda:0")
        if torch.cuda.is_available()
        else torch.device("cpu")
    )
    multi_gpu = torch.cuda.device_count() > 1
    if multi_gpu:
        model = torch.nn.DataParallel(model)
    model = model.to(device)

    optimizer = transformers.Adafactor(
        params=model.parameters(),
        lr=1e-4,
        scale_parameter=False,
        relative_step=False,
    )
    grad_scaler = torch.cuda.amp.GradScaler(
        enabled=True,
    )

    my_data = [
        [f"This is sentence {i}.", f"This is sentence {i + 1}."]
        for i in range(1000000)
    ]
    dataset = MyDataset(
        data=my_data,
        tokenizer=tokenizer,
    )
    dataloader = DataLoader(
        dataset=dataset,
        batch_size=8,
        shuffle=True,
        collate_fn=dataset.collate_fn,
    )

    for batch in dataloader:
        with torch.autocast(
            enabled=True,
            device_type=device.type,
            dtype=torch.bfloat16,
        ):
            batch = [b.to(device) for b in batch]
            (
                prompts_input_ids,
                prompts_attention_mask,
                targets_input_ids,
                targets_attention_mask,
            ) = batch

            loss = model(
                input_ids=prompts_input_ids,
                attention_mask=prompts_attention_mask,
                labels=targets_input_ids,
            ).loss
            if multi_gpu:
                loss = loss.mean()

        grad_scaler.scale(loss).backward()
        grad_scaler.step(optimizer)
        grad_scaler.update()
        optimizer.zero_grad()

        print(f"Loss = {loss.item()}")
```
Expected behavior
No NaNs when training t5-large using bfloat16 on multiple GPUs.
@LysandreJik perhaps you could suggest someone who can help with this please?
I believe @stas00 has some experience around bfloat16 and NaNs and may have an idea of where the issue may be coming from.
I haven't tried t5-large yet; I tested your script and it works fine with t5-small. I need to find a box with a few large GPUs to test t5-large.
Meanwhile, we should revisit the scaling.
The main benefit of using bf16 over fp16 is that there is very little risk of overflow: bf16 has the same numerical range as fp32, so no loss scaling is needed here.
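To illustrate, a minimal sketch of the training step with the scaler dropped entirely (reusing the variable names from the reproduction script above; illustrative only, not a drop-in fix):

```python
# Illustrative bf16 step without GradScaler, reusing names from the script above.
for batch in dataloader:
    batch = [b.to(device) for b in batch]
    prompts_input_ids, prompts_attention_mask, targets_input_ids, _ = batch

    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        loss = model(
            input_ids=prompts_input_ids,
            attention_mask=prompts_attention_mask,
            labels=targets_input_ids,
        ).loss
        if multi_gpu:
            loss = loss.mean()

    # No scale()/step()/update() on a scaler: bf16 shares fp32's exponent
    # range, so gradients are not expected to overflow.
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```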
But perhaps we are hitting underflow here. There is a special tool we have for that - you can try plugging it in and observing where the underflow (most likely) is happening: https://huggingface.co/docs/transformers/debugging#underflow-and-overflow-detection
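For reference, attaching the detector is a one-liner before the training loop (a minimal sketch based on the linked docs):

```python
from transformers.debug_utils import DebugUnderflowOverflow

# Registers forward hooks on every submodule and reports the frames around
# the first batch where an inf/nan shows up in activations or weights.
debug_overflow = DebugUnderflowOverflow(model)
```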
But then underflow would just lead to no learning, not really NaNs, I think.
I will try to experiment more with it once I'm able to run t5-large.
Thanks @stas00 - I had a go at using the underflow/overflow detection tool, but actually, when I switched from DataParallel to DistributedDataParallel, I didn't get NaNs with this toy example! I'll try some experiments with real data next week and let you know if this solves it.
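For anyone hitting the same thing, a minimal sketch of that DistributedDataParallel setup (assuming a `torchrun --nproc_per_node=2 script.py` launch and reusing `dataset` from the reproduction script above; illustrative, not the exact training code):

```python
# Minimal DDP sketch; each process drives one GPU instead of DataParallel
# replicating the model inside a single process.
import os

import torch
import torch.distributed as dist
import transformers
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

model = transformers.T5ForConditionalGeneration.from_pretrained("t5-large").to(device)
model = DDP(model, device_ids=[local_rank])

# Each process sees its own shard of the data, so there is no
# DataParallel-style scatter/gather inside the forward pass.
sampler = DistributedSampler(dataset, shuffle=True)
dataloader = DataLoader(
    dataset,
    batch_size=8,
    sampler=sampler,
    collate_fn=dataset.collate_fn,
)
```

With DDP each process computes a scalar loss locally, so the `loss.mean()` reduction that DataParallel required is no longer needed.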
oh, great, then I don't need to look for a set of large GPUs :) Thank you for this update, @harshil-shah!
Indeed, please do let us know when you get a chance to experiment.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.