[BUG] INFLIGHT parameters after evaluation
Describe the bug I adapted my training process from the Hugging Face trainer.py, so most of my trainer is similar to theirs. My model includes a language model and an external module that learns additional parameters using information from the main model. I put the main and external parameters in separate groups in the optimizer. While testing the code on larger models with DeepSpeed, I hit an assertion error after 50 training steps and 1 evaluation round. The error concerns the embedding matrix, which remains in "INFLIGHT" status after training resumes, while all other parameters are "AVAILABLE".
AssertionError: {'id': 786, 'status': 'INFLIGHT', 'numel': 38603520, 'ds_numel': 38603520, 'shape': (50265, 768), 'ds_shape': (50265, 768), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': {3}}
The program runs fine if I simply train a language model without an external module.
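The two parameter groups are set up roughly like this (a simplified sketch; the module names, sizes, and learning rates below are placeholders, not my actual code):
import torch
import torch.nn as nn

# Placeholder stand-ins for the real modules; names and sizes are illustrative only.
lm = nn.Linear(768, 768)              # stands in for the main language model
external_module = nn.Linear(768, 16)  # stands in for the external module

# Main and external parameters go into separate optimizer groups,
# each with its own hyperparameters (the learning rates here are placeholders).
optimizer = torch.optim.AdamW(
    [
        {"params": lm.parameters(), "lr": 1e-5},
        {"params": external_module.parameters(), "lr": 1e-4},
    ]
)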
To Reproduce I can provide the code if necessary, but I hope I can get some help on how to further debug this issue :)
Expected behavior I expected training to continue without errors after evaluation.
System info (please complete the following information):
- OS: Linux 4.18.0-425.13.1.el8_7.x86_64
- GPU count and types: 1 machine with 2x A100s
- Python version: python 3.8
Launcher context I am using deepspeed to launch my program.
I am experiencing the same issue on a single node using Lightning and a transformers.BertModel, at the second epoch of Lightning's "sanity checking":
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary() AssertionError: {'id': 0, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 23440896, 'shape': (0,), 'ds_shape': (30522, 768), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {702}}
System info
- OS: Ubuntu 20.04.5 LTS, kernel: Linux 5.15.0-1030-gcp, arch: x86-64
- GPU count and types: 1 x A100-SXM4-40GB
- Python version: 3.9.16
Relevant dependencies
- lightning==2.0.0
- transformers==4.27.1
- deepspeed==0.8.3
- torch==2.0.0
Hello @xiamengzhou @vlievin. Thank you for reporting the issue to us! This issue seems related to the parameter partitioning in ZeRO-3. Could you provide some scripts for us to reproduce this issue?
Hi @HeyangQin, my issue is definitely related to stage 3; I don't experience it with stage 2. I will try to isolate the bug in a small script next week.
Hi @HeyangQin, here is a small script with minimal code to reproduce my problem. I am not sure this problem is related to deepspeed; it might be an issue for lightning: in the code below, if I comment out the trainer.validate(...) call, everything works fine.
Minimal example
from typing import Any
import datasets
import deepspeed
import lightning.pytorch as pl
import torch
import transformers
from lightning.pytorch import strategies
class Collate:
    def __init__(self, tokenizer, max_length=512):
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __call__(self, examples: list[dict[str, Any]]) -> dict[str, torch.Tensor]:
        inputs = [example["question"] for example in examples]
        encodings = self.tokenizer(
            inputs,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        return dict(encodings)


class DummyModel(pl.LightningModule):
    def __init__(self, bert: transformers.BertModel):
        super().__init__()
        self.bert = bert

    def forward(self, batch: dict[str, torch.Tensor]):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        output = self.bert(input_ids, attention_mask=attention_mask)
        loss = output.pooler_output.mean()
        return loss

    def training_step(self, batch: dict[str, Any], *args: Any, **kwargs: Any):
        loss = self.forward(batch)
        return loss

    def validation_step(self, batch: dict[str, Any], *args: Any, **kwargs: Any):
        loss = self.forward(batch)
        return loss

    def configure_optimizers(self):
        return deepspeed.ops.adam.DeepSpeedCPUAdam(self.parameters(), lr=1e-5)


if __name__ == "__main__":
    dataset = datasets.DatasetDict(dict(
        train=datasets.load_dataset("squad", split="train[:1%]"),
        validation=datasets.load_dataset("squad", split="validation[:1%]"),
    ))
    model = DummyModel(transformers.AutoModel.from_pretrained("bert-base-uncased"))
    tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
    train_dataloader = torch.utils.data.DataLoader(
        dataset["train"],
        batch_size=32,
        collate_fn=Collate(tokenizer),
        num_workers=4,
    )
    val_dataloader = torch.utils.data.DataLoader(
        dataset["validation"],
        batch_size=32,
        collate_fn=Collate(tokenizer),
        num_workers=4,
    )
    trainer = pl.Trainer(
        devices=1,
        accelerator="gpu",
        precision="16-mixed",
        strategy=pl.strategies.DeepSpeedStrategy(
            stage=3,
            offload_optimizer=True,
            offload_parameters=True,
        ),
    )
    trainer.validate(model, dataloaders=val_dataloader)  # <-- no problem when commenting this out; the problem is probably related to `lightning`
    trainer.fit(model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)
Dependencies
Created an environment with
conda create --name=debug-deepspeed python=3.9
conda activate debug-deepspeed
pip install torch==2.0
pip install lightning==2.0
pip install transformers datasets deepspeed
pip freeze:
aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.6.2
arrow==1.2.3
async-timeout==4.0.2
attrs==22.2.0
beautifulsoup4==4.12.0
blessed==1.20.0
certifi @ file:///croot/certifi_1671487769961/work/certifi
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.1
croniter==1.3.8
datasets==2.10.1
dateutils==0.6.12
deepdiff==6.3.0
deepspeed==0.8.3
dill==0.3.6
dnspython==2.3.0
email-validator==1.3.1
fastapi==0.88.0
filelock==3.10.7
frozenlist==1.3.3
fsspec==2023.3.0
h11==0.14.0
hjson==3.1.0
httpcore==0.16.3
httptools==0.5.0
httpx==0.23.3
huggingface-hub==0.13.3
idna==3.4
inquirer==3.1.3
itsdangerous==2.1.2
Jinja2==3.1.2
lightning==2.0.0
lightning-cloud==0.5.32
lightning-utilities==0.8.0
lit==16.0.0
markdown-it-py==2.2.0
MarkupSafe==2.1.2
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.14
networkx==3.0
ninja==1.11.1
numpy==1.24.2
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
ordered-set==4.1.0
orjson==3.8.8
packaging==23.0
pandas==1.5.3
psutil==5.9.4
py-cpuinfo==9.0.0
pyarrow==11.0.0
pydantic==1.10.7
Pygments==2.14.0
PyJWT==2.6.0
python-dateutil==2.8.2
python-dotenv==1.0.0
python-editor==1.0.4
python-multipart==0.0.6
pytorch-lightning==2.0.0
pytz==2023.2
PyYAML==6.0
readchar==4.0.5
regex==2023.3.23
requests==2.28.2
responses==0.18.0
rfc3986==1.5.0
rich==13.3.3
six==1.16.0
sniffio==1.3.0
soupsieve==2.4
starlette==0.22.0
starsessions==1.3.0
sympy==1.11.1
tokenizers==0.13.2
torch==2.0.0
torchmetrics==0.11.4
tqdm==4.65.0
traitlets==5.9.0
transformers==4.27.3
triton==2.0.0
typing_extensions==4.5.0
ujson==5.7.0
urllib3==1.26.15
uvicorn==0.21.1
uvloop==0.17.0
watchfiles==0.19.0
wcwidth==0.2.6
websocket-client==1.5.1
websockets==10.4
xxhash==3.2.0
yarl==1.8.2
Same issue encountered; I use pure DeepSpeed without Lightning.
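For reference, a minimal ZeRO-3 configuration with CPU offload of the kind that hits this (roughly equivalent to the Lightning strategy in the script above) looks like the following sketch; the batch size and fp16 settings are placeholders, not an exact copy of any setup in this thread:
# Sketch of a ZeRO-3 + CPU-offload config; values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}

# Typical usage (model construction omitted):
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )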
Any update or temporary workaround regarding this issue?
Same issue encountered. It happens when I use accelerator.backward(loss) and then use the accelerator to get the model state dict with accelerator.get_state_dict(model).
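The pattern is roughly the following (a simplified sketch with a toy model and dataset; it assumes the Accelerator has been configured with a DeepSpeed ZeRO-3 plugin, e.g. via accelerate config):
import torch
from accelerate import Accelerator

# Toy model and data as placeholders; the Accelerator is assumed to be set up
# with a DeepSpeed ZeRO-3 plugin.
accelerator = Accelerator()
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # backward goes through the accelerator
    optimizer.step()
    optimizer.zero_grad()

# Gather the full state dict; under ZeRO-3 this collects the sharded parameters.
state_dict = accelerator.get_state_dict(model)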
Same issue.
@HeyangQin could you take a look at this? Thanks very much.
I've been trying to come up with a small standalone repro for this, but unfortunately it's not entirely clear what triggers the behavior. A few things I've noticed though that seem consistent with what others have said:
- Only happens when using zero stage 3
- Is triggered specifically when the model is trained for an epoch, then evaluated, then trained for the next epoch (on the first batch of the second epoch)
- Is triggered in the forward pass when calling an embedding
A couple of things that I was able to do to work around the issue:
- Vary the batch size. Some batch sizes seem to trigger the issue, others don't.
- Avoid calling model.eval(). For some reason this consistently prevents the error from happening for me (one way to apply this in Lightning is sketched below).
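For Lightning users, one way to apply the model.eval() workaround is to override the LightningModule hooks that switch the module into eval mode around validation. This is only a diagnostic sketch (the class name is illustrative), not a proper fix: it also keeps dropout and batch norm in training mode during validation.
import lightning.pytorch as pl


class KeepTrainModeDuringValidation(pl.LightningModule):
    # Sketch of the "avoid model.eval()" workaround: Lightning normally calls
    # self.eval() in on_validation_model_eval and restores train mode in
    # on_validation_model_train; making both hooks no-ops keeps the module in
    # train mode throughout validation.

    def on_validation_model_eval(self) -> None:
        pass  # intentionally skip self.eval()

    def on_validation_model_train(self) -> None:
        pass  # nothing to restore; eval mode was never entered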
Hello @vlievin. This issue has been fixed through a collaborative effort with the Lightning team. A special thanks to you for the nice reproduction script; it was super helpful. Please update both deepspeed and lightning to apply the fix.
Thank you @HeyangQin, this fix is super helpful!
Thanks for the fix @HeyangQin! Are there plans for a patch release next week containing this fix?
Hello @tgaddair. We plan to release a new deepspeed version by next week. I will keep you posted
I was still getting this error after updating to deepspeed 0.9.3 and lightning 2.0.3.
It seems to be fixed when I force my dataset size (both train and val!) to be an exact multiple of the batch size, which is consistent with one of @tgaddair's observations (rough sketch below).
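A rough sketch of what I mean, with a toy dataset and a placeholder batch size:
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy dataset and batch size as placeholders.
batch_size = 32
dataset = TensorDataset(torch.randn(1000, 8))

# Truncate so the dataset length is an exact multiple of the batch size.
usable = (len(dataset) // batch_size) * batch_size
dataset = Subset(dataset, range(usable))
assert len(dataset) % batch_size == 0

loader = DataLoader(dataset, batch_size=batch_size)
# (DataLoader's drop_last=True drops the trailing partial batch in a similar
#  way, though I haven't verified that it avoids the error.)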
Hi @pretzel583. Could you share a reproduction script so we can better investigate the issue? Thank you.
Same issue.
@pretzel583 What should the values of batch_size and mini_batch_size be, respectively?