DeepSpeed
[BUG] pin_memory() raises an error when using pipeline parallelism
Describe the bug
When I use pipeline parallelism and ZeRO-1 offload at the same time, I get the following error:
│ /mnt/petrelfs/zhangshuo/miniconda3/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/st │
│ age_1_and_2.py:485 in __init__ │
│ │
│ 482 │ │ self.dynamic_loss_scale = self.loss_scaler.dynamic │
│ 483 │ │ │
│ 484 │ │ see_memory_usage("Before initializing optimizer states", force=True) │
│ ❱ 485 │ │ self.initialize_optimizer_states() │
│ 486 │ │ see_memory_usage("After initializing optimizer states", force=True) │
│ 487 │ │ │
│ 488 │ │ if dist.get_rank() == 0: │
│ │
│ /mnt/petrelfs/zhangshuo/miniconda3/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/st │
│ age_1_and_2.py:615 in initialize_optimizer_states │
│ │
│ 612 │ │ │ print(single_grad_partition.dtype) │
│ 613 │ │ │ print(single_grad_partition.device) │
│ 614 │ │ │ print(single_grad_partition.shape) │
│ ❱ 615 │ │ │ self.single_partition_of_fp32_groups[i].grad = get_accelerator().pin_memory( │
│ 616 │ │ │ │ single_grad_partition) if self.cpu_offload else single_grad_partition │
│ 617 │ │ │
│ 618 │ │ self.optimizer.step() │
│ │
│ /mnt/petrelfs/zhangshuo/miniconda3/envs/dl/lib/python3.9/site-packages/deepspeed/accelerator/cud │
│ a_accelerator.py:217 in pin_memory │
│ │
│ 214 │ │ return torch.cuda.LongTensor │
│ 215 │ │
│ 216 │ def pin_memory(self, tensor): │
│ ❱ 217 │ │ return tensor.pin_memory() │
│ 218 │ │
│ 219 │ def on_accelerator(self, tensor): │
│ 220 │ │ device_str = str(tensor.device) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: OS call failed or operation not supported on this OS
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
And this error only happens when my model is really large (maybe 7 billion parameters).
To Reproduce
My code:
# model2.py
import torch
import torch.nn as nn
from torch.utils.data import Dataset

import deepspeed
from deepspeed.pipe import LayerSpec
from deepspeed.pipe import PipelineModule
from deepspeed.runtime.pipe.topology import PipeModelDataParallelTopology


class Layer1(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1, 1000000000)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)


class Layer2(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1000000000, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)


class DummyDataset(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return torch.tensor([idx]).float(), torch.tensor([idx]).float()


deepspeed.init_distributed(dist_backend='nccl', init_method="env://")

model = PipelineModule(
    layers=[LayerSpec(Layer1), LayerSpec(Layer2)],
    num_stages=2,
    topology=PipeModelDataParallelTopology(num_pp=2, num_dp=1, num_mp=1),
    loss_fn=lambda x, y: (print(x), x.sum())[1]
)
dataset = DummyDataset()

engine, optimizer, training_dataloader, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    training_data=dataset,
    config={
        "train_micro_batch_size_per_gpu": 1,
        "train_batch_size": 1,
        "gradient_accumulation_steps": 1,
        "optimizer": {
            "type": "Adam",
            "params": {
                "lr": 0.001,
                "betas": [0.8, 0.999],
                "eps": 1e-8,
                "weight_decay": 3e-7
            }
        },
        "zero_optimization": {
            "stage": 1,
            "offload_optimizer": {
                "device": "cpu"
            },
            "contiguous_gradients": True,
            "overlap_comm": True,
            "sub_group_size": 1e12,
            "reduce_bucket_size": "auto"
        }
    }
)

print(engine.train_batch())
My command:
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 model2.py
Maybe a fix?
I found that this error is caused by the pin_memory() function at deepspeed/runtime/zero/stage_1_and_2.py:426 and deepspeed/runtime/zero/stage_1_and_2.py:611. I also found that if I keep calling this function, it eventually succeeds, so I changed the code to keep retrying:
# deepspeed/runtime/zero/stage_1_and_2.py:611
while True:
    try:
        self.single_partition_of_fp32_groups[i].grad = get_accelerator().pin_memory(
            single_grad_partition) if self.cpu_offload else single_grad_partition
        break
    except RuntimeError:
        continue
This did fix my problem, but I'm concerned that there may be some additional side effects.
- I also encountered the same problem. Your solution works for opt-1.3B, but when training gpt-3.5B it gets stuck in the loop for a long time. The larger the model, the longer it stays stuck; llama-7B has been stuck for an hour and still has not run successfully. Do you have any other solutions now?
- I think it may be related to the amount of RAM. Can you tell me your GPU and RAM sizes? My setup is 2×A6000 (48 GB) with 128 GB of RAM, and I'd like to know the GPU and RAM sizes you used when training the 7B model.
@tjruwase My setup is 8×A100 (80 GB) with 1024 GB of RAM. I have also found another solution: the pin_memory: false setting in ds_config didn't do anything, so I added a patch that makes get_accelerator().pin_memory a no-op. See patch_deepspeed().
@00INDEX, apologies for the delayed response. The underlying problem is that cpu_offload was originally developed for zero-stage-2, and is better tested for that scenario. Did you try using zero-stage-2?
@xiaotingyun, are you still having this problem?
@tjruwase I can get it running with 00INDEX's method. However, if I don't modify the source code, then as long as offload is turned on, whether with zero-2 or zero-3, there is a high probability that even a 1.3B model will report this error.
I found out that when the model is very large, the pinned tensor is also very large (billions of elements). It seems that pinning this overly large tensor causes the error. @tjruwase
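For anyone who wants to sanity-check that claim outside of DeepSpeed, here is a rough standalone sketch (not from this thread) that tries to pin a CPU tensor roughly the size of one of the repro's 1e9-parameter layers, using the same tensor.pin_memory() call that appears in the traceback:
# Hypothetical standalone test: pin a ~4 GB fp32 tensor, roughly the size of an
# fp32 partition for the 1,000,000,000-parameter linear layers in model2.py.
import torch

numel = 1_000_000_000  # ~4 GB as float32
t = torch.empty(numel, dtype=torch.float32)
try:
    t = t.pin_memory()  # same call as cuda_accelerator.pin_memory() in the trace
    print("pinning succeeded")
except RuntimeError as e:
    print("pinning failed:", e)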
@x54-729, yes, memory pinning can incur high memory consumption and cause system instability. Additionally, as @00INDEX previously observed, there is a bug where the pin_memory: false setting in ds_config is ignored by DeepSpeed. We will fix this ASAP.
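For context, that setting lives in the offload_optimizer section of the ZeRO config. Here is a fragment of the repro config above with the flag written out explicitly (whether it is actually honored is exactly the bug being discussed):
# Fragment of the ds_config from model2.py with pin_memory spelled out; per this
# thread, the flag currently has no effect when optimizer offload is enabled.
zero_section = {
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": False,  # the setting reported as being ignored
        },
        "contiguous_gradients": True,
        "overlap_comm": True,
    }
}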
Is your CPU AMD? I find that AMD platforms cause this problem.
@00INDEX, @x54-729, @xiaotingyun, @XChen-Zero please try the linked PR if needed.