
[BUG] pin_memory() raises an error when using pipeline parallelism

Open 00INDEX opened this issue 1 year ago • 8 comments

Describe the bug

When I use pipeline parallelism and ZeRO-1 offload at the same time, I get the following error:

│ /mnt/petrelfs/zhangshuo/miniconda3/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/st │
│ age_1_and_2.py:485 in __init__                                                                   │
│                                                                                                  │
│    482 │   │   self.dynamic_loss_scale = self.loss_scaler.dynamic                                │
│    483 │   │                                                                                     │
│    484 │   │   see_memory_usage("Before initializing optimizer states", force=True)              │
│ ❱  485 │   │   self.initialize_optimizer_states()                                                │
│    486 │   │   see_memory_usage("After initializing optimizer states", force=True)               │
│    487 │   │                                                                                     │
│    488 │   │   if dist.get_rank() == 0:                                                          │
│                                                                                                  │
│ /mnt/petrelfs/zhangshuo/miniconda3/envs/dl/lib/python3.9/site-packages/deepspeed/runtime/zero/st │
│ age_1_and_2.py:615 in initialize_optimizer_states                                                │
│                                                                                                  │
│    612 │   │   │   print(single_grad_partition.dtype)                                            │
│    613 │   │   │   print(single_grad_partition.device)                                           │
│    614 │   │   │   print(single_grad_partition.shape)                                            │
│ ❱  615 │   │   │   self.single_partition_of_fp32_groups[i].grad = get_accelerator().pin_memory(  │
│    616 │   │   │   │   single_grad_partition) if self.cpu_offload else single_grad_partition     │
│    617 │   │                                                                                     │
│    618 │   │   self.optimizer.step()                                                             │
│                                                                                                  │
│ /mnt/petrelfs/zhangshuo/miniconda3/envs/dl/lib/python3.9/site-packages/deepspeed/accelerator/cud │
│ a_accelerator.py:217 in pin_memory                                                               │
│                                                                                                  │
│   214 │   │   return torch.cuda.LongTensor                                                       │
│   215 │                                                                                          │
│   216 │   def pin_memory(self, tensor):                                                          │
│ ❱ 217 │   │   return tensor.pin_memory()                                                         │
│   218 │                                                                                          │
│   219 │   def on_accelerator(self, tensor):                                                      │
│   220 │   │   device_str = str(tensor.device)                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: OS call failed or operation not supported on this OS
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

And this error only happens when my model is really large (maybe 7 billion parameters).

To Reproduce

My code:

# model2.py
import torch
import torch.nn as nn
from torch.utils.data import Dataset

import deepspeed
from deepspeed.pipe import LayerSpec
from deepspeed.pipe import PipelineModule
from deepspeed.runtime.pipe.topology import PipeModelDataParallelTopology

class Layer1(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1, 1000000000)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)
    
class Layer2(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1000000000, 1)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

class DummyDataset(Dataset):
    def __init__(self):
        pass
        
    def __len__(self):
        return 100
    
    def __getitem__(self, idx):
        return torch.tensor([idx]).float(), torch.tensor([idx]).float()
    
deepspeed.init_distributed(dist_backend='nccl', init_method="env://")
model = PipelineModule(
    layers=[LayerSpec(Layer1), LayerSpec(Layer2)],
    num_stages=2,
    topology=PipeModelDataParallelTopology(num_pp=2, num_dp=1, num_mp=1),
    loss_fn=lambda x, y: (print(x), x.sum())[1]
)
dataset = DummyDataset()
engine, optimizer, training_dataloader, _ = deepspeed.initialize(
        model=model,
        model_parameters=[p for p in model.parameters() if p.requires_grad],
        training_data=dataset,
        config={
            "train_micro_batch_size_per_gpu": 1,
            "train_batch_size": 1,
            "gradient_accumulation_steps": 1,
            "optimizer": {
                "type": "Adam",
                "params": {
                "lr": 0.001,
                "betas": [
                    0.8,
                    0.999
                ],
                "eps": 1e-8,
                "weight_decay": 3e-7
                }
            },
            "zero_optimization": {
                "stage": 1,
                "offload_optimizer": {
                    "device": "cpu"
                },
                "contiguous_gradients": True,
                "overlap_comm": True,
                "sub_group_size": 1e12,
                "reduce_bucket_size": "auto"
            }
        }
    )
print(engine.train_batch())

My command:

CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 model2.py

Maybe a fix? I found that this error is caused by the pin_memory() calls at deepspeed/runtime/zero/stage_1_and_2.py:426 and deepspeed/runtime/zero/stage_1_and_2.py:611. I also found that if I keep retrying the call, it sometimes succeeds, so I changed the code to keep trying:

# deepspeed/runtime/zero/stage_1_and_2.py:611
while True:
    try:
        self.single_partition_of_fp32_groups[i].grad = get_accelerator().pin_memory(
            single_grad_partition) if self.cpu_offload else single_grad_partition
        break
    except RuntimeError:
        continue

This did fix my problem, but I'm concerned that there may be some additional side effects.
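A bounded variant of the same workaround (my own sketch, not code from DeepSpeed) would avoid spinning forever when pinning never succeeds; the retry cap and sleep below are arbitrary:

# hypothetical bounded retry at the same spot in stage_1_and_2.py (sketch only)
import time

for _attempt in range(10):  # arbitrary cap instead of `while True`
    try:
        self.single_partition_of_fp32_groups[i].grad = get_accelerator().pin_memory(
            single_grad_partition) if self.cpu_offload else single_grad_partition
        break
    except RuntimeError:
        time.sleep(1)  # give the OS a moment before retrying
else:
    # give up on pinning and fall back to ordinary pageable host memory
    self.single_partition_of_fp32_groups[i].grad = single_grad_partition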

00INDEX · May 07 '23 14:05

  1. I also encountered the same problem. Your solution works for opt-1.3B, but when training gpt-3.5B it gets stuck in the loop for a long time. The larger the model, the longer it stays stuck; llama-7B has been stuck for an hour and still has not run successfully. Do you have any other solutions now?
  2. I think it may be related to the amount of RAM. Can you tell me your GPU and RAM sizes? My device is 2*A6000 (48G) and 128G of RAM; I would like to know the GPU and RAM sizes you used when training the 7B model.

xiaotingyun · May 22 '23 08:05

@tjruwase My setup is 8*A100 (80G) and 1024G of RAM. I have also found another solution: the pin_memory: false setting in ds_config didn't do anything, so I added a patch that makes get_accelerator().pin_memory a no-op. See patch_deepspeed()
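The patch itself is only linked above, so here is just a minimal sketch of what such a monkey-patch might look like, assuming it simply turns the accelerator's pin_memory into a no-op (not necessarily the actual patch_deepspeed()):

# hypothetical no-op patch (sketch; the real patch_deepspeed() is not shown in this thread)
from deepspeed.accelerator import get_accelerator

def patch_deepspeed():
    # Override the singleton accelerator's pin_memory so tensors are returned
    # unchanged instead of being copied into page-locked host memory.
    get_accelerator().pin_memory = lambda tensor: tensor

patch_deepspeed()  # call before deepspeed.initialize(...)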

00INDEX · May 25 '23 13:05

@00INDEX, apologies for the delayed response. The underlying problem is that cpu_offload was originally developed for zero-stage-2, and is better tested for that scenario. Did you try using zero-stage-2?

tjruwase · May 25 '23 17:05

@xiaotingyun, are you still having this problem?

tjruwase · Jun 02 '23 19:06

@tjruwase I can run successfully using 00INDEX's method. However, if I don't modify the source code, then as long as offload is turned on, whether with zero-2 or zero-3, there is a high probability that even a 1.3B model will report this error.

xiaotingyun · Jun 03 '23 13:06

I found that when the model is very large, the pinned tensor is also very large (it contains billions of elements). It seems that pinning such a large tensor causes this error. @tjruwase
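For a sense of scale (my own rough arithmetic, assuming the entire fp32 master copy ends up in a single pinned partition, as with dp=1 in the repro above):

# rough size of the tensor being pinned for a ~7B-parameter model (assumes dp=1, no sharding)
params = 7e9                 # ~7 billion parameters
fp32_bytes = params * 4      # fp32 master weights: 4 bytes per parameter
print(fp32_bytes / 2**30)    # ~26 GiB of page-locked host memory for one tensor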

x54-729 · Jun 08 '23 03:06

@x54-729, yes, memory pinning can incur high memory consumption and cause system instability. Additionally, as @00INDEX previously observed, there is a bug where the pin_memory: false setting in ds_config is ignored by DeepSpeed. We will fix this asap.
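For reference, this is where that setting sits in a config like the one in the repro above (a sketch of the intended usage; per this thread the flag is currently ignored):

# fragment of the DeepSpeed config dict from the repro, showing the setting in question
ds_config_zero_fragment = {
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": False,  # intended to keep offload buffers in pageable memory; currently ignored
        },
    },
}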

tjruwase · Jun 08 '23 16:06

Is your machine AMD-based? I have found that AMD causes this problem.

XChen-Zero · Jun 14 '23 02:06

@00INDEX, @x54-729, @xiaotingyun, @XChen-Zero please try the linked PR if needed.

tjruwase · Aug 10 '23 11:08