CogVLM
CUDA error: an illegal memory access was encountered
System Info
GPU: A100-80G; CUDA: 12.1; Python: 3.8; PyTorch: 2.2.1
Who can help?
@1049451037
Information
- [x] The official example scripts
- [ ] My own modified scripts
Reproduction
Fine-tuning CogAgent on a single A100-80G, with the batch size changed to 1 and `disable_untrainable_params` in the script modified to:
```python
def disable_untrainable_params(self):
    total_trainable = 0
    enable = []
    # enable = ["encoder"]
    # enable = ["encoder", "cross_attention", "linear_proj", 'mlp.vision', 'rotary.vision', 'eoi', 'boi', 'vit']
    if self.args.use_ptuning:
        enable.extend(["ptuning"])
    if self.args.use_lora or self.args.use_qlora:
        pass
    enable.extend(["matrix_A", "matrix_B"])
    out_file = open("named_parameters.txt", "w", encoding="utf-8")
    for n, p in self.named_parameters():
        out_file.write("named_parameters: " + n)
        flag = False
        # only fine-tune the language-model part
        if n.lower().startswith("transformer.layers"):
            flag = "matrix_" in n.lower()
        elif n.lower().startswith("mixins.rotary.vision_"):
            flag = True
        if not flag:
            p.requires_grad_(False)
        else:
            total_trainable += p.numel()
            if "encoder" in n or "vit" in n:
                p.lr_scale = 0.1
            print_rank0(n)
        out_file.write(" enable: " + str(flag))
        out_file.write("\n")
    out_file.close()
    print_rank0("***** Total trainable parameters: " + str(total_trainable) + " *****")
```
When I run the fine-tuning script, GPU memory usage only reaches about 72 GB, and then it crashes with `RuntimeError: CUDA error: an illegal memory access was encountered`.
Error log:
```
[2024-04-07 10:55:05,908] [INFO] [checkpointing.py:539:forward] Activation Checkpointing Information
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:540:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:541:forward] ----contiguous Memory Checkpointing False with 6 total layers
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:543:forward] ----Synchronization False
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:544:forward] ----Profiling time in checkpointing False
logits: torch.Size([1, 400, 32000])
Traceback (most recent call last):
File "finetune_cogagent_demo.py", line 400, in <module>
model = training_main(
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 150, in training_main
iteration, skipped = train(model, optimizer,
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 349, in train
lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 482, in train_step
model.step()
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2169, in step
self._take_model_step(lr_kwargs)
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
self.optimizer.step()
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1898, in step
self._optimizer_step(i)
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1805, in _optimizer_step
self.optimizer.step()
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/optim/optimizer.py", line 385, in wrapper
out = func(*args, **kwargs)
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 191, in step
multi_tensor_applier(self.multi_tensor_adam, self._dummy_overflow_buf, [g_32, p_32, m_32, v_32],
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/ops/adam/multi_tensor_apply.py", line 17, in __call__
return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1708025829503/work/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f44740f0d87 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f44740a175f in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f44741c28a8 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f44752859ec in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4475289b08 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f447528d23a in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f447528de79 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x7f44d1e3ebf4 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x7ea7 (0x7f44da7b7ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x3f (0x7f44da588a6f in /lib/x86_64-linux-gnu/libc.so.6)
[2024-04-07 10:55:20,255] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2708
[2024-04-07 10:55:20,255] [ERROR] [launch.py:322:sigkill_handler] ['/root/miniconda3/envs/cog-agent/bin/python', '-u', 'finetune_cogagent_demo.py', '--local_rank=0', '--experiment-name', 'finetune-cogagent-chat', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '2000', '--resume-dataloader', '--from_pretrained', '../sat_models/cogagent-chat', '--max_length', '400', '--lora_rank', '50', '--use_lora', '--local_tokenizer', '../pretrained_models/lmsys/vicuna-7b-v1.5', '--version', 'chat', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023', '--batch-size', '1'] exits with return code = -6
```
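Because the illegal memory access is reported asynchronously (the traceback lands in FusedAdam's `multi_tensor_applier`, which may not be the kernel that actually faulted), one thing I plan to try is forcing synchronous kernel launches to get a more precise stack trace. A minimal sketch, assuming the environment variable is set before `torch` is imported at the very top of `finetune_cogagent_demo.py` (setting `CUDA_LAUNCH_BLOCKING=1` on the launch command line would work the same way):

```python
# Hypothetical debugging preamble (not part of the repo's script): force
# synchronous CUDA launches so the Python traceback points at the kernel
# that actually faults, rather than a later op such as the fused Adam step.
# This must run before torch (or anything importing torch) is imported.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the env var is set
```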
Expected behavior
The error should not occur.
This looks similar to the situation in this issue: https://github.com/THUDM/CogVLM/issues/124
Isn't your case one where the CUDA setup is not installed correctly? The CUDA toolkit needs to be installed.
CUDA is installed. If I make the set of fine-tuned parameters smaller, e.g. tuning only ["matrix_A", "matrix_B"], the training script runs normally.
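Since the crash only shows up once more parameters are made trainable, and memory already sits around 72 GB before the failure inside `fused_adam`, I would first like to rule out the optimizer state simply overrunning the 80 GB card. A rough back-of-the-envelope sketch; it assumes bf16 weights/gradients with fp32 master weights plus Adam moments under ZeRO stage 1/2, and the parameter count below is only a placeholder, not a measured value:

```python
# Hypothetical estimate (not from the repo): extra GPU memory the Adam state
# needs for the enlarged trainable set, on top of the bf16 weights/grads
# already resident: fp32 master copy + exp_avg + exp_avg_sq = 12 bytes/param.
def extra_adam_state_gib(num_trainable: int) -> float:
    bytes_per_param = 4 + 4 + 4  # fp32 master, exp_avg, exp_avg_sq
    return num_trainable * bytes_per_param / 2**30

# Plug in the "Total trainable parameters" value printed by
# disable_untrainable_params (the number below is only a placeholder).
trainable = 500_000_000
print(f"~{extra_adam_state_gib(trainable):.1f} GiB of optimizer state")
```

If the estimate stays well below the remaining headroom, the failure is probably a genuine out-of-bounds access in one of the newly enabled parameter groups rather than memory pressure.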