CogVLM
CUDA error: an illegal memory access was encountered
System Info
GPU: A100-80G; CUDA: 12.1; Python: 3.8; PyTorch: 2.2.1
Who can help?
@1049451037
Information
- [x] The official example scripts
- [ ] My own modified scripts
Reproduction
Fine-tuning CogAgent on a single A100-80G, with the batch size changed to 1 and `disable_untrainable_params` in the script modified to:
```python
def disable_untrainable_params(self):
    total_trainable = 0
    enable = []
    # enable = ["encoder"]
    # enable = ["encoder", "cross_attention", "linear_proj", 'mlp.vision', 'rotary.vision', 'eoi', 'boi', 'vit']
    if self.args.use_ptuning:
        enable.extend(["ptuning"])
    if self.args.use_lora or self.args.use_qlora:
        pass
    enable.extend(["matrix_A", "matrix_B"])
    out_file = open("named_parameters.txt", "w", encoding="utf-8")
    for n, p in self.named_parameters():
        out_file.write("named_parameters: " + n)
        flag = False
        # only fine-tune the language-model part
        if n.lower().startswith("transformer.layers"):
            flag = "matrix_" in n.lower()
        elif n.lower().startswith("mixins.rotary.vision_"):
            flag = True
        if not flag:
            p.requires_grad_(False)
        else:
            total_trainable += p.numel()
            if "encoder" in n or "vit" in n:
                p.lr_scale = 0.1
            print_rank0(n)
        out_file.write(" enable: " + str(flag))
        out_file.write("\n")
    out_file.close()
    print_rank0("***** Total trainable parameters: " + str(total_trainable) + " *****")
```
When I run the fine-tuning script, GPU memory usage only reaches about 72 GB, and then it crashes with `RuntimeError: CUDA error: an illegal memory access was encountered`.
Error log:
```
[2024-04-07 10:55:05,908] [INFO] [checkpointing.py:539:forward] Activation Checkpointing Information
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:540:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:541:forward] ----contiguous Memory Checkpointing False with 6 total layers
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:543:forward] ----Synchronization False
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:544:forward] ----Profiling time in checkpointing False
logits: torch.Size([1, 400, 32000])
Traceback (most recent call last):
File "finetune_cogagent_demo.py", line 400, in <module>
model = training_main(
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 150, in training_main
iteration, skipped = train(model, optimizer,
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 349, in train
lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 482, in train_step
model.step()
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2169, in step
self._take_model_step(lr_kwargs)
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
self.optimizer.step()
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1898, in step
self._optimizer_step(i)
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1805, in _optimizer_step
self.optimizer.step()
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/optim/optimizer.py", line 385, in wrapper
out = func(*args, **kwargs)
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 191, in step
multi_tensor_applier(self.multi_tensor_adam, self._dummy_overflow_buf, [g_32, p_32, m_32, v_32],
File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/ops/adam/multi_tensor_apply.py", line 17, in __call__
return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1708025829503/work/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f44740f0d87 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f44740a175f in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f44741c28a8 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f44752859ec in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4475289b08 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f447528d23a in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f447528de79 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x7f44d1e3ebf4 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x7ea7 (0x7f44da7b7ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x3f (0x7f44da588a6f in /lib/x86_64-linux-gnu/libc.so.6)
[2024-04-07 10:55:20,255] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2708
[2024-04-07 10:55:20,255] [ERROR] [launch.py:322:sigkill_handler] ['/root/miniconda3/envs/cog-agent/bin/python', '-u', 'finetune_cogagent_demo.py', '--local_rank=0', '--experiment-name', 'finetune-cogagent-chat', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '2000', '--resume-dataloader', '--from_pretrained', '../sat_models/cogagent-chat', '--max_length', '400', '--lora_rank', '50', '--use_lora', '--local_tokenizer', '../pretrained_models/lmsys/vicuna-7b-v1.5', '--version', 'chat', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023', '--batch-size', '1'] exits with return code = -6
```
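Because the illegal memory access is reported asynchronously (the traceback lands in FusedAdam's `multi_tensor_applier`, which may not be the kernel that actually faulted), one thing I plan to try is forcing synchronous kernel launches to get a more precise stack trace. A minimal sketch, assuming the environment variable is set before `torch` is imported at the very top of `finetune_cogagent_demo.py` (setting `CUDA_LAUNCH_BLOCKING=1` on the launch command line would work the same way):

```python
# Hypothetical debugging preamble (not part of the repo's script): force
# synchronous CUDA launches so the Python traceback points at the kernel
# that actually faults, rather than a later op such as the fused Adam step.
# This must run before torch (or anything importing torch) is imported.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the env var is set
```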
Expected behavior
The error should not occur.
This looks similar to the situation in this issue: https://github.com/THUDM/CogVLM/issues/124
Isn't your case one where the CUDA setup is not installed correctly? The CUDA toolkit needs to be installed.
CUDA is installed. If I make the set of fine-tuned parameters smaller, e.g. tuning only ["matrix_A", "matrix_B"], the training script runs normally.
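Since the crash only shows up once more parameters are made trainable, and memory already sits around 72 GB before the failure inside `fused_adam`, I would first like to rule out the optimizer state simply overrunning the 80 GB card. A rough back-of-the-envelope sketch; it assumes bf16 weights/gradients with fp32 master weights plus Adam moments under ZeRO stage 1/2, and the parameter count below is only a placeholder, not a measured value:

```python
# Hypothetical estimate (not from the repo): extra GPU memory the Adam state
# needs for the enlarged trainable set, on top of the bf16 weights/grads
# already resident: fp32 master copy + exp_avg + exp_avg_sq = 12 bytes/param.
def extra_adam_state_gib(num_trainable: int) -> float:
    bytes_per_param = 4 + 4 + 4  # fp32 master, exp_avg, exp_avg_sq
    return num_trainable * bytes_per_param / 2**30

# Plug in the "Total trainable parameters" value printed by
# disable_untrainable_params (the number below is only a placeholder).
trainable = 500_000_000
print(f"~{extra_adam_state_gib(trainable):.1f} GiB of optimizer state")
```

If the estimate stays well below the remaining headroom, the failure is probably a genuine out-of-bounds access in one of the newly enabled parameter groups rather than memory pressure.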