
[Hint: 'CUBLAS_STATUS_EXECUTION_FAILED'. The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons.

Open fanweiya opened this issue 3 years ago • 5 comments

paddlepaddle-gpu 2.3.0.post110 run error

  File "/data/PaddleSeg-2.5/paddleseg/core/train.py", line 204, in train
    logits_list = ddp_model(images) if nranks > 1 else model(images)
  File "/data/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/data/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/data/PaddleSeg-2.5/paddleseg/models/segmenter.py", line 55, in forward
    feats, shape = self.backbone(x)
  File "/data/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/data/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/data/PaddleSeg-2.5/paddleseg/models/backbones/vision_transformer.py", line 276, in forward
    x = blk(x)
  File "/data/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/data/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/data/PaddleSeg-2.5/paddleseg/models/backbones/vision_transformer.py", line 119, in forward
    x = x + self.drop_path(self.attn(self.norm1(x)))
  File "/data/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/data/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/data/PaddleSeg-2.5/paddleseg/models/backbones/vision_transformer.py", line 73, in forward
    qkv = self.qkv(x).reshape((-1, N, 3, self.num_heads, C //
  File "/data/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/data/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/data/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/nn/layer/common.py", line 172, in forward
    x=input, weight=self.weight, bias=self.bias, name=self.name)
  File "/data/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/nn/functional/common.py", line 1542, in linear
    False)
OSError: (External) CUBLAS error(13). 
  [Hint: 'CUBLAS_STATUS_EXECUTION_FAILED'.  The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons.  To correct: check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed. ] (at /paddle/paddle/phi/kernels/funcs/blas/blas_impl.cu.h:35)
  [operator < matmul_v2 > error]

NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0

fanweiya avatar Jun 22 '22 06:06 fanweiya

This could be caused by the GPU running out of memory or by a kernel launch failure. Did you check your device with "nvidia-smi"? Which script did you use to run this program? And did you change any of our code?

shiyutang avatar Jun 22 '22 08:06 shiyutang

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:3B:00.0 Off |                  N/A |
| 30%   25C    P8    19W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:B1:00.0 Off |                  N/A |
| 51%   62C    P2   182W / 350W |  22705MiB / 24576MiB |     50%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2635      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2635      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A     29918      C   python                          22697MiB |

fanweiya avatar Jun 23 '22 03:06 fanweiya
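The nvidia-smi output above shows GPU 1 nearly full (22705MiB / 24576MiB) while GPU 0 is idle, which matters for the out-of-memory question. A minimal sketch for checking this from a script follows. It assumes `nvidia-smi` is on the PATH; the helper names `parse_gpu_memory`, `pick_freest_gpu`, and `query_gpus` are hypothetical and not part of PaddleSeg.

```python
import subprocess

# Ask nvidia-smi for machine-readable output: one "index, used, total" row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse rows such as '1, 22705, 24576' into dicts (values in MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({
            "index": index,
            "used_mib": used,
            "total_mib": total,
            "free_mib": total - used,
        })
    return gpus


def pick_freest_gpu(gpus):
    """Return the index of the GPU with the most free memory."""
    return max(gpus, key=lambda gpu: gpu["free_mib"])["index"]


def query_gpus():
    """Run nvidia-smi (assumed to be installed) and parse its output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    # Sample rows copied from the memory figures reported in this thread.
    sample = "0, 5, 24576\n1, 22705, 24576\n"
    gpus = parse_gpu_memory(sample)
    for gpu in gpus:
        print(gpu)
    print("GPU with most free memory:", pick_freest_gpu(gpus))
```

If the chosen index is, say, 0, the training command can be pinned to that device by setting `CUDA_VISIBLE_DEVICES=0` in the environment before launching `train.py`.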

python train.py --config custom_config/segmenter_vit_base_linear_xxx_512x512_160k.yml --do_eval --use_vdl --save_interval 500 --save_dir segmenter_vit_base_linear_xxx_20220610_512x512_160k

fanweiya avatar Jun 23 '22 03:06 fanweiya

I got the same error when running the SegFormer model.

nvidia-smi

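+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 36%   62C    P2   105W / 350W |    541MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1319      G   /usr/lib/xorg/Xorg                 35MiB |
|    0   N/A  N/A      2121      G   /usr/lib/xorg/Xorg                102MiB |
|    0   N/A  N/A      2250      G   /usr/bin/gnome-shell               27MiB |
|    0   N/A  N/A      5491      C   python3                           359MiB |
+-----------------------------------------------------------------------------+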

  from scipy.ndimage.interpolation import shift
/home/boe-malenia-23/anaconda3/PaddleSeg/paddleseg/transforms/functional.py:18: DeprecationWarning: Please use distance_transform_edt from the scipy.ndimage namespace, the scipy.ndimage.morphology namespace is deprecated.
  from scipy.ndimage.morphology import distance_transform_edt
2022-06-27 17:50:21 [INFO]
------------Environment Information-------------
platform: Linux-5.13.0-51-generic-x86_64-with-glibc2.17
Python: 3.8.10 (default, Jun 4 2021, 15:09:15) [GCC 7.5.0]
Paddle compiled with cuda: True
NVCC: Build cuda_11.6.r11.6/compiler.31057947_0
cudnn: 8.4
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: NVIDIA GeForce']
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PaddleSeg: 2.5.0
PaddlePaddle: 2.3.0
OpenCV: 4.5.3-openvino

2022-06-27 17:50:21 [INFO]
---------------Config Information---------------
batch_size: 1
distill_loss:
  coef:
  - 3
  types:
  - type: KLLoss
iters: 1000
loss:
  coef:
  - 1
  types:
  - ignore_index: 255
    type: CrossEntropyLoss
lr_scheduler:
  learning_rate: 6.0e-05
  power: 1
  type: PolynomialDecay
model:
  num_classes: 19
  pretrained: https://bj.bcebos.com/paddleseg/dygraph/mix_vision_transformer_b2.tar.gz
  type: SegFormer_B2
optimizer:
  beta1: 0.9
  beta2: 0.999
  type: AdamW
  weight_decay: 0.01
train_dataset:
  dataset_root: data/cityscape
  mode: train
  transforms:
  - target_size:
    - 1024
    - 1024
    type: Resize
  - type: RandomHorizontalFlip
  - type: Normalize
  type: Cityscapes
val_dataset:
  dataset_root: data/cityscape
  mode: val
  transforms:
  - target_size:
    - 1024
    - 1024
    type: Resize
  - type: Normalize
  type: Cityscapes

W0627 17:50:21.221050 27212 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.6, Runtime API Version: 11.2
W0627 17:50:21.221062 27212 gpu_context.cc:306] device: 0, cuDNN Version: 8.4.
2022-06-27 17:50:22 [INFO] Loading pretrained model from https://bj.bcebos.com/paddleseg/dygraph/mix_vision_transformer_b2.tar.gz
2022-06-27 17:50:22 [WARNING] linear_c4.proj.weight is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_c4.proj.bias is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_c3.proj.weight is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_c3.proj.bias is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_c2.proj.weight is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_c2.proj.bias is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_c1.proj.weight is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_c1.proj.bias is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_fuse._conv.weight is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_fuse._batch_norm.weight is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_fuse._batch_norm.bias is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_fuse._batch_norm._mean is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_fuse._batch_norm._variance is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_pred.weight is not in pretrained model
2022-06-27 17:50:22 [WARNING] linear_pred.bias is not in pretrained model
2022-06-27 17:50:22 [INFO] There are 332/347 variables loaded into SegFormer.
Traceback (most recent call last):
  File "train.py", line 230, in <module>
    main(args)
  File "train.py", line 206, in main
    train(
  File "/home/boe-malenia-23/anaconda3/PaddleSeg/paddleseg/core/train.py", line 204, in train
    logits_list = ddp_model(images) if nranks > 1 else model(images)
  File "/home/boe-malenia-23/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/boe-malenia-23/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/boe-malenia-23/anaconda3/PaddleSeg/paddleseg/models/segformer.py", line 83, in forward
    feats = self.backbone(x)
  File "/home/boe-malenia-23/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/boe-malenia-23/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/boe-malenia-23/anaconda3/PaddleSeg/paddleseg/models/backbones/mix_transformer.py", line 472, in forward
    x = self.forward_features(x)
  File "/home/boe-malenia-23/anaconda3/PaddleSeg/paddleseg/models/backbones/mix_transformer.py", line 439, in forward_features
    x = blk(x, H, W)
  File "/home/boe-malenia-23/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/boe-malenia-23/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/boe-malenia-23/anaconda3/PaddleSeg/paddleseg/models/backbones/mix_transformer.py", line 199, in forward
    x = x + self.drop_path(self.attn(self.norm1(x), H, W))
  File "/home/boe-malenia-23/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/boe-malenia-23/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/boe-malenia-23/anaconda3/PaddleSeg/paddleseg/models/backbones/mix_transformer.py", line 122, in forward
    q = self.q(x).reshape([B, N, self.num_heads,
  File "/home/boe-malenia-23/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/boe-malenia-23/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/boe-malenia-23/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/nn/layer/common.py", line 171, in forward
    out = F.linear(
  File "/home/boe-malenia-23/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/nn/functional/common.py", line 1541, in linear
    pre_bias = _C_ops.matmul_v2(x, weight, 'trans_x', False, 'trans_y',
OSError: (External) CUBLAS error(13).
  [Hint: 'CUBLAS_STATUS_EXECUTION_FAILED'. The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons. To correct: check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed. ] (at /paddle/paddle/phi/kernels/funcs/blas/blas_impl.cu.h:35)
  [operator < matmul_v2 > error]

command: python3 train.py --config configs/segformer/segformer_b1_cityscapes_1024x1024_160k.yml --do_eval --use_vdl --save_interval 500 --save_dir segformerB2

hannaSkyrim avatar Jun 27 '22 10:06 hannaSkyrim

I get the same error when running PaddleOCR.

AntonyChen89 avatar Sep 15 '22 07:09 AntonyChen89

To correct: check that the hardware, an appropriate driver version, and the cuBLAS library are correctly installed. This is likely related to the environment, for example the cuBLAS library and the driver. Try creating a new conda environment and installing Paddle for CUDA 10.2 and cuDNN 7.6.5, with a compatible driver version.

shiyutang avatar Dec 02 '22 09:12 shiyutang
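One way to act on the advice above is to print what the installed Paddle build reports about itself before and after rebuilding the environment. The following is a minimal sketch, not an official diagnostic. The function name `paddle_env_report` is hypothetical, and the calls `paddle.is_compiled_with_cuda()`, `paddle.version.cuda()`, `paddle.version.cudnn()`, and `paddle.utils.run_check()` are used as I understand them from Paddle 2.x, so verify them against the version you have installed.

```python
def paddle_env_report(full_check=False):
    """Return a dict describing the Paddle install in the current environment.

    With full_check=True, also run Paddle's built-in installation check,
    which executes a small program on the available device.
    """
    try:
        import paddle  # assumption: paddlepaddle or paddlepaddle-gpu is installed
    except Exception as exc:  # a broken install can fail with more than ImportError
        return {"installed": False, "error": repr(exc)}

    report = {
        "installed": True,
        "paddle": paddle.__version__,
        "compiled_with_cuda": paddle.is_compiled_with_cuda(),
        # CUDA / cuDNN versions the wheel was built against, not the driver's.
        "cuda": paddle.version.cuda(),
        "cudnn": paddle.version.cudnn(),
    }
    if full_check:
        paddle.utils.run_check()
    return report


if __name__ == "__main__":
    for key, value in paddle_env_report().items():
        print(f"{key}: {value}")
```

Comparing the reported `cuda` value with the "CUDA Version" shown by nvidia-smi, which is the highest version the driver supports, gives a quick signal of whether the wheel and the driver are a plausible match.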