OSError: (External) CUDA error(719), unspecified launch failure.

Open john09282922 opened this issue 1 year ago • 1 comments

Search before asking

  • [X] I have searched the issues and found no similar bug report.

Bug Component

Training

Describe the Bug

Warning: Unable to use JDE/FairMOT/ByteTrack, please install lap, for example: pip install lap, see https://github.com/gatagat/lap
Warning: Unable to use numba in PP-Tracking, please install numba, for example(python3.7): pip install numba==0.56.4
Warning: Unable to use numba in PP-Tracking, please install numba, for example(python3.7): pip install numba==0.56.4
Warning: Unable to use MOT metric, please install motmetrics, for example: pip install motmetrics, see https://github.com/longcw/py-motmetrics
Warning: Unable to use MCMOT metric, please install motmetrics, for example: pip install motmetrics, see https://github.com/longcw/py-motmetrics
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
[07/29 10:23:03] ppdet.data.source.coco INFO: Load [3271 samples valid, 11 samples invalid] in file dataset/mydata/train/annotations/_annotations.coco.json.
W0729 10:23:03.598515 797620 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.0, Runtime API Version: 11.8
W0729 10:23:03.599011 797620 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8.
[07/29 10:23:05] ppdet.utils.checkpoint INFO: ['fc.bias', 'fc.weight', 'last_conv.weight'] in pretrained weight is not used in the model, and its will not be loaded
[07/29 10:23:05] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/user/.cache/paddle/weights/PPHGNetV2_X_ssld_pretrained.pdparams
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion index_val >= 0 && index_val < input_index_dim_size failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion index_val >= 0 && index_val < input_index_dim_size failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion index_val >= 0 && index_val < input_index_dim_size failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion index_val >= 0 && index_val < input_index_dim_size failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion index_val >= 0 && index_val < input_index_dim_size failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion index_val >= 0 && index_val < input_index_dim_size failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion index_val >= 0 && index_val < input_index_dim_size failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion index_val >= 0 && index_val < input_index_dim_size failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion index_val >= 0 && index_val < input_index_dim_size failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion index_val >= 0 && index_val < input_index_dim_size failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion index_val >= 0 && index_val < input_index_dim_size failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Error: ../paddle/phi/kernels/funcs/gather.cu.h:189 Assertion index_val >= 0 && index_val < input_index_dim_size failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Traceback (most recent call last):
  File "/home/user/test1/PaddleDetection/tools/train.py", line 209, in <module>
    main()
  File "/home/user/test1/PaddleDetection/tools/train.py", line 205, in main
    run(FLAGS, cfg)
  File "/home/user/test1/PaddleDetection/tools/train.py", line 158, in run
    trainer.train(FLAGS.eval)
  File "/home/user/test1/PaddleDetection/ppdet/engine/trainer.py", line 577, in train
    outputs = model(data)
  File "/home/user/anaconda3/envs/paddle/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/user/test1/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 60, in forward
    out = self.get_loss()
  File "/home/user/test1/PaddleDetection/ppdet/modeling/architectures/detr.py", line 115, in get_loss
    return self._forward()
  File "/home/user/test1/PaddleDetection/ppdet/modeling/architectures/detr.py", line 93, in _forward
    detr_losses = self.detr_head(out_transformer, body_feats,
  File "/home/user/anaconda3/envs/paddle/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/user/test1/PaddleDetection/ppdet/modeling/heads/detr_head.py", line 453, in forward
    return self.loss(
  File "/home/user/anaconda3/envs/paddle/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/user/test1/PaddleDetection/ppdet/modeling/losses/detr_loss.py", line 434, in forward
    total_loss = super(DINOLoss, self).forward(
  File "/home/user/test1/PaddleDetection/ppdet/modeling/losses/detr_loss.py", line 388, in forward
    total_loss = self._get_prediction_loss(
  File "/home/user/test1/PaddleDetection/ppdet/modeling/losses/detr_loss.py", line 322, in _get_prediction_loss
    match_indices = self.matcher(
  File "/home/user/anaconda3/envs/paddle/lib/python3.9/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/user/test1/PaddleDetection/ppdet/modeling/transformers/matchers.py", line 178, in forward
    indices = [
  File "/home/user/test1/PaddleDetection/ppdet/modeling/transformers/matchers.py", line 179, in <listcomp>
    linear_sum_assignment(c.split(sizes, -1)[i].numpy())
OSError: (External) CUDA error(719), unspecified launch failure.
  [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:267)
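
The assertion printed in the log requires every gather index to satisfy `0 <= index_val < input_index_dim_size`. As a minimal sketch of that same condition evaluated on the CPU (this is not Paddle's implementation; the helper name `find_out_of_range` and the sample values are hypothetical), a check along these lines could be applied to index data before it reaches the GPU kernel:

```python
def find_out_of_range(indices, dim_size):
    """Return (position, value) pairs for indices outside [0, dim_size).

    Mirrors the condition in the logged assertion:
    index_val >= 0 && index_val < input_index_dim_size
    """
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not (0 <= idx < dim_size)
    ]


# Hypothetical sample values, for illustration only.
sample_indices = [0, 1, 0, -1]
print(find_out_of_range(sample_indices, dim_size=1))
```

An empty result means every index would pass the bounds check for the given dimension size.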

training error

Environment

OS: Linux
Ver: Paddle-gpu 2.5.0
cuda 11.2 ~ 12.0

Bug description confirmation

  • [X] I confirm that the bug reproduction steps, a description of the code changes, and environment information have been provided, and that the problem can be reproduced.

Are you willing to submit a PR?

  • [X] I'd like to help by submitting a PR!

john09282922, Jul 29 '23 01:07

Has this been resolved? I'm running into the same problem.

indulgence1, Dec 19 '23 06:12

Is this your own dataset? Which model are you using, and which version?

lyuwenyu, Mar 04 '24 08:03
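
To help answer the question about the dataset, the category ids in a COCO-style annotation file (such as the `_annotations.coco.json` named in the log) can be summarized with the standard library alone. This is a rough sketch, not part of PaddleDetection: the function names are hypothetical, and it only assumes the standard COCO keys `categories[].id` and `annotations[].category_id`. Whether the category ids relate to the failure reported here is an assumption, not something the thread confirms.

```python
import json
from collections import Counter


def summarize_categories(coco):
    """Summarize category ids in a COCO-style annotation dict."""
    declared = sorted(c["id"] for c in coco.get("categories", []))
    used = Counter(a["category_id"] for a in coco.get("annotations", []))
    # Ids referenced by annotations but never declared under "categories".
    undeclared = sorted(set(used) - set(declared))
    return {
        "declared_ids": declared,
        "used_counts": dict(used),
        "undeclared_ids": undeclared,
    }


def summarize_file(path):
    """Load a COCO-style JSON file from disk and summarize its category ids."""
    with open(path, "r", encoding="utf-8") as f:
        return summarize_categories(json.load(f))


# Hypothetical in-memory example, for illustration only.
demo = {
    "categories": [{"id": 0, "name": "a"}, {"id": 1, "name": "b"}],
    "annotations": [
        {"category_id": 1},
        {"category_id": 1},
        {"category_id": 2},
    ],
}
print(summarize_categories(demo))
```

Posting the resulting summary together with the model config name and the Paddle and PaddleDetection versions would cover the details requested above.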