Inference with the model fails with CUDA error(719), unspecified launch failure.

Open xujiang1 opened this issue 1 year ago • 3 comments

Please ask your question

When I run inference with the model I trained with SFT, it fails at random with the following error:

Error: ../paddle/phi/kernels/gpu/embedding_kernel.cu:41 Assertion `id < N` failed. Id should smaller than 2050 but received an id value: 2050.
Error: ../paddle/phi/kernels/gpu/embedding_kernel.cu:41 Assertion `id < N` failed. Id should smaller than 2050 but received an id value: 2050.
Error: ../paddle/phi/kernels/gpu/embedding_kernel.cu:41 Assertion `id < N` failed. Id should smaller than 2050 but received an id value: 2050.
Error: ../paddle/phi/kernels/gpu/embedding_kernel.cu:41 Assertion `id < N` failed. Id should smaller than 2050 but received an id value: 2050.
Traceback (most recent call last):
  File "/hy-tmp/PaddleNLP/llm/predictor.py", line 944, in <module>
    predict()
  File "/hy-tmp/PaddleNLP/llm/predictor.py", line 888, in predict
    outputs = predictor.predict(batch_source_text)
  File "/hy-tmp/PaddleNLP/llm/predictor.py", line 183, in predict
    predictions = self._infer(tokenized_source)
  File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddle/fluid/dygraph/base.py", line 347, in _decorate_function
    return func(*args, **kwargs)
  File "/hy-tmp/PaddleNLP/llm/predictor.py", line 229, in _infer
    result = self.model.generate(
  File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddle/fluid/dygraph/base.py", line 347, in _decorate_function
    return func(*args, **kwargs)
  File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddlenlp/generation/utils.py", line 941, in generate
    return self.sample(
  File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddlenlp/generation/utils.py", line 1141, in sample
    outputs = self(**model_inputs)
  File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddlenlp/transformers/opt/modeling.py", line 1058, in forward
    outputs = self.opt(
  File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddlenlp/transformers/opt/modeling.py", line 914, in forward
    attention_mask = self._prepare_decoder_attention_mask(attention_mask, input_shape, past_key_values_length)
  File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddlenlp/transformers/opt/modeling.py", line 790, in _prepare_decoder_attention_mask
    if input_shape[-1] > 1:
OSError: (External) CUDA error(719), unspecified launch failure. 
  [Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ../paddle/phi/backends/gpu/gpu_context.cc:544)
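
The assertion lines above report an embedding lookup into a table with 2050 rows that received index 2050, so CUDA error(719) is probably a downstream symptom of that out-of-range index. CUDA kernels run asynchronously, so the Python line at the bottom of the traceback is not necessarily where the failure happened. Below is a minimal sketch of a host-side check that could be run on the ids before they reach the model. The name `find_out_of_range_ids` is hypothetical and not part of PaddleNLP; it works on plain nested lists so that it does not depend on paddle.

```python
from typing import List, Sequence, Tuple


def find_out_of_range_ids(
    batch_ids: Sequence[Sequence[int]], num_embeddings: int
) -> List[Tuple[int, int, int]]:
    """Return (row, column, id) for every id outside [0, num_embeddings)."""
    bad = []
    for row, ids in enumerate(batch_ids):
        for col, idx in enumerate(ids):
            if idx < 0 or idx >= num_embeddings:
                bad.append((row, col, idx))
    return bad


if __name__ == "__main__":
    # 2050 is the table size N reported in the assertion message above.
    # The sample ids are made up for illustration.
    batch = [[2, 17, 2049], [2, 2050, 5]]
    for row, col, idx in find_out_of_range_ids(batch, num_embeddings=2050):
        print(f"row {row}, column {col}: id {idx} is out of range")
```

Running a check like this on the CPU turns the device-side assertion into an ordinary Python result that names the offending sample, instead of leaving the process in a state where it has to be relaunched.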

My inference command:

python predictor.py \
    --model_name_or_path ./checkpoints/opt_sft_ckpts_125m \
    --data_file ./data/tuice_paddlenlp_part-00000.json \
    --dtype float16  \
    --batch_size 80  \
    --output_file ./predictor_out/opt_125m_sft_pdnlp_part-00000.json

The model is opt_125m after SFT training.
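
One possible explanation, offered as an assumption and not a confirmed diagnosis: OPT uses learned position embeddings, and in the reference implementation that table has max_position_embeddings + 2 rows, which for opt-125m is 2048 + 2 = 2050, the same N as in the assertion. If that holds for this checkpoint, the error would show up whenever the prompt length plus the number of generated tokens goes past 2048, which could also explain why it looks random across batches. The sketch below shows the length arithmetic under that assumption; both function names and the default of 2048 are illustrative and not taken from predictor.py.

```python
def max_new_tokens_allowed(prompt_len: int, max_positions: int = 2048) -> int:
    """Number of tokens that can still be generated before positions run out."""
    return max(0, max_positions - prompt_len)


def truncate_prompt(ids, max_new_tokens, max_positions=2048):
    """Keep the most recent tokens so that prompt + generation fits the budget."""
    budget = max_positions - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Slicing from the end keeps the whole prompt when it is already short enough.
    return list(ids[-budget:])
```

Truncating from the left keeps the end of the prompt, which is usually what the model continues from; whether that is acceptable depends on the data. The same effect could be had by lowering whatever source-length and generation-length options the predictor exposes.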

xujiang1, Oct 24 '23 09:10

Same problem here.

CX26-CX, Oct 24 '23 12:10

Same problem. With some data it reproduces every time; it triggers embedding_kernel.cu:41 Assertion `id < N` failed.

mmx110, Dec 05 '23 09:12

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot], Feb 04 '24 00:02

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot], Feb 19 '24 00:02

Same problem here.

xxz-wow, Aug 14 '24 01:08