PaddleNLP
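CUDA error(719), unspecified launch failure when running model inference
Please ask your question
When I run inference with the model produced by SFT training, the following error occurs at random: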
Error: ../paddle/phi/kernels/gpu/embedding_kernel.cu:41 Assertion `id < N` failed. Id should smaller than 2050 but received an id value: 2050.
Error: ../paddle/phi/kernels/gpu/embedding_kernel.cu:41 Assertion `id < N` failed. Id should smaller than 2050 but received an id value: 2050.
Error: ../paddle/phi/kernels/gpu/embedding_kernel.cu:41 Assertion `id < N` failed. Id should smaller than 2050 but received an id value: 2050.
Error: ../paddle/phi/kernels/gpu/embedding_kernel.cu:41 Assertion `id < N` failed. Id should smaller than 2050 but received an id value: 2050.
Traceback (most recent call last):
File "/hy-tmp/PaddleNLP/llm/predictor.py", line 944, in <module>
predict()
File "/hy-tmp/PaddleNLP/llm/predictor.py", line 888, in predict
outputs = predictor.predict(batch_source_text)
File "/hy-tmp/PaddleNLP/llm/predictor.py", line 183, in predict
predictions = self._infer(tokenized_source)
File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddle/fluid/dygraph/base.py", line 347, in _decorate_function
return func(*args, **kwargs)
File "/hy-tmp/PaddleNLP/llm/predictor.py", line 229, in _infer
result = self.model.generate(
File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddle/fluid/dygraph/base.py", line 347, in _decorate_function
return func(*args, **kwargs)
File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddlenlp/generation/utils.py", line 941, in generate
return self.sample(
File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddlenlp/generation/utils.py", line 1141, in sample
outputs = self(**model_inputs)
File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddlenlp/transformers/opt/modeling.py", line 1058, in forward
outputs = self.opt(
File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddlenlp/transformers/opt/modeling.py", line 914, in forward
attention_mask = self._prepare_decoder_attention_mask(attention_mask, input_shape, past_key_values_length)
File "/usr/local/miniconda3/envs/paddlenlp/lib/python3.10/site-packages/paddlenlp/transformers/opt/modeling.py", line 790, in _prepare_decoder_attention_mask
if input_shape[-1] > 1:
OSError: (External) CUDA error(719), unspecified launch failure.
[Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ../paddle/phi/backends/gpu/gpu_context.cc:544)
My inference command:
python predictor.py \
--model_name_or_path ./checkpoints/opt_sft_ckpts_125m \
--data_file ./data/tuice_paddlenlp_part-00000.json \
--dtype float16 \
--batch_size 80 \
--output_file ./predictor_out/opt_125m_sft_pdnlp_part-00000.json
The model is opt_125m after SFT training.
Same problem here.
Same problem here. It reproduces every time on certain inputs and triggers `embedding_kernel.cu:41 Assertion id < N failed`.
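A possible way to narrow this down, offered as a guess rather than a confirmed diagnosis: the bound 2050 in the assertion looks like the size of OPT's position-embedding table (2048 positions plus an offset of 2) rather than the vocabulary size, so the failing inputs may simply be the ones whose prompt length plus generated length goes past 2048. The sketch below flags such samples before they reach the GPU. The function names, the `max_positions` default, and the sample data are hypothetical and not part of PaddleNLP; the real limit should be read from the checkpoint's config.

```python
from typing import List, Sequence

# Assumption: the checkpoint allows 2048 positions; the embedding table
# reported in the assertion (2050) would then be 2048 plus an offset of 2.
MAX_POSITIONS = 2048


def find_overlong_samples(
    batch_input_ids: Sequence[Sequence[int]],
    max_new_tokens: int,
    max_positions: int = MAX_POSITIONS,
) -> List[int]:
    """Return indices of samples whose prompt plus generated tokens
    would exceed max_positions."""
    return [
        i
        for i, ids in enumerate(batch_input_ids)
        if len(ids) + max_new_tokens > max_positions
    ]


def find_out_of_vocab_samples(
    batch_input_ids: Sequence[Sequence[int]],
    vocab_size: int,
) -> List[int]:
    """Return indices of samples containing a token id outside [0, vocab_size)."""
    return [
        i
        for i, ids in enumerate(batch_input_ids)
        if any(t < 0 or t >= vocab_size for t in ids)
    ]


if __name__ == "__main__":
    # Hypothetical token ids standing in for tokenizer output.
    batch = [[2, 100, 200], list(range(2000))]
    print(find_overlong_samples(batch, max_new_tokens=100))
```

If the flagged samples match the ones that always crash, truncating the prompt or lowering the generation length for them would be the next thing to try. Note that with a padded batch the effective length of every sample is the padded length, so a single long prompt in a batch of 80 could affect the whole batch.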
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Same problem here.