Environment:
pip list | grep torch
jtorch 0.1.7
torch 2.0.0
torchvision 0.15
pip list | grep jittor
jittor 1.3.8.5
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
dpkg -l | grep cudnn
ii libcudnn8 8.6.0.163-1+cuda11.8 amd64 cuDNN runtime libraries
ii libcudnn8-dev 8.6.0.163-1+cuda11.8 amd64 cuDNN development libraries and headers
Error output:
[e 0225 02:32:49.526881 64 mem_info.cc:101] appear time -> node cnt: {1:9883, }
Traceback (most recent call last):
File "/LocalRun/zhiguo.li/JittorLLMs/cli_demo.py", line 9, in <module>
model.chat()
File "/LocalRun/zhiguo.li/JittorLLMs/models/llama/__init__.py", line 111, in chat
results = self.generator.chat_completion([dialog], max_gen_len=128, temperature=0.6, top_p=0.9)
File "/LocalRun/zhiguo.li/JittorLLMs/models/llama/llama/generation2.py", line 211, in chat_completion
generation_tokens, generation_logprobs = self.generate(
File "/LocalRun/zhiguo.li/JittorLLMs/models/llama/llama/generation2.py", line 101, in generate
if all(eos_reached):
File "/opt/conda/envs/jittor/lib/python3.10/site-packages/jittor/__init__.py", line 2026, in to_bool
return ori_bool(v.item())
RuntimeError: [f 0225 02:32:49.526952 64 executor.cc:686]
Execute fused operator(214/2843) failed.
[JIT Source]: /root/.cache/jittor/jt1.3.8/g++9.4.0/py3.10.16/Linux-5.4.0-15x6b/IntelRXeonRGolx38/default/cu11.8.89_sm_86/jit/getitem__Ti_float16__IDIM_2__ODIM_3__FOV_0__VD_2__IV0_0__IO0__1__VS0_3__VST0_int32__IV1__1___hash_a8c24438e43d9846_op.cc
[OP TYPE]: getitem
[Input]: float16[32000,4096,]tok_embeddings.weight, int32[1,13,],
[Output]: float16[1,13,4096,],
[Async Backtrace]: ---
/LocalRun/zhiguo.li/JittorLLMs/cli_demo.py:9 <<module>>
/LocalRun/zhiguo.li/JittorLLMs/models/llama/__init__.py:111
/LocalRun/zhiguo.li/JittorLLMs/models/llama/llama/generation2.py:211 <chat_completion>
/LocalRun/zhiguo.li/JittorLLMs/models/llama/llama/generation2.py:76
/opt/conda/envs/jittor/lib/python3.10/site-packages/jittor/__init__.py:1172 <call>
/LocalRun/zhiguo.li/JittorLLMs/models/llama/llama/model.py:216
/opt/conda/envs/jittor/lib/python3.10/site-packages/jittor/__init__.py:1172 <call>
/opt/conda/envs/jittor/lib/python3.10/site-packages/jittor/nn.py:1751
/opt/conda/envs/jittor/lib/python3.10/site-packages/jittor/contrib.py:192
[Reason]: [f 0225 02:32:49.524565 64 helper_cuda.h:128] CUDA error at /opt/conda/envs/jittor/lib/python3.10/site-packages/jittor/src/executor.cc:639 code=700( cudaErrorIllegalAddress ) cudaDeviceSynchronize()
terminate called after throwing an instance of 'std::runtime_error'
what(): [f 0225 02:32:49.585942 64 helper_cuda.h:128] CUDA error at /opt/conda/envs/jittor/lib/python3.10/site-packages/jittor/extern/cuda/cudnn/src/cudnn_wrapper.cc:34 code=4( CUDNN_STATUS_INTERNAL_ERROR ) cudnnDestroy(cudnn_handle)
Aborted (core dumped)
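The failing op is a `getitem` that indexes `tok_embeddings.weight` (float16[32000,4096]) with an int32[1,13] token tensor. One possible cause of `cudaErrorIllegalAddress` in a lookup like this is a token id outside `[0, 32000)`, for example a negative padding id. That is only a guess; the log does not confirm it. Below is a minimal sketch of a check that could be run on the token ids before they reach the embedding layer. The helper name `find_invalid_token_ids` and the sample batch are hypothetical, not part of JittorLLMs, and the check works on plain Python lists, so a Jittor tensor would first have to be converted (e.g. via its NumPy array and `.tolist()`).

```python
# Vocabulary size taken from the [Input] line of the log:
# float16[32000,4096,] tok_embeddings.weight
VOCAB_SIZE = 32000


def find_invalid_token_ids(tokens, vocab_size=VOCAB_SIZE):
    """Return (row, col, value) for every id outside [0, vocab_size).

    `tokens` is a nested list of ints shaped [batch, seq_len].
    Hypothetical helper for debugging; not part of JittorLLMs.
    """
    bad = []
    for r, row in enumerate(tokens):
        for c, tok in enumerate(row):
            if not 0 <= tok < vocab_size:
                bad.append((r, c, tok))
    return bad


if __name__ == "__main__":
    # Made-up batch of shape [1, 13]; -1 stands in for a padding id
    # (an assumption, the real ids are not shown in the log).
    batch = [[1, 518, 25580, 29962, 15043, 518, 29914,
              25580, 29962, -1, -1, -1, -1]]
    for r, c, tok in find_invalid_token_ids(batch):
        print(f"out-of-range token id {tok} at position [{r}, {c}]")
```

If this reports nothing for the real input, the out-of-range hypothesis can be ruled out and the cause lies elsewhere (for instance in GPU memory or the generated kernel).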