RuntimeError: CUDA error: no kernel image is available for execution on the device

Open amanjain1397appy opened this issue 3 years ago • 7 comments

I am deploying a model using TorchServe. It deployed without problems on a Tesla K80 GPU, but now that I have moved it to a newer GPU, an Nvidia A30, I am getting this error.

torch_version: '1.12.0+cu113' cudnn_version: 8302

PFA Nvidia-smi Screenshot

Jul 22 '22 08:07 amanjain1397appy

Hi @amanjain1397appy, does your model also fail when you're not serving it with TorchServe? I would need more repro instructions, which you can find in the bug template we have here on GitHub: https://github.com/pytorch/serve/issues/new?assignees=&labels=&template=bug.yml

Regardless, my suspicion is that this is happening because we don't support PyTorch 1.12 in TorchServe yet. This was something @mreso had been asking for and @lxning was working on, so we'll keep you posted!

If you run this script, it should confirm which version of PyTorch you actually have installed: https://github.com/pytorch/serve/blob/master/ts_scripts/print_env_info.py

If you can't wait, feel free to update the torch version in requirements.txt here, and hopefully the issue just goes away: https://github.com/pytorch/serve/tree/master/requirements
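For readers hitting the same message: "no kernel image is available" generally means some CUDA binary in the process was not built for the GPU's compute capability. Below is a minimal sketch for checking whether the installed PyTorch build lists the device's architecture. The helper names `capability_to_arch` and `has_native_kernels` are hypothetical, made up for illustration. The check is an exact-match heuristic only, and it covers PyTorch's own kernels, not custom CUDA extensions a model may load.

```python
from typing import Iterable, Tuple


def capability_to_arch(capability: Tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def has_native_kernels(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if the build lists an exact 'sm_XY' entry for this device.

    Hypothetical helper. An exact match is a conservative test: a miss is not
    proof of failure, because a cubin for a lower minor version of the same
    major architecture, or embedded PTX, may still run on the device.
    """
    return capability_to_arch(capability) in set(arch_list)


def main() -> None:
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return

    print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("CUDA is not available to this process")
        return

    # Architectures this PyTorch binary was compiled for, e.g. ['sm_37', ...].
    arch_list = torch.cuda.get_arch_list()
    print("compiled architectures:", arch_list)

    for index in range(torch.cuda.device_count()):
        capability = torch.cuda.get_device_capability(index)
        name = torch.cuda.get_device_name(index)
        status = "listed" if has_native_kernels(capability, arch_list) else "NOT listed"
        print(f"GPU {index}: {name} ({capability_to_arch(capability)}) is {status}")


if __name__ == "__main__":
    main()
```

For reference, a Tesla K80 reports compute capability 3.7 and an A30 reports 8.0, so a binary built only for the former would be flagged on the latter.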

Jul 22 '22 16:07 msaroufim

@msaroufim My model does not fail when I am not serving it with TorchServe. As for the PyTorch version, on my first attempt I actually used the torch version from requirements.txt and still hit the same issue.
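When a model runs standalone but fails under TorchServe, one thing worth ruling out is that the worker process uses a different interpreter or torch build than the standalone run. The sketch below assumes this could be the case; nothing in the thread confirms it. `describe_environment` and `diff_environments` are hypothetical helpers. The idea is to call `describe_environment()` once in the standalone script and once from the model handler, log both, and compare.

```python
import os
import platform
import sys
from typing import Dict, Tuple


def describe_environment() -> Dict[str, str]:
    """Collect a small fingerprint of the current Python process."""
    info = {
        "executable": sys.executable,
        "python": platform.python_version(),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"),
    }
    try:
        import torch
    except ImportError:
        info["torch"] = "<not installed>"
    else:
        info["torch"] = str(torch.__version__)
        info["torch_cuda"] = str(torch.version.cuda)
    return info


def diff_environments(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return only the keys whose values differ between two fingerprints."""
    keys = sorted(set(a) | set(b))
    return {
        key: (a.get(key, "<missing>"), b.get(key, "<missing>"))
        for key in keys
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

An empty diff would suggest the two runs share the same interpreter and torch build, which points the search elsewhere, for example at the handler code or at custom CUDA extensions.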

Jul 22 '22 17:07 amanjain1397appy

Hi @amanjain1397appy, if you try torch version 1.11 without TorchServe, do you see this error?

Jul 22 '22 18:07 msaroufim

I don't get any such errors when I try torch 1.11 without TorchServe.

Jul 22 '22 18:07 amanjain1397appy

OK, cool. If you can share a repro, that'll help make things go faster. Otherwise, we'll take a look at how we run on A100 when we upgrade to 1.12.

Jul 22 '22 18:07 msaroufim

I'll try to share a repro.

Just wanted to let you know that when I try out the examples already present in the TorchServe repo, for example dcgan_fashiongen, they work absolutely fine.

Attaching this extra information just in case. Screenshot 2022-07-23 003833

Even the current model that I am having issues with was working fine on TorchServe on another machine with the following specs.

Jul 22 '22 19:07 amanjain1397appy

@msaroufim I have put together a repro and pushed it here: https://github.com/amanjain1397appy/serve-personal

Jul 22 '22 20:07 amanjain1397appy
