
java.lang.AssertionError: Unexpected nvidia-smi response

Open · jack-gits opened this issue 3 years ago · 1 comment

🐛 Describe the bug

I'm using pytorch/torchserve:0.6.0-gpu as the base for my own Docker image. When I run the image, it raises the error below:

```
Starting docker image
Warning: TorchServe is using non-default JVM parameters: -Dlog4j.configurationFile=./log4j2.xml
docker image started
java.lang.AssertionError: Unexpected nvidia-smi response.
        at org.pytorch.serve.util.ConfigManager.getAvailableGpu(ConfigManager.java:740)
        at org.pytorch.serve.util.ConfigManager.<init>(ConfigManager.java:207)
        at org.pytorch.serve.util.ConfigManager.init(ConfigManager.java:285)
        at org.pytorch.serve.ModelServer.main(ModelServer.java:83)
```

Error logs

```
Starting docker image
Warning: TorchServe is using non-default JVM parameters: -Dlog4j.configurationFile=./log4j2.xml
docker image started
java.lang.AssertionError: Unexpected nvidia-smi response.
        at org.pytorch.serve.util.ConfigManager.getAvailableGpu(ConfigManager.java:740)
        at org.pytorch.serve.util.ConfigManager.<init>(ConfigManager.java:207)
        at org.pytorch.serve.util.ConfigManager.init(ConfigManager.java:285)
        at org.pytorch.serve.ModelServer.main(ModelServer.java:83)
```

Installation instructions

The Docker image is built from this base:

```dockerfile
FROM pytorch/torchserve:0.6.0-gpu
```

Model Packaging

Below is the command used to package the model:

```shell
torch-model-archiver --force --model-name ${model_name_with_version} \
    --version ${model_version} \
    --handler model/inference-handler.py \
    --serialized-file ${model_name_pth} \
    --runtime python3 \
    --extra-files model/vocab.txt,model/bert_config.json,model/bert_tokenizer.py,model/model_config.json \
    --export-path ${model_path}
```

config.properties

No response

Versions

pytorch/torchserve:0.6.0-gpu

Repro instructions

```shell
docker run --rm --shm-size=1g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --ulimit nproc=1000 \
    --name neuron-inference-server \
    -p 8080:8080 \
    -p 8081:8081 \
    -p 8082:8082 \
    neuron-inference/inference-server:v1 \
    torchserve --start \
        --ts-config config.properties \
        --model-store model-store/ \
        --workflow-store model-store/ \
        --models classify-document-eng=document-classify-eng-v4-release.mar \
            kie-sgp-id=kie-sgp_id-v3-release-0531.mar \
            kie-sgp-id=kie-sgp_id-v1-release.mar \
            kie-sgp-passport=kie-sgp_passport-v1-release.mar
```

Possible Solution

No response

jack-gits avatar Jun 18 '22 11:06 jack-gits

Hi @jack-gits, the issue seems to be that nvidia-smi is not returning anything (https://github.com/pytorch/serve/blob/master/frontend/server/src/main/java/org/pytorch/serve/util/ConfigManager.java#L740), which can happen if your Docker container has no access to a GPU. Can you confirm this is the case by running nvidia-smi from inside your container?

Also make sure to pass the --gpus argument, which should fix this: https://github.com/pytorch/serve/blob/master/docker/README.md#start-gpu-container
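As a concrete sketch (illustrative only, not run here): the issue's repro command restarted with GPU access. Only `--gpus all` is new; the image name and ports are taken from the repro command above, and the trailing model/store flags are elided.

```shell
# Illustrative: the issue's container with GPU access enabled.
# `--gpus all` is the only addition relative to the original repro command.
docker run --rm --gpus all \
    -p 8080:8080 -p 8081:8081 -p 8082:8082 \
    neuron-inference/inference-server:v1 \
    torchserve --start --ts-config config.properties --model-store model-store/
```

Once the container can see the GPU, `nvidia-smi` run inside it should print a device table instead of nothing.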

I am wondering out loud whether it makes sense to have --gpus=all be the default behavior

msaroufim avatar Jun 20 '22 21:06 msaroufim

I also encounter this error message: java.lang.AssertionError: Unexpected nvidia-smi response. But my Docker image doesn't have a GPU at all, so it's normal that running nvidia-smi returns nothing. How can I use or install a CPU-only TorchServe build that doesn't have this problem?

ERROR REPORT: Typing 'nvidia-smi' on two different machines (neither has a GPU), the first responds with nothing and the second responds with 'command not found'. But when I start the same TorchServe model (.mar file) in the same environment with the same config, the first machine reports 'java.lang.AssertionError: Unexpected nvidia-smi response' while the second starts normally. I think this is because the detection code handles the two cases inconsistently: https://github.com/pytorch/serve/blob/master/frontend/server/src/main/java/org/pytorch/serve/util/ConfigManager.java#L793

FangXinyu-0913 avatar Oct 11 '23 08:10 FangXinyu-0913
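The asymmetry described in the comment above (a missing binary vs. a binary that prints nothing) can be sketched in shell. `probe_nvidia_smi` is a hypothetical helper for illustration, not TorchServe code; the assumption is that robust detection would distinguish the two cases instead of asserting.

```shell
#!/bin/sh
# Hedged sketch of the two CPU-only failure modes reported above:
# "absent"  -> nvidia-smi is not installed ("command not found" machine)
# "empty"   -> nvidia-smi exists but produces no usable output
# "ok"      -> at least one GPU is listed
probe_nvidia_smi() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "absent"
    elif [ -z "$(nvidia-smi --list-gpus 2>/dev/null)" ]; then
        echo "empty"
    else
        echo "ok"
    fi
}

probe_nvidia_smi
```

On both machines from the report, a check like this would map cleanly to "no GPUs available" rather than raising an assertion error.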

> I also encounter this error message: java.lang.AssertionError: Unexpected nvidia-smi response. But my Docker image doesn't have a GPU at all, so it's normal that running nvidia-smi returns nothing. How can I use or install a CPU-only TorchServe build that doesn't have this problem?
>
> ERROR REPORT: Typing 'nvidia-smi' on two different machines (neither has a GPU), the first responds with nothing and the second responds with 'command not found'. But when I start the same TorchServe model (.mar file) in the same environment with the same config, the first machine reports 'java.lang.AssertionError: Unexpected nvidia-smi response' while the second starts normally. I think this is because the detection code handles the two cases inconsistently: https://github.com/pytorch/serve/blob/master/frontend/server/src/main/java/org/pytorch/serve/util/ConfigManager.java#L793

Set the environment variable CUDA_VISIBLE_DEVICES via the -e option when you start Docker: docker run -e CUDA_VISIBLE_DEVICES=',' -it your_image. Or, in your Dockerfile, add ENV CUDA_VISIBLE_DEVICES=','.

tengzi-will avatar Mar 08 '24 08:03 tengzi-will
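Why `CUDA_VISIBLE_DEVICES=','` acts as "no GPUs" can be sketched as follows. The parsing below is an assumption about how a comma-separated device list with only empty entries degenerates to zero visible devices; it is not TorchServe's actual detection code.

```shell
#!/bin/sh
# Assumption (illustrative, not TorchServe source): the device list is treated
# as comma-separated IDs and empty entries are ignored, so ',' yields no GPUs.
CUDA_VISIBLE_DEVICES=','
count=0
old_ifs=$IFS
IFS=','
for id in $CUDA_VISIBLE_DEVICES; do
    # Only non-empty fields count as device IDs; ',' produces none.
    [ -n "$id" ] && count=$((count + 1))
done
IFS=$old_ifs
echo "visible GPUs: $count"
```

With the variable set to `','` the count is 0, so GPU detection is effectively disabled and the server starts in CPU-only mode.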