                        java.lang.AssertionError: Unexpected nvidia-smi response
🐛 Describe the bug
I'm using pytorch/torchserve:0.6.0-gpu to build my own Docker image. When I run the image, it raises the error below:
Starting docker image
Warning: TorchServe is using non-default JVM parameters: -Dlog4j.configurationFile=./log4j2.xml
docker image started
java.lang.AssertionError: Unexpected nvidia-smi response.
at org.pytorch.serve.util.ConfigManager.getAvailableGpu(ConfigManager.java:740)
at org.pytorch.serve.util.ConfigManager.
Error logs
Starting docker image
Warning: TorchServe is using non-default JVM parameters: -Dlog4j.configurationFile=./log4j2.xml
docker image started
java.lang.AssertionError: Unexpected nvidia-smi response.
at org.pytorch.serve.util.ConfigManager.getAvailableGpu(ConfigManager.java:740)
at org.pytorch.serve.util.ConfigManager.
Installation instructions
The Docker image is built from the official base image: FROM pytorch/torchserve:0.6.0-gpu
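For reference, a minimal sketch of what such a Dockerfile could look like (the COPY paths and file names are illustrative assumptions, not the reporter's actual files):

# minimal sketch of a Dockerfile built on the official GPU image
FROM pytorch/torchserve:0.6.0-gpu
# copy the packaged models and server config into the image (illustrative paths)
COPY model-store/ /home/model-server/model-store/
COPY config.properties /home/model-server/config.properties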
Model Packaging
Below is the command used to package the model:
torch-model-archiver --force --model-name ${model_name_with_version} \
    --version ${model_version} \
    --handler model/inference-handler.py \
    --serialized-file ${model_name_pth} \
    --runtime python3 \
    --extra-files model/vocab.txt,model/bert_config.json,model/bert_tokenizer.py,model/model_config.json \
    --export-path ${model_path}
config.properties
No response
Versions
pytorch/torchserve:0.6.0-gpu
Repro instructions
docker run --rm --shm-size=1g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --ulimit nproc=1000 \
    --name neuron-inference-server \
    -p 8080:8080 \
    -p 8081:8081 \
    -p 8082:8082 \
    neuron-inference/inference-server:v1

torchserve --start --ts-config config.properties --model-store model-store/ \
    --workflow-store model-store/ \
    --models classify-document-eng=document-classify-eng-v4-release.mar \
        kie-sgp-id=kie-sgp_id-v3-release-0531.mar \
        kie-sgp-id=kie-sgp_id-v1-release.mar \
        kie-sgp-passport=kie-sgp_passport-v1-release.mar
Possible Solution
No response
Hi @jack-gits, the issue seems to be that nvidia-smi is not returning anything (https://github.com/pytorch/serve/blob/master/frontend/server/src/main/java/org/pytorch/serve/util/ConfigManager.java#L740), which can happen if your Docker container has no access to a GPU. Can you confirm this is the case by running nvidia-smi from inside your container?
Make sure to also pass the --gpus argument as described here, which should fix this: https://github.com/pytorch/serve/blob/master/docker/README.md#start-gpu-container
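For example, a GPU-enabled container can be started roughly like this (a sketch assuming the NVIDIA container toolkit is installed on the host; the image name, ports, and container name are taken from this issue):

# expose all host GPUs to the container
docker run --rm -d --gpus all --name neuron-inference-server \
    -p 8080:8080 -p 8081:8081 -p 8082:8082 \
    neuron-inference/inference-server:v1
# sanity check: nvidia-smi should now list the GPUs from inside the container
docker exec -it neuron-inference-server nvidia-smi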
I am wondering out loud whether it makes sense to have --gpus=all be the default behavior
I also encounter this error message: java.lang.AssertionError: Unexpected nvidia-smi response. But my Docker image doesn't have a GPU at all, so it is expected that running nvidia-smi returns nothing. How can I use or install a CPU-only TorchServe version that doesn't have this problem?
ERROR REPORT: I ran nvidia-smi on two different machines (neither has a GPU); the first one returns nothing and the second one returns 'command not found'. But when I start the same TorchServe model (.mar file) in the same environment with the same config, the first machine reports 'java.lang.AssertionError: Unexpected nvidia-smi response' while the second one starts normally. I think this is caused by how this code handles the two cases: https://github.com/pytorch/serve/blob/master/frontend/server/src/main/java/org/pytorch/serve/util/ConfigManager.java#L793
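A quick way to check which of the two situations a machine is in (this just restates the observation above, not TorchServe's internal logic):

# machine with the NVIDIA driver present but no usable GPU: the command exists,
# prints nothing useful, and TorchServe raises the AssertionError
nvidia-smi; echo "exit status: $?"

# machine without the driver at all: the shell reports 'command not found'
# and TorchServe falls back to CPU mode and starts normally
nvidia-smi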
Set the environment variable CUDA_VISIBLE_DEVICES via the -e option when you start Docker: docker run -e CUDA_VISIBLE_DEVICES=',' -it your_image, or add ENV CUDA_VISIBLE_DEVICES=',' to your Dockerfile.
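Applied to the docker run command from the repro instructions above, the workaround would look roughly like this (a sketch based on the command in this issue):

docker run --rm --shm-size=1g \
    -e CUDA_VISIBLE_DEVICES=',' \
    --name neuron-inference-server \
    -p 8080:8080 -p 8081:8081 -p 8082:8082 \
    neuron-inference/inference-server:v1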