Using vllm runtime generates "unrecognized arguments" error
An attempt to use the vllm runtime generates the following error:
error: unrecognized arguments: llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file
Full log:
$ ramalama --debug --runtime vllm run llama3.2
exec_cmd: podman run --rm -i --label RAMALAMA --security-opt=label=disable --name ramalama_PNB6UFIqIM --pull=newer -t --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 --mount=type=bind,src=/home/dw/.local/share/ramalama/models/ollama/llama3.2:latest,destination=/mnt/models/model.file,ro quay.io/modh/vllm:rhoai-2.17-cuda llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file
usage: __main__.py [-h] [--host HOST] [--port PORT] [--uvicorn-log-level {debug,info,warning,error,critical,trace}] [--allow-credentials] [--allowed-origins ALLOWED_ORIGINS] [--allowed-methods ALLOWED_METHODS]
[--allowed-headers ALLOWED_HEADERS] [--api-key API_KEY] [--lora-modules LORA_MODULES [LORA_MODULES ...]] [--prompt-adapters PROMPT_ADAPTERS [PROMPT_ADAPTERS ...]] [--chat-template CHAT_TEMPLATE]
[--chat-template-content-format {auto,string,openai}] [--response-role RESPONSE_ROLE] [--ssl-keyfile SSL_KEYFILE] [--ssl-certfile SSL_CERTFILE] [--ssl-ca-certs SSL_CA_CERTS] [--ssl-cert-reqs SSL_CERT_REQS]
[--root-path ROOT_PATH] [--middleware MIDDLEWARE] [--return-tokens-as-token-ids] [--disable-frontend-multiprocessing] [--enable-request-id-headers] [--enable-auto-tool-choice]
[--tool-call-parser {granite-20b-fc,granite,hermes,internlm,jamba,llama3_json,mistral,pythonic} or name registered in --tool-parser-plugin] [--tool-parser-plugin TOOL_PARSER_PLUGIN] [--model MODEL]
[--task {auto,generate,embedding,embed,classify,score,reward}] [--tokenizer TOKENIZER] [--skip-tokenizer-init] [--revision REVISION] [--code-revision CODE_REVISION] [--tokenizer-revision TOKENIZER_REVISION]
[--tokenizer-mode {auto,slow,mistral}] [--trust-remote-code] [--allowed-local-media-path ALLOWED_LOCAL_MEDIA_PATH] [--download-dir DOWNLOAD_DIR]
[--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer}] [--config-format {auto,hf,mistral}] [--dtype {auto,half,float16,bfloat16,float,float32}]
[--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}] [--quantization-param-path QUANTIZATION_PARAM_PATH] [--max-model-len MAX_MODEL_LEN] [--guided-decoding-backend {outlines,lm-format-enforcer,xgrammar}]
[--logits-processor-pattern LOGITS_PROCESSOR_PATTERN] [--distributed-executor-backend {ray,mp}] [--worker-use-ray] [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE] [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
[--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS] [--ray-workers-use-nsight] [--block-size {8,16,32,64,128}] [--enable-prefix-caching | --no-enable-prefix-caching] [--disable-sliding-window]
[--use-v2-block-manager] [--num-lookahead-slots NUM_LOOKAHEAD_SLOTS] [--seed SEED] [--swap-space SWAP_SPACE] [--cpu-offload-gb CPU_OFFLOAD_GB] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
[--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE] [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS] [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS] [--disable-log-stats]
[--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,None}]
[--rope-scaling ROPE_SCALING] [--rope-theta ROPE_THETA] [--hf-overrides HF_OVERRIDES] [--enforce-eager] [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE] [--disable-custom-all-reduce]
[--tokenizer-pool-size TOKENIZER_POOL_SIZE] [--tokenizer-pool-type TOKENIZER_POOL_TYPE] [--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG] [--limit-mm-per-prompt LIMIT_MM_PER_PROMPT]
[--mm-processor-kwargs MM_PROCESSOR_KWARGS] [--disable-mm-preprocessor-cache] [--enable-lora] [--enable-lora-bias] [--max-loras MAX_LORAS] [--max-lora-rank MAX_LORA_RANK]
[--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE] [--lora-dtype {auto,float16,bfloat16}] [--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS] [--max-cpu-loras MAX_CPU_LORAS] [--fully-sharded-loras]
[--enable-prompt-adapter] [--max-prompt-adapters MAX_PROMPT_ADAPTERS] [--max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN] [--device {auto,cuda,neuron,cpu,openvino,tpu,xpu,hpu}]
[--num-scheduler-steps NUM_SCHEDULER_STEPS] [--multi-step-stream-outputs [MULTI_STEP_STREAM_OUTPUTS]] [--scheduler-delay-factor SCHEDULER_DELAY_FACTOR] [--enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]]
[--speculative-model SPECULATIVE_MODEL]
[--speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,None}]
[--num-speculative-tokens NUM_SPECULATIVE_TOKENS] [--speculative-disable-mqa-scorer] [--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
[--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN] [--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE] [--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
[--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN] [--spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}]
[--typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD] [--typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA]
[--disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]] [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG] [--ignore-patterns IGNORE_PATTERNS] [--preemption-mode PREEMPTION_MODE]
[--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]] [--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH] [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
[--collect-detailed-traces COLLECT_DETAILED_TRACES] [--disable-async-output-proc] [--scheduling-policy {fcfs,priority}] [--override-neuron-config OVERRIDE_NEURON_CONFIG]
[--override-pooler-config OVERRIDE_POOLER_CONFIG] [--compilation-config COMPILATION_CONFIG] [--kv-transfer-config KV_TRANSFER_CONFIG] [--worker-cls WORKER_CLS] [--generation-config GENERATION_CONFIG]
[--disable-log-requests] [--max-log-len MAX_LOG_LEN] [--disable-fastapi-docs] [--enable-prompt-tokens-details] [--model-name MODEL_NAME] [--max-sequence-length MAX_SEQUENCE_LENGTH] [--max-new-tokens MAX_NEW_TOKENS]
[--max-batch-size MAX_BATCH_SIZE] [--max-concurrent-requests MAX_CONCURRENT_REQUESTS] [--dtype-str DTYPE_STR] [--quantize {awq,gptq,squeezellm,None}] [--num-gpus NUM_GPUS] [--num-shard NUM_SHARD]
[--output-special-tokens OUTPUT_SPECIAL_TOKENS] [--default-include-stop-seqs DEFAULT_INCLUDE_STOP_SEQS] [--grpc-port GRPC_PORT] [--tls-cert-path TLS_CERT_PATH] [--tls-key-path TLS_KEY_PATH]
[--tls-client-ca-cert-path TLS_CLIENT_CA_CERT_PATH] [--adapter-cache ADAPTER_CACHE] [--prefix-store-path PREFIX_STORE_PATH] [--speculator-name SPECULATOR_NAME] [--speculator-n-candidates SPECULATOR_N_CANDIDATES]
[--speculator-max-batch-size SPECULATOR_MAX_BATCH_SIZE] [--enable-vllm-log-requests ENABLE_VLLM_LOG_REQUESTS] [--disable-prompt-logprobs DISABLE_PROMPT_LOGPROBS]
__main__.py: error: unrecognized arguments: llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file
$ rpm -qv podman
podman-5.3.1-1.fc41.x86_64
$ rpm -qv python3-ramalama
python3-ramalama-0.5.5-1.fc41.noarch
$ rpm -qv golang-github-nvidia-container-toolkit
golang-github-nvidia-container-toolkit-1.16.2-1.fc41.x86_64
$ nvidia-ctk cdi list
INFO[0000] Found 3 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=GPU-9282fe1f-02bd-d793-11a8-5341a0858e3b
nvidia.com/gpu=all
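For context: the usage text above is printed by the container's entrypoint, an OpenAI-compatible vLLM API server (__main__.py), so it treats the llama-run command line that ramalama appends as unrecognized arguments. Purely as an illustrative sketch (image and mount path taken from the log above, flags taken from the usage text, port 8000 is vLLM's usual default; not a verified workaround, and whether this particular GGUF blob actually loads in vLLM is a separate question), a direct invocation of that entrypoint would look more like:
$ podman run --rm -t --device nvidia.com/gpu=all \
    --mount=type=bind,src=/home/dw/.local/share/ramalama/models/ollama/llama3.2:latest,destination=/mnt/models/model.file,ro \
    quay.io/modh/vllm:rhoai-2.17-cuda \
    --model /mnt/models/model.file --served-model-name llama3.2 --port 8000 --max-model-len 2048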
Yes, vllm can only do serve at this point.
I'm not even sure how well that works either.
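If serve is the only supported path for the vllm runtime, a minimal sketch of the workaround (assuming the serve subcommand, ramalama's default port 8080 seen later in this thread, and vLLM's usual OpenAI-compatible routes):
$ ramalama --runtime vllm serve llama3.2
$ curl http://127.0.0.1:8080/v1/models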
It works for me. I installed ramalama from the latest source code.
$ rpm -q podman
podman-5.3.1-1.fc41.x86_64
$ ramalama version
ramalama version 0.7.5
$ ramalama --debug --runtime vllm run llama3.2
exec_cmd: podman run --rm --label ai.ramalama.model=llama3.2 --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.command=run --device /dev/dri --network none --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer --env "LLAMA_PROMPT_PREFIX=🦭 > " -t -i --label ai.ramalama --name ramalama_nYEXaTkuU6 --env=HOME=/tmp --init --label ai.ramalama.model=llama3.2 --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.command=run --mount=type=bind,src=/home/zguo/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/zguo/.local/share/ramalama/store/ollama/llama3.2/llama3.2/snapshots/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff/chat_template,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/ramalama:0.7 llama-run --jinja -c 2048 --temp 0.8 -v --threads 4 /mnt/models/model.file
Loading model
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /mnt/models/model.file (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
@dwrobel still have this issue?
Running it as originally reported still doesn't work:
$ ramalama --debug --runtime vllm run llama3.2
2025-07-29 08:07:30 - DEBUG - run_cmd: nvidia-smi
2025-07-29 08:07:30 - DEBUG - Working directory: None
2025-07-29 08:07:30 - DEBUG - Ignore stderr: False
2025-07-29 08:07:30 - DEBUG - Ignore all: False
2025-07-29 08:07:30 - DEBUG - Command finished with return code: 0
2025-07-29 08:07:30 - DEBUG - run_cmd: podman inspect quay.io/ramalama/cuda:0.11
2025-07-29 08:07:30 - DEBUG - Working directory: None
2025-07-29 08:07:30 - DEBUG - Ignore stderr: False
2025-07-29 08:07:30 - DEBUG - Ignore all: True
2025-07-29 08:07:30 - DEBUG - Checking if 8080 is available
2025-07-29 08:07:30 - DEBUG - Checking if 8080 is available
2025-07-29 08:07:30 - DEBUG - exec_cmd: podman run --rm --label ai.ramalama.model=llama3.2 --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.port=8080 --label ai.ramalama.command=run --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -d -i --label ai.ramalama --name ramalama_ECGGx6XfLh --env=HOME=/tmp --init --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff,destination=/mnt/models/llama3.2,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-34bb5ab01051a11372a91f95f3fbbc51173eed8e7f13ec395b9ae9b8bd0e242b,destination=/mnt/models/config.json,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-966de95ca8a62200913e3f8bfbf84c8494536f1b94b49166851e76644e966396,destination=/mnt/models/chat_template,ro quay.io/ramalama/cuda:latest --model /mnt/models/llama3.2 --port 8080 --max-sequence-length 2048 --max_model_len 2048 --served-model-name llama3.2
Error: crun: cannot stat `/usr/lib64/libEGL_nvidia.so.565.77`: No such file or directory: OCI runtime attempted to invoke a command that was not found
Error: Failed to serve model llama3.2, for ramalama run command
Trying to disable the GPU also didn't help:
$ ramalama --debug --runtime vllm run --ngl 0 llama3.2
2025-07-29 08:07:43 - DEBUG - run_cmd: nvidia-smi
2025-07-29 08:07:43 - DEBUG - Working directory: None
2025-07-29 08:07:43 - DEBUG - Ignore stderr: False
2025-07-29 08:07:43 - DEBUG - Ignore all: False
2025-07-29 08:07:43 - DEBUG - Command finished with return code: 0
2025-07-29 08:07:43 - DEBUG - run_cmd: podman inspect quay.io/ramalama/cuda:0.11
2025-07-29 08:07:43 - DEBUG - Working directory: None
2025-07-29 08:07:43 - DEBUG - Ignore stderr: False
2025-07-29 08:07:43 - DEBUG - Ignore all: True
2025-07-29 08:07:44 - DEBUG - Checking if 8080 is available
2025-07-29 08:07:44 - DEBUG - Checking if 8080 is available
2025-07-29 08:07:44 - DEBUG - exec_cmd: podman run --rm --label ai.ramalama.model=llama3.2 --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.port=8080 --label ai.ramalama.command=run --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -d -i --label ai.ramalama --name ramalama_jNdHcLZrwn --env=HOME=/tmp --init --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff,destination=/mnt/models/llama3.2,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-34bb5ab01051a11372a91f95f3fbbc51173eed8e7f13ec395b9ae9b8bd0e242b,destination=/mnt/models/config.json,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-966de95ca8a62200913e3f8bfbf84c8494536f1b94b49166851e76644e966396,destination=/mnt/models/chat_template,ro quay.io/ramalama/cuda:latest --model /mnt/models/llama3.2 --port 8080 --max-sequence-length 2048 --max_model_len 2048 --served-model-name llama3.2
Error: crun: cannot stat `/usr/lib64/libEGL_nvidia.so.565.77`: No such file or directory: OCI runtime attempted to invoke a command that was not found
Error: Failed to serve model llama3.2, for ramalama run command
$ ls -ls /usr/lib64/libEGL_nvidia.so*
4 lrwxrwxrwx 1 root root 23 Jun 17 02:00 /usr/lib64/libEGL_nvidia.so.0 -> libEGL_nvidia.so.575.64
1328 -rwxr-xr-x 1 root root 1358016 Jun 10 20:40 /usr/lib64/libEGL_nvidia.so.575.64
Neither did the variants below (each produces the same error):
ramalama --debug --runtime vllm run --ngl 0 --device /dev/dri/renderD128 llama3.2
ramalama --debug --runtime vllm run --ngl 0 --device /dev/dri/renderD129 llama3.2
ramalama --debug --runtime vllm run --ngl 0 --image=quay.io/ramalama/intel-gpu llama3.2
It looks like a different issue, as I'm running it on different hardware than originally reported.
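One thing that stands out (an assumption on my part, not a confirmed diagnosis): the crun error references libEGL_nvidia.so.565.77 while the installed library is 575.64, which suggests a stale CDI spec left over from the previous driver. Regenerating it with nvidia-ctk and re-listing the devices might clear the error:
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ nvidia-ctk cdi list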
Can you try again with 0.12.1?
Usually the mismatch is caused by a screwed-up CUDA library install.
Can you try again with 0.12.1?
$ rpm -qv ramalama
ramalama-0.12.1-1.fc42.noarch
$ ramalama --debug --runtime vllm run llama3.2
2025-09-12 21:50:30 - DEBUG - run_cmd: nvidia-smi
2025-09-12 21:50:30 - DEBUG - Working directory: None
2025-09-12 21:50:30 - DEBUG - Ignore stderr: False
2025-09-12 21:50:30 - DEBUG - Ignore all: False
2025-09-12 21:50:31 - DEBUG - Command finished with return code: 0
2025-09-12 21:50:31 - DEBUG - run_cmd: podman inspect quay.io/ramalama/cuda:0.12
2025-09-12 21:50:31 - DEBUG - Working directory: None
2025-09-12 21:50:31 - DEBUG - Ignore stderr: False
2025-09-12 21:50:31 - DEBUG - Ignore all: True
2025-09-12 21:50:31 - DEBUG - Checking if 8080 is available
2025-09-12 21:50:31 - DEBUG - Checking if 8080 is available
2025-09-12 21:50:31 - DEBUG - exec_cmd: podman run --rm --label ai.ramalama.model=ollama://library/llama3.2:latest --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.port=8080 --label ai.ramalama.command=run --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 --runtime /usr/bin/nvidia-container-runtime -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -d -i --label ai.ramalama --name ramalama_hOiE8ftjHc --env=HOME=/tmp --init --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/library/llama3.2/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff,destination=/mnt/models/llama3.2,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/library/llama3.2/blobs/sha256-34bb5ab01051a11372a91f95f3fbbc51173eed8e7f13ec395b9ae9b8bd0e242b,destination=/mnt/models/config.json,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/library/llama3.2/blobs/sha256-966de95ca8a62200913e3f8bfbf84c8494536f1b94b49166851e76644e966396,destination=/mnt/models/chat_template,ro quay.io/ramalama/cuda:latest --model /mnt/models/llama3.2 --port 8080 --max-sequence-length 2048 --max_model_len 2048 --served-model-name llama3.2
9eeacd9b0ef0b7b2777c2f87bb358060c0084e02d2120e1b1370159ea0e3fe66
2025-09-12 21:50:32 - DEBUG - Waiting for container ramalama_hOiE8ftjHc to become healthy (timeout: 20s)...
send: b'GET /models HTTP/1.1\r\nHost: 127.0.0.1:8080\r\nAccept-Encoding: identity\r\n\r\n'
2025-09-12 21:50:32 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 104] Connection reset by peer
2025-09-12 21:50:33 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:34 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:35 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:36 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:37 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:38 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:39 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:40 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:41 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:42 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:43 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:44 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:45 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:46 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:47 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:48 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:49 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:50 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:51 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:52 - DEBUG - run_cmd: podman logs ramalama_hOiE8ftjHc
2025-09-12 21:50:52 - DEBUG - Working directory: None
2025-09-12 21:50:52 - DEBUG - Ignore stderr: False
2025-09-12 21:50:52 - DEBUG - Ignore all: False
Error: no container with name or ID "ramalama_hOiE8ftjHc" found: no such container
Error: Command '['podman', 'logs', 'ramalama_hOiE8ftjHc']' returned non-zero exit status 125.
It's on F42 with the NVIDIA drivers from RPM Fusion and nvidia-container-toolkit from https://copr.fedorainfracloud.org/coprs/g/ai-ml/nvidia-container-toolkit/.
Grab 0.12.2 and try serve to get the error the service is hitting.
Here it comes:
$ rpm -qv ramalama
ramalama-0.12.2-1.fc42.noarch
$ ramalama --debug --runtime vllm run llama3.2
2025-09-18 09:16:38 - DEBUG - run_cmd: nvidia-smi
2025-09-18 09:16:38 - DEBUG - Working directory: None
2025-09-18 09:16:38 - DEBUG - Ignore stderr: False
2025-09-18 09:16:38 - DEBUG - Ignore all: False
2025-09-18 09:16:39 - DEBUG - Command finished with return code: 0
2025-09-18 09:16:39 - DEBUG - run_cmd: podman inspect quay.io/ramalama/cuda:0.12
2025-09-18 09:16:39 - DEBUG - Working directory: None
2025-09-18 09:16:39 - DEBUG - Ignore stderr: False
2025-09-18 09:16:39 - DEBUG - Ignore all: True
2025-09-18 09:16:39 - DEBUG - Checking if 8080 is available
2025-09-18 09:16:39 - DEBUG - Checking if 8080 is available
2025-09-18 09:16:39 - DEBUG - exec_cmd: podman run --rm --label ai.ramalama.model=ollama://library/llama3.2:latest --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.port=8080 --label ai.ramalama.command=run --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 --runtime /usr/bin/nvidia-container-runtime -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -d -i --label ai.ramalama --name ramalama_eNLXYd7owQ --env=HOME=/tmp --init --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/library/llama3.2/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff,destination=/mnt/models/llama3.2,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/library/llama3.2/blobs/sha256-34bb5ab01051a11372a91f95f3fbbc51173eed8e7f13ec395b9ae9b8bd0e242b,destination=/mnt/models/config.json,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/library/llama3.2/blobs/sha256-966de95ca8a62200913e3f8bfbf84c8494536f1b94b49166851e76644e966396,destination=/mnt/models/chat_template,ro quay.io/ramalama/cuda:latest --model /mnt/models/llama3.2 --port 8080 --max-sequence-length 0 --max_model_len 2048 --served-model-name llama3.2
eab6469c5771818d1572ad9ab97a7253775d0754de2e98aadd731da0c84ca642
2025-09-18 09:16:41 - DEBUG - Waiting for container ramalama_eNLXYd7owQ to become healthy (timeout: 20s)...
send: b'GET /models HTTP/1.1\r\nHost: 127.0.0.1:8080\r\nAccept-Encoding: identity\r\n\r\n'
2025-09-18 09:16:41 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 104] Connection reset by peer
2025-09-18 09:16:42 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:43 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:44 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:45 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:46 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:47 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:48 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:49 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:50 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:51 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:52 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:53 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:54 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:55 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:56 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:57 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:58 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:59 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:17:00 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:17:01 - DEBUG - run_cmd: podman logs ramalama_eNLXYd7owQ
2025-09-18 09:17:01 - DEBUG - Working directory: None
2025-09-18 09:17:01 - DEBUG - Ignore stderr: False
2025-09-18 09:17:01 - DEBUG - Ignore all: False
Error: no container with name or ID "ramalama_eNLXYd7owQ" found: no such container
Error: Command '['podman', 'logs', 'ramalama_eNLXYd7owQ']' returned non-zero exit status 125.
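Note that ramalama launches the container with --rm and -d, so by the time the health check gives up and podman logs runs, the crashed container has already been removed (hence the "no such container" error above). A hedged way to catch the service's own error output (the container name below is just the one from the log above; substitute whatever name your run prints) is to tail the logs while the container is still up:
$ ramalama --debug --runtime vllm serve llama3.2
# in a second terminal, before the 20 s health-check timeout expires:
$ podman ps --filter label=ai.ramalama --format '{{.Names}}'
$ podman logs -f ramalama_eNLXYd7owQ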