Using vllm runtime generates "unrecognized arguments" error
An attempt to use the vllm runtime generates the following error:
error: unrecognized arguments: llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file
Full log:
$ ramalama --debug --runtime vllm run llama3.2
exec_cmd: podman run --rm -i --label RAMALAMA --security-opt=label=disable --name ramalama_PNB6UFIqIM --pull=newer -t --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 --mount=type=bind,src=/home/dw/.local/share/ramalama/models/ollama/llama3.2:latest,destination=/mnt/models/model.file,ro quay.io/modh/vllm:rhoai-2.17-cuda llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file
usage: __main__.py [-h] [--host HOST] [--port PORT] [--uvicorn-log-level {debug,info,warning,error,critical,trace}] [--allow-credentials] [--allowed-origins ALLOWED_ORIGINS] [--allowed-methods ALLOWED_METHODS]
[--allowed-headers ALLOWED_HEADERS] [--api-key API_KEY] [--lora-modules LORA_MODULES [LORA_MODULES ...]] [--prompt-adapters PROMPT_ADAPTERS [PROMPT_ADAPTERS ...]] [--chat-template CHAT_TEMPLATE]
[--chat-template-content-format {auto,string,openai}] [--response-role RESPONSE_ROLE] [--ssl-keyfile SSL_KEYFILE] [--ssl-certfile SSL_CERTFILE] [--ssl-ca-certs SSL_CA_CERTS] [--ssl-cert-reqs SSL_CERT_REQS]
[--root-path ROOT_PATH] [--middleware MIDDLEWARE] [--return-tokens-as-token-ids] [--disable-frontend-multiprocessing] [--enable-request-id-headers] [--enable-auto-tool-choice]
[--tool-call-parser {granite-20b-fc,granite,hermes,internlm,jamba,llama3_json,mistral,pythonic} or name registered in --tool-parser-plugin] [--tool-parser-plugin TOOL_PARSER_PLUGIN] [--model MODEL]
[--task {auto,generate,embedding,embed,classify,score,reward}] [--tokenizer TOKENIZER] [--skip-tokenizer-init] [--revision REVISION] [--code-revision CODE_REVISION] [--tokenizer-revision TOKENIZER_REVISION]
[--tokenizer-mode {auto,slow,mistral}] [--trust-remote-code] [--allowed-local-media-path ALLOWED_LOCAL_MEDIA_PATH] [--download-dir DOWNLOAD_DIR]
[--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer}] [--config-format {auto,hf,mistral}] [--dtype {auto,half,float16,bfloat16,float,float32}]
[--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}] [--quantization-param-path QUANTIZATION_PARAM_PATH] [--max-model-len MAX_MODEL_LEN] [--guided-decoding-backend {outlines,lm-format-enforcer,xgrammar}]
[--logits-processor-pattern LOGITS_PROCESSOR_PATTERN] [--distributed-executor-backend {ray,mp}] [--worker-use-ray] [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE] [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
[--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS] [--ray-workers-use-nsight] [--block-size {8,16,32,64,128}] [--enable-prefix-caching | --no-enable-prefix-caching] [--disable-sliding-window]
[--use-v2-block-manager] [--num-lookahead-slots NUM_LOOKAHEAD_SLOTS] [--seed SEED] [--swap-space SWAP_SPACE] [--cpu-offload-gb CPU_OFFLOAD_GB] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
[--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE] [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS] [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS] [--disable-log-stats]
[--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,None}]
[--rope-scaling ROPE_SCALING] [--rope-theta ROPE_THETA] [--hf-overrides HF_OVERRIDES] [--enforce-eager] [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE] [--disable-custom-all-reduce]
[--tokenizer-pool-size TOKENIZER_POOL_SIZE] [--tokenizer-pool-type TOKENIZER_POOL_TYPE] [--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG] [--limit-mm-per-prompt LIMIT_MM_PER_PROMPT]
[--mm-processor-kwargs MM_PROCESSOR_KWARGS] [--disable-mm-preprocessor-cache] [--enable-lora] [--enable-lora-bias] [--max-loras MAX_LORAS] [--max-lora-rank MAX_LORA_RANK]
[--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE] [--lora-dtype {auto,float16,bfloat16}] [--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS] [--max-cpu-loras MAX_CPU_LORAS] [--fully-sharded-loras]
[--enable-prompt-adapter] [--max-prompt-adapters MAX_PROMPT_ADAPTERS] [--max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN] [--device {auto,cuda,neuron,cpu,openvino,tpu,xpu,hpu}]
[--num-scheduler-steps NUM_SCHEDULER_STEPS] [--multi-step-stream-outputs [MULTI_STEP_STREAM_OUTPUTS]] [--scheduler-delay-factor SCHEDULER_DELAY_FACTOR] [--enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]]
[--speculative-model SPECULATIVE_MODEL]
[--speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,None}]
[--num-speculative-tokens NUM_SPECULATIVE_TOKENS] [--speculative-disable-mqa-scorer] [--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
[--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN] [--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE] [--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
[--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN] [--spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}]
[--typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD] [--typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA]
[--disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]] [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG] [--ignore-patterns IGNORE_PATTERNS] [--preemption-mode PREEMPTION_MODE]
[--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]] [--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH] [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
[--collect-detailed-traces COLLECT_DETAILED_TRACES] [--disable-async-output-proc] [--scheduling-policy {fcfs,priority}] [--override-neuron-config OVERRIDE_NEURON_CONFIG]
[--override-pooler-config OVERRIDE_POOLER_CONFIG] [--compilation-config COMPILATION_CONFIG] [--kv-transfer-config KV_TRANSFER_CONFIG] [--worker-cls WORKER_CLS] [--generation-config GENERATION_CONFIG]
[--disable-log-requests] [--max-log-len MAX_LOG_LEN] [--disable-fastapi-docs] [--enable-prompt-tokens-details] [--model-name MODEL_NAME] [--max-sequence-length MAX_SEQUENCE_LENGTH] [--max-new-tokens MAX_NEW_TOKENS]
[--max-batch-size MAX_BATCH_SIZE] [--max-concurrent-requests MAX_CONCURRENT_REQUESTS] [--dtype-str DTYPE_STR] [--quantize {awq,gptq,squeezellm,None}] [--num-gpus NUM_GPUS] [--num-shard NUM_SHARD]
[--output-special-tokens OUTPUT_SPECIAL_TOKENS] [--default-include-stop-seqs DEFAULT_INCLUDE_STOP_SEQS] [--grpc-port GRPC_PORT] [--tls-cert-path TLS_CERT_PATH] [--tls-key-path TLS_KEY_PATH]
[--tls-client-ca-cert-path TLS_CLIENT_CA_CERT_PATH] [--adapter-cache ADAPTER_CACHE] [--prefix-store-path PREFIX_STORE_PATH] [--speculator-name SPECULATOR_NAME] [--speculator-n-candidates SPECULATOR_N_CANDIDATES]
[--speculator-max-batch-size SPECULATOR_MAX_BATCH_SIZE] [--enable-vllm-log-requests ENABLE_VLLM_LOG_REQUESTS] [--disable-prompt-logprobs DISABLE_PROMPT_LOGPROBS]
__main__.py: error: unrecognized arguments: llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file
$ rpm -qv podman
podman-5.3.1-1.fc41.x86_64
$ rpm -qv python3-ramalama
python3-ramalama-0.5.5-1.fc41.noarch
$ rpm -qv golang-github-nvidia-container-toolkit
golang-github-nvidia-container-toolkit-1.16.2-1.fc41.x86_64
$ nvidia-ctk cdi list
INFO[0000] Found 3 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=GPU-9282fe1f-02bd-d793-11a8-5341a0858e3b
nvidia.com/gpu=all
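For context: the usage text above is printed by the container's entrypoint, an OpenAI-compatible vLLM API server (__main__.py), so it treats the llama-run command line that ramalama appends as unrecognized arguments. Purely as an illustrative sketch (image and mount path taken from the log above, flags taken from the usage text, port 8000 is vLLM's usual default; not a verified workaround, and whether this particular GGUF blob actually loads in vLLM is a separate question), a direct invocation of that entrypoint would look more like:
$ podman run --rm -t --device nvidia.com/gpu=all \
    --mount=type=bind,src=/home/dw/.local/share/ramalama/models/ollama/llama3.2:latest,destination=/mnt/models/model.file,ro \
    quay.io/modh/vllm:rhoai-2.17-cuda \
    --model /mnt/models/model.file --served-model-name llama3.2 --port 8000 --max-model-len 2048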
Yes, vllm can only do serve at this point.
I'm not even sure how well that works either.
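If serve is the only supported path for the vllm runtime, a minimal sketch of the workaround (assuming the serve subcommand, ramalama's default port 8080 seen later in this thread, and vLLM's usual OpenAI-compatible routes):
$ ramalama --runtime vllm serve llama3.2
$ curl http://127.0.0.1:8080/v1/models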
It works for me. I installed ramalama from the latest source code.
$ rpm -q podman
podman-5.3.1-1.fc41.x86_64
$ ramalama version
ramalama version 0.7.5
$ ramalama --debug --runtime vllm run llama3.2
exec_cmd: podman run --rm --label ai.ramalama.model=llama3.2 --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.command=run --device /dev/dri --network none --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer --env "LLAMA_PROMPT_PREFIX=🦭 > " -t -i --label ai.ramalama --name ramalama_nYEXaTkuU6 --env=HOME=/tmp --init --label ai.ramalama.model=llama3.2 --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.command=run --mount=type=bind,src=/home/zguo/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff,destination=/mnt/models/model.file,ro --mount=type=bind,src=/home/zguo/.local/share/ramalama/store/ollama/llama3.2/llama3.2/snapshots/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff/chat_template,destination=/mnt/models/chat_template.file,ro quay.io/ramalama/ramalama:0.7 llama-run --jinja -c 2048 --temp 0.8 -v --threads 4 /mnt/models/model.file
Loading model
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /mnt/models/model.file (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.2 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 7: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 15
llama_model_loader: - kv 19: llama.vocab_size u32 = 128256
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 28: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
@dwrobel still have this issue?
Running it as originally reported still doesn't work:
$ ramalama --debug --runtime vllm run llama3.2
2025-07-29 08:07:30 - DEBUG - run_cmd: nvidia-smi
2025-07-29 08:07:30 - DEBUG - Working directory: None
2025-07-29 08:07:30 - DEBUG - Ignore stderr: False
2025-07-29 08:07:30 - DEBUG - Ignore all: False
2025-07-29 08:07:30 - DEBUG - Command finished with return code: 0
2025-07-29 08:07:30 - DEBUG - run_cmd: podman inspect quay.io/ramalama/cuda:0.11
2025-07-29 08:07:30 - DEBUG - Working directory: None
2025-07-29 08:07:30 - DEBUG - Ignore stderr: False
2025-07-29 08:07:30 - DEBUG - Ignore all: True
2025-07-29 08:07:30 - DEBUG - Checking if 8080 is available
2025-07-29 08:07:30 - DEBUG - Checking if 8080 is available
2025-07-29 08:07:30 - DEBUG - exec_cmd: podman run --rm --label ai.ramalama.model=llama3.2 --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.port=8080 --label ai.ramalama.command=run --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -d -i --label ai.ramalama --name ramalama_ECGGx6XfLh --env=HOME=/tmp --init --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff,destination=/mnt/models/llama3.2,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-34bb5ab01051a11372a91f95f3fbbc51173eed8e7f13ec395b9ae9b8bd0e242b,destination=/mnt/models/config.json,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-966de95ca8a62200913e3f8bfbf84c8494536f1b94b49166851e76644e966396,destination=/mnt/models/chat_template,ro quay.io/ramalama/cuda:latest --model /mnt/models/llama3.2 --port 8080 --max-sequence-length 2048 --max_model_len 2048 --served-model-name llama3.2
Error: crun: cannot stat `/usr/lib64/libEGL_nvidia.so.565.77`: No such file or directory: OCI runtime attempted to invoke a command that was not found
Error: Failed to serve model llama3.2, for ramalama run command
Trying to disable the GPU also didn't help:
$ ramalama --debug --runtime vllm run --ngl 0 llama3.2
2025-07-29 08:07:43 - DEBUG - run_cmd: nvidia-smi
2025-07-29 08:07:43 - DEBUG - Working directory: None
2025-07-29 08:07:43 - DEBUG - Ignore stderr: False
2025-07-29 08:07:43 - DEBUG - Ignore all: False
2025-07-29 08:07:43 - DEBUG - Command finished with return code: 0
2025-07-29 08:07:43 - DEBUG - run_cmd: podman inspect quay.io/ramalama/cuda:0.11
2025-07-29 08:07:43 - DEBUG - Working directory: None
2025-07-29 08:07:43 - DEBUG - Ignore stderr: False
2025-07-29 08:07:43 - DEBUG - Ignore all: True
2025-07-29 08:07:44 - DEBUG - Checking if 8080 is available
2025-07-29 08:07:44 - DEBUG - Checking if 8080 is available
2025-07-29 08:07:44 - DEBUG - exec_cmd: podman run --rm --label ai.ramalama.model=llama3.2 --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.port=8080 --label ai.ramalama.command=run --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -d -i --label ai.ramalama --name ramalama_jNdHcLZrwn --env=HOME=/tmp --init --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff,destination=/mnt/models/llama3.2,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-34bb5ab01051a11372a91f95f3fbbc51173eed8e7f13ec395b9ae9b8bd0e242b,destination=/mnt/models/config.json,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/llama3.2/llama3.2/blobs/sha256-966de95ca8a62200913e3f8bfbf84c8494536f1b94b49166851e76644e966396,destination=/mnt/models/chat_template,ro quay.io/ramalama/cuda:latest --model /mnt/models/llama3.2 --port 8080 --max-sequence-length 2048 --max_model_len 2048 --served-model-name llama3.2
Error: crun: cannot stat `/usr/lib64/libEGL_nvidia.so.565.77`: No such file or directory: OCI runtime attempted to invoke a command that was not found
Error: Failed to serve model llama3.2, for ramalama run command
$ ls -ls /usr/lib64/libEGL_nvidia.so*
4 lrwxrwxrwx 1 root root 23 Jun 17 02:00 /usr/lib64/libEGL_nvidia.so.0 -> libEGL_nvidia.so.575.64
1328 -rwxr-xr-x 1 root root 1358016 Jun 10 20:40 /usr/lib64/libEGL_nvidia.so.575.64
Neither did the variants below (each produces the same error):
ramalama --debug --runtime vllm run --ngl 0 --device /dev/dri/renderD128 llama3.2
ramalama --debug --runtime vllm run --ngl 0 --device /dev/dri/renderD129 llama3.2
ramalama --debug --runtime vllm run --ngl 0 --image=quay.io/ramalama/intel-gpu llama3.2
It looks like a different issue, as I'm running it on different hardware than originally reported.
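One thing that stands out (an assumption on my part, not a confirmed diagnosis): the crun error references libEGL_nvidia.so.565.77 while the installed library is 575.64, which suggests a stale CDI spec left over from the previous driver. Regenerating it with nvidia-ctk and re-listing the devices might clear the error:
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ nvidia-ctk cdi list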
Can you try again with 0.12.1?
Usually the mismatch is caused by a screwed-up CUDA library install.
Can you try again with 0.12.1?
$ rpm -qv ramalama
ramalama-0.12.1-1.fc42.noarch
$ ramalama --debug --runtime vllm run llama3.2
2025-09-12 21:50:30 - DEBUG - run_cmd: nvidia-smi
2025-09-12 21:50:30 - DEBUG - Working directory: None
2025-09-12 21:50:30 - DEBUG - Ignore stderr: False
2025-09-12 21:50:30 - DEBUG - Ignore all: False
2025-09-12 21:50:31 - DEBUG - Command finished with return code: 0
2025-09-12 21:50:31 - DEBUG - run_cmd: podman inspect quay.io/ramalama/cuda:0.12
2025-09-12 21:50:31 - DEBUG - Working directory: None
2025-09-12 21:50:31 - DEBUG - Ignore stderr: False
2025-09-12 21:50:31 - DEBUG - Ignore all: True
2025-09-12 21:50:31 - DEBUG - Checking if 8080 is available
2025-09-12 21:50:31 - DEBUG - Checking if 8080 is available
2025-09-12 21:50:31 - DEBUG - exec_cmd: podman run --rm --label ai.ramalama.model=ollama://library/llama3.2:latest --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.port=8080 --label ai.ramalama.command=run --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 --runtime /usr/bin/nvidia-container-runtime -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -d -i --label ai.ramalama --name ramalama_hOiE8ftjHc --env=HOME=/tmp --init --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/library/llama3.2/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff,destination=/mnt/models/llama3.2,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/library/llama3.2/blobs/sha256-34bb5ab01051a11372a91f95f3fbbc51173eed8e7f13ec395b9ae9b8bd0e242b,destination=/mnt/models/config.json,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/library/llama3.2/blobs/sha256-966de95ca8a62200913e3f8bfbf84c8494536f1b94b49166851e76644e966396,destination=/mnt/models/chat_template,ro quay.io/ramalama/cuda:latest --model /mnt/models/llama3.2 --port 8080 --max-sequence-length 2048 --max_model_len 2048 --served-model-name llama3.2
9eeacd9b0ef0b7b2777c2f87bb358060c0084e02d2120e1b1370159ea0e3fe66
2025-09-12 21:50:32 - DEBUG - Waiting for container ramalama_hOiE8ftjHc to become healthy (timeout: 20s)...
send: b'GET /models HTTP/1.1\r\nHost: 127.0.0.1:8080\r\nAccept-Encoding: identity\r\n\r\n'
2025-09-12 21:50:32 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 104] Connection reset by peer
2025-09-12 21:50:33 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:34 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:35 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:36 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:37 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:38 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:39 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:40 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:41 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:42 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:43 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:44 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:45 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:46 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:47 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:48 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:49 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:50 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:51 - DEBUG - Health check of container ramalama_hOiE8ftjHc failed, retrying... Error: [Errno 111] Connection refused
2025-09-12 21:50:52 - DEBUG - run_cmd: podman logs ramalama_hOiE8ftjHc
2025-09-12 21:50:52 - DEBUG - Working directory: None
2025-09-12 21:50:52 - DEBUG - Ignore stderr: False
2025-09-12 21:50:52 - DEBUG - Ignore all: False
Error: no container with name or ID "ramalama_hOiE8ftjHc" found: no such container
Error: Command '['podman', 'logs', 'ramalama_hOiE8ftjHc']' returned non-zero exit status 125.
It's on F42 with the NVIDIA drivers from RPM Fusion and nvidia-container-toolkit from https://copr.fedorainfracloud.org/coprs/g/ai-ml/nvidia-container-toolkit/.
Grab 0.12.2 and try serve to get the error the service is hitting.
Here it comes:
$ rpm -qv ramalama
ramalama-0.12.2-1.fc42.noarch
$ ramalama --debug --runtime vllm run llama3.2
2025-09-18 09:16:38 - DEBUG - run_cmd: nvidia-smi
2025-09-18 09:16:38 - DEBUG - Working directory: None
2025-09-18 09:16:38 - DEBUG - Ignore stderr: False
2025-09-18 09:16:38 - DEBUG - Ignore all: False
2025-09-18 09:16:39 - DEBUG - Command finished with return code: 0
2025-09-18 09:16:39 - DEBUG - run_cmd: podman inspect quay.io/ramalama/cuda:0.12
2025-09-18 09:16:39 - DEBUG - Working directory: None
2025-09-18 09:16:39 - DEBUG - Ignore stderr: False
2025-09-18 09:16:39 - DEBUG - Ignore all: True
2025-09-18 09:16:39 - DEBUG - Checking if 8080 is available
2025-09-18 09:16:39 - DEBUG - Checking if 8080 is available
2025-09-18 09:16:39 - DEBUG - exec_cmd: podman run --rm --label ai.ramalama.model=ollama://library/llama3.2:latest --label ai.ramalama.engine=podman --label ai.ramalama.runtime=vllm --label ai.ramalama.port=8080 --label ai.ramalama.command=run --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 --runtime /usr/bin/nvidia-container-runtime -p 8080:8080 --security-opt=label=disable --cap-drop=all --security-opt=no-new-privileges --pull newer -t -d -i --label ai.ramalama --name ramalama_eNLXYd7owQ --env=HOME=/tmp --init --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/library/llama3.2/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff,destination=/mnt/models/llama3.2,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/library/llama3.2/blobs/sha256-34bb5ab01051a11372a91f95f3fbbc51173eed8e7f13ec395b9ae9b8bd0e242b,destination=/mnt/models/config.json,ro --mount=type=bind,src=/home/dw/.local/share/ramalama/store/ollama/library/llama3.2/blobs/sha256-966de95ca8a62200913e3f8bfbf84c8494536f1b94b49166851e76644e966396,destination=/mnt/models/chat_template,ro quay.io/ramalama/cuda:latest --model /mnt/models/llama3.2 --port 8080 --max-sequence-length 0 --max_model_len 2048 --served-model-name llama3.2
eab6469c5771818d1572ad9ab97a7253775d0754de2e98aadd731da0c84ca642
2025-09-18 09:16:41 - DEBUG - Waiting for container ramalama_eNLXYd7owQ to become healthy (timeout: 20s)...
send: b'GET /models HTTP/1.1\r\nHost: 127.0.0.1:8080\r\nAccept-Encoding: identity\r\n\r\n'
2025-09-18 09:16:41 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 104] Connection reset by peer
2025-09-18 09:16:42 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:43 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:44 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:45 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:46 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:47 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:48 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:49 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:50 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:51 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:52 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:53 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:54 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:55 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:56 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:57 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:58 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:16:59 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:17:00 - DEBUG - Health check of container ramalama_eNLXYd7owQ failed, retrying... Error: [Errno 111] Connection refused
2025-09-18 09:17:01 - DEBUG - run_cmd: podman logs ramalama_eNLXYd7owQ
2025-09-18 09:17:01 - DEBUG - Working directory: None
2025-09-18 09:17:01 - DEBUG - Ignore stderr: False
2025-09-18 09:17:01 - DEBUG - Ignore all: False
Error: no container with name or ID "ramalama_eNLXYd7owQ" found: no such container
Error: Command '['podman', 'logs', 'ramalama_eNLXYd7owQ']' returned non-zero exit status 125.
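Note that ramalama launches the container with --rm and -d, so by the time the health check gives up and podman logs runs, the crashed container has already been removed (hence the "no such container" error above). A hedged way to catch the service's own error output (the container name below is just the one from the log above; substitute whatever name your run prints) is to tail the logs while the container is still up:
$ ramalama --debug --runtime vllm serve llama3.2
# in a second terminal, before the 20 s health-check timeout expires:
$ podman ps --filter label=ai.ramalama --format '{{.Names}}'
$ podman logs -f ramalama_eNLXYd7owQ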