llama-stack
failed to run inference on 3.2 11B vision model
Hi,
I'm trying to run inference on the 11B vision model but hit this error when running the client. I was able to download, build, configure, and run the server without issue.
Environment: AWS p4d instance (8x A100 40GB), conda.
torch 2.4.1
torchvision 0.19.1
llama_models 0.0.36
llama_stack 0.0.36
python -m llama_stack.apis.inference.client localhost 5000
User>hello world, write me a 2 sentence poem about the moon
Traceback (most recent call last):
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpx/_transports/default.py", line 72, in map_httpcore_exceptions
yield
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpx/_transports/default.py", line 377, in handle_async_request
resp = await self._pool.handle_async_request(req)
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 216, in handle_async_request
raise exc from None
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 196, in handle_async_request
response = await connection.handle_async_request(
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_async/connection.py", line 99, in handle_async_request
raise exc
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_async/connection.py", line 76, in handle_async_request
stream = await self._connect(request)
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_async/connection.py", line 122, in _connect
stream = await self._network_backend.connect_tcp(**kwargs)
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_backends/auto.py", line 30, in connect_tcp
return await self._backend.connect_tcp(
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_backends/anyio.py", line 114, in connect_tcp
with map_exceptions(exc_map):
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.ConnectError: All connection attempts failed
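The ConnectError means the client never reached the server at all. A quick sanity check (a minimal sketch, assuming the server should be reachable at localhost:5000) is to hit the /healthcheck endpoint the server advertises:

import httpx

# minimal sketch: check whether the /healthcheck endpoint is reachable over
# IPv4 before running the inference client
try:
    resp = httpx.get("http://127.0.0.1:5000/healthcheck", timeout=5.0)
    print(resp.status_code, resp.text)
except httpx.ConnectError as exc:
    print("connection failed:", exc)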
config file:
cat 11b_vision-run.yaml
built_at: '2024-09-27T08:23:01.418833'
image_name: 11b_vision
docker_image: null
conda_env: 11b_vision
apis_to_serve:
- inference
- agents
- memory_banks
- memory
- safety
- models
- shields
api_providers:
  inference:
    providers:
    - meta-reference
  safety:
    providers:
    - meta-reference
  agents:
    provider_id: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: /home/ubuntu/.llama/runtime/kvstore.db
  memory:
    providers:
    - meta-reference
  telemetry:
    provider_id: meta-reference
    config: {}
routing_table:
  inference:
  - provider_id: meta-reference
    config:
      model: Llama3.2-11B-Vision-Instruct
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
    routing_key: Llama3.2-11B-Vision-Instruct
  safety:
  - provider_id: meta-reference
    config:
      llama_guard_shield: null
      prompt_guard_shield: null
    routing_key: llama_guard
  - provider_id: meta-reference
    config:
      llama_guard_shield: null
      prompt_guard_shield: null
    routing_key: code_scanner_guard
  - provider_id: meta-reference
    config:
      llama_guard_shield: null
      prompt_guard_shield: null
    routing_key: injection_shield
  - provider_id: meta-reference
    config:
      llama_guard_shield: null
      prompt_guard_shield: null
    routing_key: jailbreak_shield
  memory:
  - provider_id: meta-reference
    config: {}
    routing_key: vector
Build and run seem to be correct:
Loaded in 16.81 seconds
Finished model load YES READY
Serving GET /healthcheck
Serving POST /inference/chat_completion
Serving POST /inference/completion
Serving POST /inference/embeddings
Serving GET /models/get
Serving GET /models/list
Serving GET /memory_banks/get
Serving GET /memory_banks/list
Serving POST /safety/run_shield
Serving GET /shields/get
Serving GET /shields/list
Serving GET /shields/get
Serving GET /shields/list
Serving POST /memory/create
Serving DELETE /memory/documents/delete
Serving DELETE /memory/drop
Serving GET /memory/documents/get
Serving GET /memory/get
Serving POST /memory/insert
Serving GET /memory/list
Serving POST /memory/query
Serving POST /memory/update
Serving POST /agents/create
Serving POST /agents/session/create
Serving POST /agents/turn/create
Serving POST /agents/delete
Serving POST /agents/session/delete
Serving POST /agents/session/get
Serving POST /agents/step/get
Serving POST /agents/turn/get
Serving GET /memory_banks/get
Serving GET /memory_banks/list
Serving GET /models/get
Serving GET /models/list
Listening on :::5000
INFO: Started server process [31452]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
There is very little documentation and there are few examples on how to serve and run inference with this new stack.
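For reference, here is a minimal sketch of calling the chat completion endpoint shown in the server log directly over HTTP. The payload field names are an assumption based on the usual chat-completion shape, so check the llama_stack API spec for the exact schema:

import httpx

# hypothetical request body -- field names are assumed, verify them against
# the llama_stack inference API schema
payload = {
    "model": "Llama3.2-11B-Vision-Instruct",
    "messages": [{"role": "user", "content": "write me a 2 sentence poem about the moon"}],
    "stream": False,
}
resp = httpx.post("http://127.0.0.1:5000/inference/chat_completion", json=payload, timeout=60.0)
print(resp.status_code)
print(resp.text)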
When you are starting the server on an AWS instance, use the --disable-ipv6 option.
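For context, the server log above shows it listening on the IPv6 wildcard (http://[::]:5000) while the client connects to localhost, so an IPv4/IPv6 mismatch between the two is a likely cause of the ConnectError. A minimal sketch (nothing llama-stack specific) to see what localhost resolves to on the client side:

import socket

# print every address family and sockaddr that "localhost" resolves to;
# comparing this against the server's bind address helps spot an IPv4/IPv6
# mismatch
for family, _, _, _, sockaddr in socket.getaddrinfo("localhost", 5000):
    print(family, sockaddr)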
Thanks @varunfb, this is working now.
After reading the code, I found that to run vision models we also need to pass the multimodal args:
llama stack run 11b_vision --port 5000 --disable-ipv6
python -m llama_stack.apis.inference.client localhost 5000 True True path_to_image