llama-stack
failed to run inference on 3.2 11B vision model
Hi,
I'm trying to run inference on the 11B vision model but hit this error when running the client. I was able to download, build, configure, and run the server without issue.
Environment: AWS p4d instance (8x A100 40GB), conda.
torch 2.4.1
torchvision 0.19.1
llama_models 0.0.36
llama_stack 0.0.36
python -m llama_stack.apis.inference.client localhost 5000
User>hello world, write me a 2 sentence poem about the moon
Traceback (most recent call last):
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpx/_transports/default.py", line 72, in map_httpcore_exceptions
yield
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpx/_transports/default.py", line 377, in handle_async_request
resp = await self._pool.handle_async_request(req)
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 216, in handle_async_request
raise exc from None
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 196, in handle_async_request
response = await connection.handle_async_request(
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_async/connection.py", line 99, in handle_async_request
raise exc
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_async/connection.py", line 76, in handle_async_request
stream = await self._connect(request)
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_async/connection.py", line 122, in _connect
stream = await self._network_backend.connect_tcp(**kwargs)
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_backends/auto.py", line 30, in connect_tcp
return await self._backend.connect_tcp(
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_backends/anyio.py", line 114, in connect_tcp
with map_exceptions(exc_map):
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/home/ubuntu/conda/envs/llamastack-11b_vision/lib/python3.10/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.ConnectError: All connection attempts failed
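The ConnectError means the client never reached the server at all. A quick sanity check (a minimal sketch, assuming the server should be reachable at localhost:5000) is to hit the /healthcheck endpoint the server advertises:

import httpx

# minimal sketch: check whether the /healthcheck endpoint is reachable over
# IPv4 before running the inference client
try:
    resp = httpx.get("http://127.0.0.1:5000/healthcheck", timeout=5.0)
    print(resp.status_code, resp.text)
except httpx.ConnectError as exc:
    print("connection failed:", exc)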
config file:
cat 11b_vision-run.yaml
built_at: '2024-09-27T08:23:01.418833'
image_name: 11b_vision
docker_image: null
conda_env: 11b_vision
apis_to_serve:
- inference
- agents
- memory_banks
- memory
- safety
- models
- shields
api_providers:
  inference:
    providers:
    - meta-reference
  safety:
    providers:
    - meta-reference
  agents:
    provider_id: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: /home/ubuntu/.llama/runtime/kvstore.db
  memory:
    providers:
    - meta-reference
  telemetry:
    provider_id: meta-reference
    config: {}
routing_table:
  inference:
  - provider_id: meta-reference
    config:
      model: Llama3.2-11B-Vision-Instruct
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
    routing_key: Llama3.2-11B-Vision-Instruct
  safety:
  - provider_id: meta-reference
    config:
      llama_guard_shield: null
      prompt_guard_shield: null
    routing_key: llama_guard
  - provider_id: meta-reference
    config:
      llama_guard_shield: null
      prompt_guard_shield: null
    routing_key: code_scanner_guard
  - provider_id: meta-reference
    config:
      llama_guard_shield: null
      prompt_guard_shield: null
    routing_key: injection_shield
  - provider_id: meta-reference
    config:
      llama_guard_shield: null
      prompt_guard_shield: null
    routing_key: jailbreak_shield
  memory:
  - provider_id: meta-reference
    config: {}
    routing_key: vector
Build and run seem to be correct:
Loaded in 16.81 seconds
Finished model load YES READY
Serving GET /healthcheck
Serving POST /inference/chat_completion
Serving POST /inference/completion
Serving POST /inference/embeddings
Serving GET /models/get
Serving GET /models/list
Serving GET /memory_banks/get
Serving GET /memory_banks/list
Serving POST /safety/run_shield
Serving GET /shields/get
Serving GET /shields/list
Serving GET /shields/get
Serving GET /shields/list
Serving POST /memory/create
Serving DELETE /memory/documents/delete
Serving DELETE /memory/drop
Serving GET /memory/documents/get
Serving GET /memory/get
Serving POST /memory/insert
Serving GET /memory/list
Serving POST /memory/query
Serving POST /memory/update
Serving POST /agents/create
Serving POST /agents/session/create
Serving POST /agents/turn/create
Serving POST /agents/delete
Serving POST /agents/session/delete
Serving POST /agents/session/get
Serving POST /agents/step/get
Serving POST /agents/turn/get
Serving GET /memory_banks/get
Serving GET /memory_banks/list
Serving GET /models/get
Serving GET /models/list
Listening on :::5000
INFO: Started server process [31452]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
There is very little documentation and there are few examples on how to serve and run inference with this new stack.
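For reference, here is a minimal sketch of calling the chat completion endpoint shown in the server log directly over HTTP. The payload field names are an assumption based on the usual chat-completion shape, so check the llama_stack API spec for the exact schema:

import httpx

# hypothetical request body -- field names are assumed, verify them against
# the llama_stack inference API schema
payload = {
    "model": "Llama3.2-11B-Vision-Instruct",
    "messages": [{"role": "user", "content": "write me a 2 sentence poem about the moon"}],
    "stream": False,
}
resp = httpx.post("http://127.0.0.1:5000/inference/chat_completion", json=payload, timeout=60.0)
print(resp.status_code)
print(resp.text)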
When you are starting the server on an AWS instance, use the --disable-ipv6 option.
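For context, the server log above shows it listening on the IPv6 wildcard (http://[::]:5000) while the client connects to localhost, so an IPv4/IPv6 mismatch between the two is a likely cause of the ConnectError. A minimal sketch (nothing llama-stack specific) to see what localhost resolves to on the client side:

import socket

# print every address family and sockaddr that "localhost" resolves to;
# comparing this against the server's bind address helps spot an IPv4/IPv6
# mismatch
for family, _, _, _, sockaddr in socket.getaddrinfo("localhost", 5000):
    print(family, sockaddr)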
Thanks @varunfb, this is working now.
After reading the code, I found that to run vision models we also need to pass the multimodal args:
llama stack run 11b_vision --port 5000 --disable-ipv6
python -m llama_stack.apis.inference.client localhost 5000 True True path_to_image