vLLM can't find model from llama download
System Info
It's using the versions installed by pip during the llama stack build. I have an NVIDIA GPU.
Information
- [X] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
I do a llama download and obtain the Llama-3.2-3B model in ~/.llama/checkpoints/Llama-3.2-3B. Next I do llama stack build and select vllm as the inference engine. Then I run the following:
```
podman run -it --rm -p 5000:5000 --cgroup-conf=memory.high=28g --device nvidia.com/gpu=all -v /home/user/.llama/builds/docker/my-local-stack-run.yaml:/app/llamastack-run.yaml distribution-my-local-stack
```
It ends with: Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.2-3B/resolve/main/config.json. Access to model meta-llama/Llama-3.2-3B is restricted. You must have access to it and be authenticated to access it. Please log in.
I guess the build script does not find the model or include its path as a volume mount in the container. How do you provide the model in the location where vLLM expects to find it?
Error logs
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/meta-llama/Llama-3.2-3B/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 347, in
```
Expected behavior
I would like it to find the model in the host system's ~/.llama directory and run it. This page, https://github.com/meta-llama/llama-stack/blob/main/docs/cli_reference.md, doesn't describe what to do with a downloaded model. The llama stack build command does not ask whether the model is already downloaded when vllm is used (it could look in checkpoints and offer one of those as a default menu item). It might be an improvement to add that question and search ~/.llama once the model name is known. Lastly, vLLM needs some arguments (--chat-template, --max-model-len, --enable-auto-tool-choice, etc.), and I don't see an entry in the yaml that might hold the values that should be passed to it.
OK, I've done more investigating. If I do this: huggingface-cli download meta-llama/Llama-3.2-1B
Then I add -v ~/.cache/huggingface/hub:/root/.cache/huggingface/hub to the basic podman command that brings this up. If I enter the container, I can see that the model is there:
```
$ huggingface-cli scan-cache
REPO ID                 REPO TYPE SIZE ON DISK NB FILES LAST_ACCESSED LAST_MODIFIED REFS LOCAL PATH
meta-llama/Llama-3.2-1B model     5.0G         13       4 hours ago   5 hours ago   main /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B

Done in 0.0s. Scanned 1 repo(s) for a total of 5.0G.
```
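To make that concrete, here is roughly the full command at this point; a minimal sketch reusing the same image, port, and run yaml as the earlier podman command, with only the cache mount added (adjust the paths to your setup):

```sh
# Download the model into the host's Hugging Face cache
huggingface-cli download meta-llama/Llama-3.2-1B

# Start the stack container with that cache mounted where huggingface_hub
# looks for it inside the container (/root/.cache/huggingface/hub)
podman run -it --rm -p 5000:5000 \
  --cgroup-conf=memory.high=28g \
  --device nvidia.com/gpu=all \
  -v /home/user/.llama/builds/docker/my-local-stack-run.yaml:/app/llamastack-run.yaml \
  -v ~/.cache/huggingface/hub:/root/.cache/huggingface/hub \
  distribution-my-local-stack
```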
However, starting the API server from there is a no-go: [rank0]: Access to model meta-llama/Llama-3.2-1B is restricted. You must have access to it and be authenticated to access it. Please log in.
https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache says that you have to check for a cache like this:
```python
from huggingface_hub import try_to_load_from_cache, _CACHED_NO_EXIST

filepath = try_to_load_from_cache("meta-llama/Llama-3.2-1B", "config.json")
if isinstance(filepath, str):
    # file exists and is cached
    ...
elif filepath is _CACHED_NO_EXIST:
    # non-existence of file is cached
    ...
else:
    # file is not cached
    ...
```
But
```
$ cd llama-stack
$ grep -rl try_to_load_from_cache *
$
```
So, it doesn't look like there is a way to use a cached model.
- Why is there a "llama download" that puts models in the wrong place (from a transformers PoV)?
- Why isn't there code to use a cache if one is supplied? Most people download once and can't wait tens of minutes for a fresh download.
- Why is there a difference in the model format that "llama download" creates and "huggingface-cli download" creates? Are they interchangeable?
So, it appears that the trick is that you have to inject your Hugging Face token into the environment of the container. Then it finds the cache and doesn't re-download. This is undocumented.
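A minimal sketch of that, assuming the token is available on the host as HF_TOKEN (the variable name huggingface_hub reads may differ with older versions, e.g. HUGGING_FACE_HUB_TOKEN):

```sh
# Your Hugging Face access token (placeholder value)
export HF_TOKEN=hf_xxxxxxxxxxxx

# Same command as before, plus the token in the container environment;
# huggingface_hub can now validate access to the gated repo and reuses
# the mounted cache instead of downloading again.
podman run -it --rm -p 5000:5000 \
  --cgroup-conf=memory.high=28g \
  --device nvidia.com/gpu=all \
  -e HF_TOKEN="$HF_TOKEN" \
  -v /home/user/.llama/builds/docker/my-local-stack-run.yaml:/app/llamastack-run.yaml \
  -v ~/.cache/huggingface/hub:/root/.cache/huggingface/hub \
  distribution-my-local-stack
```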
I also understand from reviewing the project that the model download and the vLLM support came at different times. I'd suggest updating the documentation to say that model download is purely for looking at metadata, and taking it out of the getting started section; reading the getting started doc, it sounds like a required step, which it isn't. Or, download a format that transformers can use, put it where transformers expects it, and mention how to reuse it for local vLLM.
I suppose this can be closed out. I now understand how to get local vLLM running; it's just entirely undocumented. vLLM is so complex that I think the remote vLLM code should be enabled so that it can be used. I removed the # and it just works. It was easier to get running than the local vLLM. The local vLLM does not give you enough knobs for tuning how it works.
Thanks @stevegrubb -- all points correct! We will update the docs.
Do you think there's value in having a local (inline) vLLM implementation at all?
> Do you think there's value in having a local (inline) vLLM implementation at all?
That's a tricky question. If you are doing something simple, there might be a good reason to pack vLLM into the same container. But it is memory hungry: 22238MiB / 24564MiB, and this is the 1B model. So you'd really want to quantize the model (--quantization and --dtype) and tell vLLM not to build CUDA graphs (--enforce-eager). All of this would need to be documented: using huggingface-cli to download, where to mount it, injecting the HF token, in addition to passing in GPUs and a full docker/podman command line. Also, while I'm thinking about this, the gpu_memory_utilization default is a bit low. I think it defaults to 0.3. It needs to be bigger unless you have a huge amount of VRAM.
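For comparison, when vLLM's OpenAI-compatible server is run directly, those knobs are plain CLI flags; a rough sketch with illustrative values, not a recommendation for any particular model:

```sh
# Standalone vLLM server with the tuning knobs discussed above.
# Values are examples only; --quantization would additionally require
# a checkpoint quantized with a supported method.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-1B \
  --dtype half \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --port 8000
```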
I definitely think the remote::vllm provider is useful. You have full control over how the model gets served. You can add speculative decoding or even run it on a different GPU. If you already have a vLLM setup, all you need to do is point the url at it. The configuration of the container is much simpler since you do not need to enable a GPU, add extra memory, inject the token, or place the model in the right spot. But you do need to pass "--network host" so that llama-stack can contact the outside world.
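Roughly, the container invocation for the remote case looks like this; a sketch assuming the run yaml's vllm provider url points at a vLLM server already running on the host:

```sh
# Stack container for remote::vllm: no GPU passthrough, no HF token,
# and no model mount needed. --network host lets the stack reach the
# vLLM server listening on the host (and makes -p redundant).
podman run -it --rm --network host \
  -v /home/user/.llama/builds/docker/my-local-stack-run.yaml:/app/llamastack-run.yaml \
  distribution-my-local-stack
```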
Also, all demos I can find seem to require the safety API. If you have llama_guard_shield, where does that model get served?
Closing this out as I understand the whole recipe now. If there was something to track from this, feel free to re-open.
Yeah I will re-open it just so we track it for updating our docs. Thanks!