vLLM can't find model from llama download
System Info
It's using the versions installed by pip during the llama stack build. I have an NVIDIA GPU.
Information
- [X] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
I do a llama download and obtain the Llama-3.2-3B model in ~/.llama/checkpoints/Llama-3.2-3B. Next I do llama stack build and select vllm as the inference engine. Then I run the following:
```
podman run -it --rm -p 5000:5000 --cgroup-conf=memory.high=28g --device nvidia.com/gpu=all -v /home/user/.llama/builds/docker/my-local-stack-run.yaml:/app/llamastack-run.yaml distribution-my-local-stack
```
It ends with: Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.2-3B/resolve/main/config.json. Access to model meta-llama/Llama-3.2-3B is restricted. You must have access to it and be authenticated to access it. Please log in.
I guess the build script does not find the model or include its path as a volume mount in the container. How do you provide the model in the location where vLLM expects to find it?
Error logs
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/meta-llama/Llama-3.2-3B/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 347, in
```
Expected behavior
I would like it to find the model in the host system's ~/.llama directory and run it. This page, https://github.com/meta-llama/llama-stack/blob/main/docs/cli_reference.md, doesn't describe what to do with a downloaded model. The llama stack build command does not ask whether the model is already downloaded when vllm is used (it could look in checkpoints and offer one of those as a default menu item). It might be an improvement to add that question and search ~/.llama once the model name is known. Lastly, vLLM needs some arguments (--chat-template, --max-model-len, --enable-auto-tool-choice, etc.), and I don't see an entry in the yaml that might hold the values that should be passed to it.
OK, I've done more investigating. If I do this: huggingface-cli download meta-llama/Llama-3.2-1B
Then I add -v ~/.cache/huggingface/hub:/root/.cache/huggingface/hub to the basic podman command that brings this up. If I enter the container, I can see that the model is there:
```
$ huggingface-cli scan-cache
REPO ID                 REPO TYPE SIZE ON DISK NB FILES LAST_ACCESSED LAST_MODIFIED REFS LOCAL PATH
meta-llama/Llama-3.2-1B model     5.0G         13       4 hours ago   5 hours ago   main /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B

Done in 0.0s. Scanned 1 repo(s) for a total of 5.0G.
```
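To make that concrete, here is roughly the full command at this point; a minimal sketch reusing the same image, port, and run yaml as the earlier podman command, with only the cache mount added (adjust the paths to your setup):

```sh
# Download the model into the host's Hugging Face cache
huggingface-cli download meta-llama/Llama-3.2-1B

# Start the stack container with that cache mounted where huggingface_hub
# looks for it inside the container (/root/.cache/huggingface/hub)
podman run -it --rm -p 5000:5000 \
  --cgroup-conf=memory.high=28g \
  --device nvidia.com/gpu=all \
  -v /home/user/.llama/builds/docker/my-local-stack-run.yaml:/app/llamastack-run.yaml \
  -v ~/.cache/huggingface/hub:/root/.cache/huggingface/hub \
  distribution-my-local-stack
```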
However, starting the API server from there is a no-go: [rank0]: Access to model meta-llama/Llama-3.2-1B is restricted. You must have access to it and be authenticated to access it. Please log in.
https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache says that you have to check for a cache like this:
```python
from huggingface_hub import try_to_load_from_cache, _CACHED_NO_EXIST

filepath = try_to_load_from_cache("meta-llama/Llama-3.2-1B", "config.json")
if isinstance(filepath, str):
    # file exists and is cached
    ...
elif filepath is _CACHED_NO_EXIST:
    # non-existence of file is cached
    ...
else:
    # file is not cached
    ...
```
But
```
$ cd llama-stack
$ grep -rl try_to_load_from_cache *
$
```
So, it doesn't look like there is a way to use a cached model.
- Why is there a "llama download" that puts models in the wrong place (from a transformers PoV)?
- Why isn't there code to use a cache if one is supplied? Most people download once and can't wait tens of minutes for a fresh download.
- Why is there a difference in the model format that "llama download" creates and "huggingface-cli download" creates? Are they interchangeable?
So, it appears that the trick is that you have to inject your Hugging Face token into the environment of the container. Then it finds the cache and doesn't re-download. This is undocumented.
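A minimal sketch of that, assuming the token is available on the host as HF_TOKEN (the variable name huggingface_hub reads may differ with older versions, e.g. HUGGING_FACE_HUB_TOKEN):

```sh
# Your Hugging Face access token (placeholder value)
export HF_TOKEN=hf_xxxxxxxxxxxx

# Same command as before, plus the token in the container environment;
# huggingface_hub can now validate access to the gated repo and reuses
# the mounted cache instead of downloading again.
podman run -it --rm -p 5000:5000 \
  --cgroup-conf=memory.high=28g \
  --device nvidia.com/gpu=all \
  -e HF_TOKEN="$HF_TOKEN" \
  -v /home/user/.llama/builds/docker/my-local-stack-run.yaml:/app/llamastack-run.yaml \
  -v ~/.cache/huggingface/hub:/root/.cache/huggingface/hub \
  distribution-my-local-stack
```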
I also understand from reviewing the project that the model download and the vLLM support came at different times. I'd suggest updating the documentation to say that model download is purely for looking at metadata, and taking it out of the getting started section; reading the getting started doc, it sounds like a required step, which it isn't. Or, download a format that transformers can use, put it where transformers expects it, and mention how to reuse it for local vLLM.
I suppose this can be closed out. I now understand how to get local vLLM running; it's just entirely undocumented. vLLM is so complex that I think the remote vLLM code should be enabled so that it can be used. I removed the # and it just works. It was easier to get running than the local vLLM. The local vLLM does not give you enough knobs for tuning how it works.
Thanks @stevegrubb -- all points correct! We will update the docs.
Do you think there's value in having a local (inline) vLLM implementation at all?
> Do you think there's value in having a local (inline) vLLM implementation at all?
That's a tricky question. If you are doing something simple, there might be a good reason to pack vLLM into the same container. But it is memory hungry: 22238MiB / 24564MiB, and this is the 1B model. So you'd really want to quantize the model (--quantization and --dtype) and tell vLLM not to build CUDA graphs (--enforce-eager). All of this would need to be documented: using huggingface-cli to download, where to mount it, injecting the HF token, in addition to passing in GPUs and a full docker/podman command line. Also, while I'm thinking about this, the gpu_memory_utilization default is a bit low. I think it defaults to 0.3. It needs to be bigger unless you have a huge amount of VRAM.
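For comparison, when vLLM's OpenAI-compatible server is run directly, those knobs are plain CLI flags; a rough sketch with illustrative values, not a recommendation for any particular model:

```sh
# Standalone vLLM server with the tuning knobs discussed above.
# Values are examples only; --quantization would additionally require
# a checkpoint quantized with a supported method.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-1B \
  --dtype half \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --port 8000
```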
I definitely think the remote::vllm provider is useful. You have full control over how the model gets served. You can add speculative decoding or even run it on a different GPU. If you already have a vLLM setup, all you need to do is point the url at it. The configuration of the container is much simpler since you do not need to enable a GPU, add extra memory, inject the token, or place the model in the right spot. But you do need to pass "--network host" so that llama-stack can contact the outside world.
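Roughly, the container invocation for the remote case looks like this; a sketch assuming the run yaml's vllm provider url points at a vLLM server already running on the host:

```sh
# Stack container for remote::vllm: no GPU passthrough, no HF token,
# and no model mount needed. --network host lets the stack reach the
# vLLM server listening on the host (and makes -p redundant).
podman run -it --rm --network host \
  -v /home/user/.llama/builds/docker/my-local-stack-run.yaml:/app/llamastack-run.yaml \
  distribution-my-local-stack
```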
Also, all demos I can find seem to require the safety API. If you have llama_guard_shield, where does that model get served?
Closing this out as I understand the whole recipe now. If there was something to track from this, feel free to re-open.
Yeah I will re-open it just so we track it for updating our docs. Thanks!