
refactor: support downloading any model from HF

cdoern opened this pull request 9 months ago • 5 comments

What does this PR do?

Given the work being done to support non-llama models, the download utility should be able to take any hf_repo/model and download a qualified model from HF. While the model might not be usable directly in llama stack quite yet, it's helpful to have a utility that can download any and all models.
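For illustration, here is a minimal sketch of what the generalized download could look like under the hood, assuming `huggingface_hub` is installed; the `download_hf_model` helper and the checkpoint layout are just illustrative, not the code in this PR:

```python
# Illustrative sketch only -- not the actual llama-stack implementation.
from pathlib import Path

from huggingface_hub import snapshot_download


def download_hf_model(repo_id: str, checkpoint_dir: str = "~/.llama/checkpoints") -> Path:
    """Download any repo (e.g. 'instructlab/granite-7b-lab') from Hugging Face."""
    # Mirror the layout shown in the test plan: ~/.llama/checkpoints/<org>/<model>
    target = Path(checkpoint_dir).expanduser() / repo_id
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```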

relates to #965

Test Plan

I ran this locally:

╰─ llama model download --model-id instructlab/granite-7b-lab --source huggingface
README.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.48k/7.48k [00:00<00:00, 75.6MB/s]
.gitattributes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.52k/1.52k [00:00<00:00, 42.2MB/s]
(…)2cf574a4a828140d3539ede4a_Untitled 1.png: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 206k/206k [00:00<00:00, 3.09MB/s]
(…)2cf574a4a828140d3539ede4a_Untitled 2.png: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 31.3k/31.3k [00:00<00:00, 17.7MB/s]
(…)b72cf574a4a828140d3539ede4a_Untitled.png: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 185k/185k [00:00<00:00, 3.54MB/s]
(…)72cf574a4a828140d3539ede4a_intuition.png: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 34.3k/34.3k [00:00<00:00, 3.31MB/s]
(…)Screenshot_2024-02-22_at_11.26.13_AM.png: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 394k/394k [00:00<00:00, 2.36MB/s]
paper.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 201k/201k [00:00<00:00, 4.60MB/s]
Fetching 19 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:07<00:00,  2.55it/s]

Successfully downloaded model to /Users/charliedoern/.llama/checkpoints/instructlab/granite-7b-lab

cdoern avatar Feb 07 '25 00:02 cdoern

@terrytangyuan good point. I wonder, though, what the future of llama model download is in a world where llama stack supports multiple model architectures. Would we expect users to pull from different types of registries before interacting with LLS, or would building that functionality into the llama CLI make sense for usability?

I am thinking specifically of things like OCI registries, HF, S3 buckets, etc. Could a catch-all utility command built into llama stack that provides general download support be useful for things like that?
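To make the idea concrete, a hypothetical sketch of such a catch-all dispatcher might look like the following; the `download` function, the source names, and the S3/OCI branches are purely illustrative stubs, not anything that exists in llama stack today:

```python
# Hypothetical sketch of a source-agnostic download dispatcher.
from pathlib import Path

from huggingface_hub import snapshot_download


def download(source: str, ref: str, dest: Path) -> Path:
    """Fetch a model from one of several registries into a local directory."""
    if source == "huggingface":
        snapshot_download(repo_id=ref, local_dir=str(dest))
    elif source == "s3":
        # e.g. download every object under the given bucket/prefix with boto3
        raise NotImplementedError("S3 support would go here")
    elif source == "oci":
        # e.g. pull an OCI artifact with oras or skopeo
        raise NotImplementedError("OCI registry support would go here")
    else:
        raise ValueError(f"unknown source: {source}")
    return dest
```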

cc @jaideepr97 since I know you are interested in things like this :)

cdoern avatar Feb 24 '25 01:02 cdoern

+1 on needing a better story around model retrieval, personally

If the model management logic needs to be outsourced to an existing CLI, we could consider adding it to llama-stack-client at least. There will be use cases that cannot rely on huggingface-cli, or that need to download models from other sources like @cdoern mentioned.

jaideepr97 avatar Feb 24 '25 19:02 jaideepr97

Sorry for the long comment!

Ideally, we would let the inference provider encapsulate the logic of serving new models - downloading models and deploying them (allocating compute etc) as needed. This should be true for llama and non-llama models.

There are a few scenarios for a new model, from the LS perspective:

  1. Remote inference service (like together/fireworks): The provider code would then just need to call the service to start serving the new model.

  2. Self-hosted inference service (like ollama/vllm/tgi) where the model is hosted in huggingface/other repos and the inference service is spun up as part of a llama stack distribution. In this case, we would probably need to enhance the provider to first download/pull the model and then deploy/serve it (a rough sketch of this flow follows the list). There would probably be some additional complexity in configuring parameters for model serving.

  3. Model checkpoint created as part of finetuning in Llama Stack. In this case, llama stack has full knowledge of the model checkpoint. We were thinking of relying on the /files interface to allow developers to download the checkpoint using llama-stack-client. But it's more likely that the developer wants to serve the finetuned model in an inference service that's already configured in the llama stack distro. We will need a way to simplify/encapsulate this in the provider code, since there are a few different steps needed for each service. For example, see the steps for ollama here: https://github.com/meta-llama/llama-stack/blob/main/docs/notebooks/Alpha_Llama_Stack_Post_Training.ipynb and for Fireworks here: https://docs.fireworks.ai/models/uploading-custom-models#uploading-the-model
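To make scenario 2 a bit more concrete, here is a rough sketch (not actual llama-stack code) of what a "download, then serve" provider hook might look like; the class and method names (`SelfHostedInferenceProvider`, `ensure_model_available`, `register_model`) are hypothetical:

```python
# Rough sketch of the "download then serve" flow from scenario 2 above.
from pathlib import Path

from huggingface_hub import snapshot_download


class SelfHostedInferenceProvider:
    def __init__(self, cache_dir: str = "~/.llama/checkpoints"):
        self.cache_dir = Path(cache_dir).expanduser()

    def ensure_model_available(self, repo_id: str) -> Path:
        """Pull the model weights before asking the serving backend to load them."""
        target = self.cache_dir / repo_id
        if not target.exists():
            snapshot_download(repo_id=repo_id, local_dir=str(target))
        return target

    def register_model(self, repo_id: str) -> None:
        """Hand the downloaded checkpoint to the serving backend (vLLM, Ollama, ...)."""
        path = self.ensure_model_available(repo_id)
        # Backend-specific serving/configuration logic would go here.
        print(f"would configure the backend to serve weights from {path}")
```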

Maybe we can think through what it might take to have the vLLM inference provider serve a model available on HF and a finetuned model. That might help us get to the right abstractions.

The llama model download flow currently mainly supports downloading Meta-hosted models. Folks can also use huggingface-cli to download models from HF. Not sure we want to enhance llama model download itself. Thoughts?

raghotham avatar Mar 06 '25 08:03 raghotham

I am with @terrytangyuan here. Do we have a very concrete use case for this extension? Without that, I think we should not open this can of worms right now -- this should be done as part of a broader workstream on model management. As @raghotham said, the main reason this utility exists was to support downloading Meta-hosted models only.

ashwinb avatar Mar 18 '25 22:03 ashwinb

This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

github-actions[bot] avatar May 18 '25 00:05 github-actions[bot]

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it!

github-actions[bot] avatar Jun 18 '25 00:06 github-actions[bot]