
[4/n][torchtune integration] support lazy load model during inference


What does this PR do?

In this PR, we refactor the meta reference inference logic to support:

  • loading the model during model registration instead of when the server spins up
  • running inference on a fine-tuned model checkpoint on top of a native llama model

Why are these changes needed?

To address the existing pain points:

  • users cannot lazy load the model or hot-swap the inference checkpoint after the server has spun up
    • this blocks running inference and eval on the same server for a fine-tuned checkpoint after post training
  • users cannot run inference on a checkpoint fine-tuned on top of a native llama model

Expected user experience changes

  • The inference model is no longer loaded when the server spins up. Instead, it is loaded during model registration. If the user adds the model as a models resource in run.yaml, it is registered and loaded automatically when the server starts (see the sketch after this list). An optional 'skip_initialize' flag in the model metadata skips model loading during registration.
  • An optional 'llama_model' flag in the model metadata identifies the base model of the Model class, used for validation and for initializing the model architecture; the model identifier no longer needs to be a native llama model.
  • The default inference model name changes from 'meta-llama/Llama-3.2-3B-Instruct' to 'Llama3.2-3B-Instruct'
    • It aligns with the checkpoint folder name produced by 'llama model download'
    • It aligns with the descriptor name defined in the llama-models SKU list: https://github.com/meta-llama/llama-models/blob/bf5b0c4fe74e3b51ed5904ab65e3f671b194d2a9/models/datatypes.py#L95
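
As an illustration, a models entry in run.yaml could look roughly like the sketch below. This is a minimal sketch, not the exact schema from this PR: the entry layout and the provider_id value are assumptions; only the 'llama_model' metadata key comes from the description above.

```yaml
# Hypothetical run.yaml fragment (layout and provider_id are illustrative assumptions)
models:
  - model_id: my-finetuned-3b              # no longer has to be a native llama model name
    provider_id: meta-reference-inference  # assumed id of the meta reference inference provider
    metadata:
      llama_model: Llama3.2-3B-Instruct    # base model used for validation and model arch init
```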

test

run python llama_stack/scripts/distro_codegen.py

run unit tests

  • torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="Llama3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
  • torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="Llama3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_model_registration.py

test post training experience

on server side, run: llama stack run llama_stack/templates/experimental-post-training/run.yaml

The server spins up without the model loaded.

(screenshot: server startup with no model loaded)

on client side, run: llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 models register Llama3.2-3B-Instruct

The model is registered successfully and loaded.

(screenshots: registration output and model loading on the server)

If 'skip_initialize' is added to the model metadata, the model is registered but not loaded.
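
For reference, the metadata that triggers this behavior might look like the fragment below; this is a sketch based only on the flag names mentioned in this PR, not a verbatim config.

```yaml
# Hypothetical model metadata (only the key names come from this PR's description)
metadata:
  llama_model: Llama3.2-3B-Instruct   # base llama model backing this checkpoint
  skip_initialize: true               # register the model without loading it into memory
```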

on client side, run: llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 inference chat-completion --message "hello, what model are you?"

Inference runs successfully.

(screenshot: chat completion response)

test inference experience

run: llama stack run llama_stack/templates/meta-reference-gpu/run.yaml

The model is loaded at startup since it is listed in the models resource list in run.yaml.

(screenshot: model loaded during server startup)

on client side, run: llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 inference chat-completion --message "hello, what model are you?"

Inference completes successfully.

(screenshot: chat completion response)

SLR722 · Dec 13 '24 05:12