
[ExecuTorch] Lower and run native Gemma e2e in ExecuTorch

guangy10 opened this issue on Jun 29, 2024

This PR is a prototype showcasing the minimal changes required to lower Gemma-2b to ExecuTorch with a static KV cache and run it directly in the llama runner, without a single line of code change in the ExecuTorch runtime.

By standardizing the contract between HuggingFace modeling code and the ExecuTorch runtime, any LLM on HuggingFace could use the llama runner as a universal runtime for a given backend.
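For context, below is a minimal sketch of what such an export path can look like in user code. It is illustrative only: the actual logic lives in export_hf_model.py from PR #4088, and the exact transformers/ExecuTorch APIs (e.g. StaticCache, to_edge) vary across versions, so treat the wrapper class, shapes, and the output file name gemma.pte as assumptions.

```python
# Illustrative sketch only -- the real logic lives in export_hf_model.py (PR #4088).
# Assumptions: transformers provides StaticCache, and the usual torch.export ->
# executorch.exir.to_edge flow applies; exact APIs differ across versions.
import torch
from transformers import AutoModelForCausalLM, StaticCache
from executorch.exir import to_edge


class ExportableGemma(torch.nn.Module):
    """Owns the static KV cache so the exported graph sees only plain tensor inputs."""

    def __init__(self, model, max_cache_len=128):
        super().__init__()
        self.model = model
        self.cache = StaticCache(
            config=model.config,
            max_batch_size=1,
            max_cache_len=max_cache_len,
            device="cpu",
            dtype=model.dtype,
        )

    def forward(self, input_ids, cache_position):
        out = self.model(
            input_ids=input_ids,
            past_key_values=self.cache,
            cache_position=cache_position,
            use_cache=True,
        )
        return out.logits


hf_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype=torch.float32)
hf_model.eval()
wrapper = ExportableGemma(hf_model)

# Static shapes: one token per decode step, written at a fixed cache position.
example_input_ids = torch.zeros((1, 1), dtype=torch.long)
example_cache_position = torch.tensor([0], dtype=torch.long)

with torch.no_grad():
    ep = torch.export.export(wrapper, (example_input_ids, example_cache_position))

et_program = to_edge(ep).to_executorch()  # ATen -> Edge -> ExecuTorch program
with open("gemma.pte", "wb") as f:        # hypothetical output name
    f.write(et_program.buffer)
```

Because the cache lives inside the wrapper and every shape is fixed at export time, the resulting .pte exposes the same simple tokens-in, logits-out interface the llama runner already expects, which is essentially the contract described above.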

Instructions to run the demo:

To run the demo, you need to follow this guide to install ExecuTorch, then patch PR #4088, which includes minor bug fixes in ExecuTorch as well as the export_hf_model.py script used to export and lower the model. From there, you can:

  1. Run export_hf_model.py to lower gemma-2b to ExecuTorch (an optional Python sanity check for the exported program is sketched after this list):
cd executorch  # from the ExecuTorch repo root
python -m examples.models.export_hf_model -hfm "google/gemma-2b" --export  # The model is exported with static dims and a static KV cache
  2. Run tokenizer.py to generate the tokenizer binary for the ExecuTorch runtime:
python -m examples.models.llama2.tokenizer.tokenizer -t <path_to_downloaded_gemma_checkpoint_dir>/tokenizer.model -o <your_out_dir>/tokenizer.bin
  3. Build and run the lowered model with the llama runner by following step 4 of this guide.
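As referenced in step 1, before building the llama runner you can optionally sanity-check the exported program from Python. This is a sketch under assumptions: it uses the ExecuTorch pybindings (portable_lib), which must be built as part of your install, and it reuses the hypothetical gemma.pte name and (input_ids, cache_position) signature from the export sketch above rather than the exact output of export_hf_model.py.

```python
# Optional sanity check (sketch): run one decode step on the exported program with
# the ExecuTorch Python bindings. Assumes the portable_lib pybindings are built and
# reuses the hypothetical gemma.pte / (input_ids, cache_position) signature from
# the export sketch above.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

module = _load_for_executorch("gemma.pte")
input_ids = torch.zeros((1, 1), dtype=torch.long)
cache_position = torch.tensor([0], dtype=torch.long)
outputs = module.forward([input_ids, cache_position])
print(outputs[0].shape)  # expect (1, 1, vocab_size) logits for one decode step
```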

NOTE: This prototype demonstrates the feasibility of exporting and running a native HF model in ExecuTorch by reusing the llama runner. It does NOT come with performance optimizations yet. Ongoing work along this path includes enabling 1) delegation, e.g. to XNNPACK, 2) custom SDPA, and 3) the parallel prefill recently enabled in #4068.
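As a rough illustration of item 1 (delegation), lowering could route supported subgraphs to XNNPACK at export time. This is a hedged sketch rather than the PR's implementation: it assumes the `ep` ExportedProgram from the earlier export sketch and the XnnpackPartitioner import path of the ExecuTorch version at hand, both of which may differ in practice.

```python
# Hedged sketch of delegation: partition supported ops to XNNPACK while lowering.
# Assumes the `ep` ExportedProgram from the export sketch above and this
# partitioner import path; both may differ in your ExecuTorch version.
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

edge = to_edge(ep)
edge = edge.to_backend(XnnpackPartitioner())  # delegate supported subgraphs to XNNPACK
et_program = edge.to_executorch()
with open("gemma_xnnpack.pte", "wb") as f:    # hypothetical output name
    f.write(et_program.buffer)
```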
