
Roadmap

ctlllll opened this issue on Sep 12 '23 · 15 comments

Roadmap

Functionality

  • [x] #36
  • [x] #39
  • [ ] Distill from any model without access to the original training data
  • [ ] Batched inference
  • [ ] Fine-grained KV cache management

Integration

Local Deployment

  • [ ] #33
  • [ ] #32
  • [ ] #35

Serving

Research

  • [x] #34
  • [ ] Optimize the tree-based attention to reduce additional computation
  • [ ] Improve the acceptance scheme to generate more diverse sequences

ctlllll · Sep 12 '23

Looks like a promising roadmap. I think llama.cpp support should be given higher priority.

JianbangZ · Sep 12 '23

Agreed, faster t/s is really important for llama.cpp users.

Kimiko-AI · Sep 13 '23

Would love to see Medusa as a plugin for ooba's textgen webui, for Medusa-head models.

yhyu13 · Sep 13 '23

Would Medusa be compatible with GPTQ-quantized models?

Specifically, if two Medusa heads were fine-tuned on the unquantized and the quantized model respectively, would they be the same? Or could they be swapped?

yhyu13 · Sep 13 '23

> Would Medusa be compatible with GPTQ-quantized models?
>
> Specifically, if two Medusa heads were fine-tuned on the unquantized and the quantized model respectively, would they be the same? Or could they be swapped?

We haven't tried this, but we can draw an analogy to the 33B model we trained with bitsandbytes' 8-bit quantized base model, where the difference seems to be minor. Still, more investigation is needed :)

ctlllll · Sep 13 '23
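The bitsandbytes comparison above suggests a fairly direct path. As a minimal sketch, assuming the standard transformers 8-bit loading path and a hypothetical `ResBlock` head layout (the checkpoint name and head count are illustrative assumptions, not the project's actual code), attaching Medusa-style heads to a quantized base model could look like this:

```python
# A minimal sketch, NOT the official Medusa code: load a base model in 8-bit
# via bitsandbytes and attach Medusa-style decoding heads on top of it.
# The checkpoint name, number of heads, and ResBlock layout are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class ResBlock(nn.Module):
    """A small residual block, the kind of lightweight head Medusa adds."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.linear(x))

# The base model stays frozen and quantized; only the extra heads are trained.
base = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.3",   # illustrative checkpoint
    load_in_8bit=True,         # bitsandbytes 8-bit weights
    device_map="auto",
)
hidden_size = base.config.hidden_size
vocab_size = base.config.vocab_size
num_medusa_heads = 4           # assumption: head i predicts the token i+1 positions ahead

medusa_heads = nn.ModuleList(
    nn.Sequential(ResBlock(hidden_size),
                  nn.Linear(hidden_size, vocab_size, bias=False))
    for _ in range(num_medusa_heads)
)
```

Loading a GPTQ or AWQ checkpoint would only change the `from_pretrained` call; the heads operate on the base model's hidden states either way, which is why swapping heads across quantization formats seems plausible but still needs verification.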

Please consider supporting quantized models, like GPTQ, AWQ, etc.

aiapprentice101 · Sep 18 '23

> Please consider supporting quantized models, like GPTQ, AWQ, etc.

Thanks for the suggestion. Those models should be easily integrated just by loading the base model in those formats. We are trying to integrate Medusa into frameworks where the speed actually benefits from quantization, e.g., mlc-llm, llama.cpp.

ctlllll · Sep 18 '23

> > Please consider supporting quantized models, like GPTQ, AWQ, etc.
>
> Thanks for the suggestion. Those models should be easily integrated just by loading the base model in those formats. We are trying to integrate Medusa into frameworks where the speed actually benefits from quantization, e.g., mlc-llm, llama.cpp.

Exciting. Is there a timeline for llama.cpp support? What's your best guess?

JianbangZ · Sep 18 '23

> > > Please consider supporting quantized models, like GPTQ, AWQ, etc.
> >
> > Thanks for the suggestion. Those models should be easily integrated just by loading the base model in those formats. We are trying to integrate Medusa into frameworks where the speed actually benefits from quantization, e.g., mlc-llm, llama.cpp.
>
> Exciting. Is there a timeline for llama.cpp support? What's your best guess?

We'll start with MLC-LLM first, as it's more user-friendly for integration. For llama.cpp, we currently don't have the bandwidth to do it ourselves, and it would be greatly appreciated if there were volunteers who could help us with it :)

ctlllll · Sep 18 '23

🎉 Exciting News! 🎉

We are thrilled to announce that we have received an award from Chai Research! While the monetary value may not be substantial, we are dedicating it as a token of our appreciation for the invaluable contributions made by our community. The funds will be allocated as development bounties to incentivize the achievement of key milestones.

🏆 First Bounty: Porting Medusa to Llama.cpp #35 🏆
Bounty Amount: $100

ctlllll · Sep 18 '23

Hello @ctlllll, thanks for providing such a wonderful project. I am interested in the fine-grained KV cache management part. Could you offer more guidance on this?

I have been working on a demo of speculative sampling for a while:

https://github.com/feifeibear/LLMSpeculativeSampling

feifeibear · Sep 20 '23
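Since speculative sampling comes up here, a minimal sketch of its draft-then-verify acceptance rule may help frame the comparison with Medusa; this is a generic illustration of the technique (the function name and tensor shapes are assumptions), not code from the linked repository:

```python
# Minimal sketch of the speculative-sampling acceptance rule; a generic
# illustration, not code taken from the repository linked above.
import torch

def accept_draft_tokens(draft_tokens, p_draft, p_target):
    """draft_tokens: (k,) token ids proposed by the small draft model.
    p_draft:  (k, vocab)     draft-model probabilities at each draft position.
    p_target: (k + 1, vocab) target-model probabilities; the extra row is the
              distribution for the position right after the last draft token.
    Returns the accepted prefix plus one corrected (or bonus) token."""
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        q = p_draft[i, tok]
        p = p_target[i, tok]
        # Accept the draft token with probability min(1, p / q).
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(tok)
            continue
        # Rejected: resample from the residual distribution max(p - q, 0),
        # renormalized, then stop verifying further draft tokens.
        residual = torch.clamp(p_target[i] - p_draft[i], min=0.0)
        residual = residual / residual.sum()
        accepted.append(torch.multinomial(residual, 1).item())
        return accepted
    # Every draft token was accepted: sample a bonus token from the
    # target distribution at the next position.
    accepted.append(torch.multinomial(p_target[-1], 1).item())
    return accepted
```

Medusa replaces the separate draft model with extra heads on the base model itself, though its acceptance scheme differs (see the "Improve the acceptance scheme" roadmap item above).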

> Hello @ctlllll, thanks for providing such a wonderful project. I am interested in the fine-grained KV cache management part. Could you offer more guidance on this?
>
> I have been working on a demo of speculative sampling for a while:
>
> https://github.com/feifeibear/LLMSpeculativeSampling

Hi @feifeibear, thanks for your interest! In the current version, we implemented a pre-allocated KV cache with the philosophy of keeping the original HF APIs, aiming only to reduce the memory-movement cost when updating the KV cache. For something more dynamic, I think the PagedAttention mechanism in vllm might be a better reference :)

ctlllll · Sep 20 '23
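As a rough illustration of the pre-allocation idea described above (class and method names are assumptions, not the project's actual implementation), the cache can be allocated once at maximum length and written in place each step instead of concatenating tensors:

```python
# Minimal sketch of a pre-allocated KV cache; names and layout are
# illustrative assumptions, not the project's actual implementation.
import torch

class PreallocatedKVCache:
    def __init__(self, num_layers, num_kv_heads, head_dim, max_len,
                 batch_size=1, dtype=torch.float16, device="cpu"):
        # One buffer per layer holding keys (index 0) and values (index 1),
        # allocated once up front for the maximum sequence length.
        shape = (2, batch_size, num_kv_heads, max_len, head_dim)
        self.buffers = [torch.empty(shape, dtype=dtype, device=device)
                        for _ in range(num_layers)]
        self.cur_len = 0  # number of positions already filled

    def update(self, layer_idx, key, value):
        """key/value: (batch, num_kv_heads, new_len, head_dim).
        Copies the new entries in place and returns views over the filled
        part of the cache, avoiding per-step torch.cat re-allocations."""
        new_len = key.shape[2]
        end = self.cur_len + new_len
        buf = self.buffers[layer_idx]
        buf[0, :, :, self.cur_len:end] = key
        buf[1, :, :, self.cur_len:end] = value
        return buf[0, :, :, :end], buf[1, :, :, :end]

    def advance(self, new_len):
        # Call once per decoding step, after all layers have written.
        self.cur_len += new_len
```

The views returned by `update` can then stand in for the usual `past_key_values` tensors, keeping an HF-style calling convention while avoiding per-step re-allocation; PagedAttention-style paging would go further by allocating the cache in blocks on demand.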

Hey all, any updates on this?

nikshepsvn · Nov 21 '23

> Hey all, any updates on this?

We have some exciting stuff baking now. Let's wait and see :p

ctlllll · Nov 21 '23

Hi, could sglang be placed on the roadmap too? It's a recent release, also from lmsys (who made vllm), and it's faster.

https://github.com/sgl-project/sglang

nivibilla · Jan 22 '24