Medusa models seem to be slower than the original base models
System Info
Thank you for adding support for Medusa. When comparing Medusa models against their original base models with TGI, the base models turned out to be faster. I tested the following models (a rough timing sketch follows the list):
- text-generation-inference/gemma-7b-it-medusa
- text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa
- text-generation-inference/Mistral-7B-Instruct-v0.2-medusa
- FasterDecoding/medusa-vicuna-7b-v1.3 (revision="refs/pr/1")
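For context, a rough end-to-end latency check along the lines below can be used to compare a Medusa deployment against its base-model deployment. This is only a sketch: the prompt, request count, and max_new_tokens values are arbitrary placeholders, and it assumes a TGI server is reachable on localhost:8081 as in the reproduction command further down.

```shell
# Rough latency sketch: run once against the Medusa deployment and once against the
# base-model deployment, then compare the reported wall-clock times per request.
# Assumes a TGI server is listening on localhost:8081 (see the docker command below).
for i in $(seq 1 5); do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    http://localhost:8081/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "Write a short story about a robot.", "parameters": {"max_new_tokens": 256}}'
done
```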
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Command used:
docker run --gpus all --shm-size 1g -p 8081:80 ghcr.io/huggingface/text-generation-inference:1.4.3 --model-id text-generation-inference/Mistral-7B-Instruct-v0.2-medusa --num-shard 1
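For the baseline numbers, the corresponding non-Medusa model can be launched with the same command. The sketch below assumes mistralai/Mistral-7B-Instruct-v0.2 is the base checkpoint matching the Medusa variant above.

```shell
# Baseline for comparison (assumption: mistralai/Mistral-7B-Instruct-v0.2 is the base
# checkpoint corresponding to the Medusa fine-tune used above).
docker run --gpus all --shm-size 1g -p 8081:80 \
  ghcr.io/huggingface/text-generation-inference:1.4.3 \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 --num-shard 1
```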
Hardware:
1xH100
Expected behavior
Medusa models should be faster than the original non-Medusa base models.