Medusa models seem to be slower than the original base models
System Info
Thank you for adding support for Medusa. When comparing Medusa models against their original base models with TGI, the base models turned out to be faster. I tested the following models (a rough timing sketch follows the list):
- text-generation-inference/gemma-7b-it-medusa
- text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa
- text-generation-inference/Mistral-7B-Instruct-v0.2-medusa
- FasterDecoding/medusa-vicuna-7b-v1.3 (revision="refs/pr/1")
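For context, a rough end-to-end latency check along the lines below can be used to compare a Medusa deployment against its base-model deployment. This is only a sketch: the prompt, request count, and max_new_tokens values are arbitrary placeholders, and it assumes a TGI server is reachable on localhost:8081 as in the reproduction command further down.

```shell
# Rough latency sketch: run once against the Medusa deployment and once against the
# base-model deployment, then compare the reported wall-clock times per request.
# Assumes a TGI server is listening on localhost:8081 (see the docker command below).
for i in $(seq 1 5); do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    http://localhost:8081/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "Write a short story about a robot.", "parameters": {"max_new_tokens": 256}}'
done
```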
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Command used:
docker run --gpus all --shm-size 1g -p 8081:80 ghcr.io/huggingface/text-generation-inference:1.4.3 --model-id text-generation-inference/Mistral-7B-Instruct-v0.2-medusa --num-shard 1
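For the baseline numbers, the corresponding non-Medusa model can be launched with the same command. The sketch below assumes mistralai/Mistral-7B-Instruct-v0.2 is the base checkpoint matching the Medusa variant above.

```shell
# Baseline for comparison (assumption: mistralai/Mistral-7B-Instruct-v0.2 is the base
# checkpoint corresponding to the Medusa fine-tune used above).
docker run --gpus all --shm-size 1g -p 8081:80 \
  ghcr.io/huggingface/text-generation-inference:1.4.3 \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 --num-shard 1
```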
Hardware:
1xH100
Expected behavior
Medusa models should be faster than the original non-Medusa base models.