
Medusa models seem to be slower than the original base models

infinitylogesh opened this issue 11 months ago · 0 comments

System Info

Thank you for adding support for Medusa. In my comparison of Medusa models against their original base models with TGI, the base models appeared to be quicker.

I tested the following models:

  • text-generation-inference/gemma-7b-it-medusa
  • text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa
  • text-generation-inference/Mistral-7B-Instruct-v0.2-medusa
  • FasterDecoding/medusa-vicuna-7b-v1.3 (revision="refs/pr/1")

[Screenshot, 2024-03-13: benchmark comparison of the models above]

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

Command used:

docker run --gpus all --shm-size 1g -p 8081:80 ghcr.io/huggingface/text-generation-inference:1.4.3 --model-id text-generation-inference/Mistral-7B-Instruct-v0.2-medusa --num-shard 1 
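
For reference, a minimal latency-timing sketch against the running container (my own illustration, not the benchmark behind the screenshot; the endpoint and payload shape are TGI's standard /generate API, while the prompt, run count, and port are assumptions taken from the command above):

```python
import time
import requests

URL = "http://localhost:8081/generate"   # port 8081 from the -p mapping above
PROMPT = "Explain speculative decoding in one paragraph."  # arbitrary test prompt
MAX_NEW_TOKENS = 128

def mean_latency(n_runs: int = 5) -> float:
    """Mean seconds per request over n_runs identical generations."""
    payload = {"inputs": PROMPT, "parameters": {"max_new_tokens": MAX_NEW_TOKENS}}
    # Warm-up request so one-off startup costs don't skew the numbers.
    requests.post(URL, json=payload, timeout=300).raise_for_status()
    total = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        requests.post(URL, json=payload, timeout=300).raise_for_status()
        total += time.perf_counter() - start
    return total / n_runs

if __name__ == "__main__":
    s = mean_latency()
    # Throughput estimate assumes the full max_new_tokens were generated
    # (generation may stop early at EOS, so treat this as approximate).
    print(f"mean latency: {s:.2f}s (~{MAX_NEW_TOKENS / s:.1f} tokens/s)")
```

Running the same script against a Medusa container and its base-model counterpart gives a like-for-like comparison.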

Hardware:

1xH100

Expected behavior

Medusa models should be faster than the original non-Medusa models, since Medusa's speculative decoding is designed to emit multiple tokens per forward pass.
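
As a rough, illustrative model (my sketch, not TGI's implementation): if a speculative step costs some overhead factor relative to a plain decode step but yields several accepted tokens on average, the expected speedup is their ratio, so low acceptance rates or high per-step overhead can make Medusa a net slowdown:

```python
# Back-of-the-envelope speedup model for Medusa-style speculative decoding.
# Both parameters are illustrative assumptions, not measured TGI values.
def expected_speedup(mean_accepted_tokens: float, step_overhead: float) -> float:
    """Expected speedup over plain token-by-token autoregressive decoding."""
    return mean_accepted_tokens / step_overhead

print(expected_speedup(2.5, 1.2))  # ~2.08x: healthy acceptance -> faster
print(expected_speedup(1.1, 1.3))  # ~0.85x: poor acceptance -> slower, as reported here
```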

— infinitylogesh, Mar 13 '24 17:03