Mixtral nf4 performance 2x slower than expected
System Info
Latest Lorax version
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Compare mistral-7b nf4 perf to mixtral nf4 perf
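To make the comparison concrete, here is a minimal timing harness for measuring ms/token. The `fake_generate` stand-in is hypothetical; in a real run you would replace it with a client call against the lorax (or TGI) server for each model, and ideally subtract prefill time or keep the prompt short so the figure reflects decode latency.

```python
import time

def ms_per_token(generate, prompt, n_new_tokens):
    """Time a generate callable and return mean latency in ms per generated token."""
    start = time.perf_counter()
    generate(prompt, n_new_tokens)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / n_new_tokens

# Hypothetical stand-in for a real client call (e.g. POSTing to the
# server's generate endpoint); it sleeps to simulate a fixed 17 ms/token.
def fake_generate(prompt, n_new_tokens, latency_s=0.017):
    time.sleep(latency_s * n_new_tokens)

print(f"{ms_per_token(fake_generate, 'hello', 32):.1f} ms/token")
```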
Expected behavior
On a single a6000 using bitsandbytes nf4 quantization I'm seeing 17ms per token with Mistral-7b and 80ms per token with Mixtral.
My expectation is that Mixtral should run at roughly 2x the per-token cost of the 7b model (around 40ms per token), since only two experts are active per token. The current perf level seems to negate its advantage over a large non-MoE model.
Note that I'm also seeing this with TGI, so it's not a Lorax-specific issue.
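The arithmetic behind that expectation, assuming decode is memory-bandwidth bound and latency scales with active parameters (the parameter counts below are approximate):

```python
# Approximate parameter counts (assumptions, not from this report):
MISTRAL_7B_PARAMS = 7.2e9        # Mistral-7b total params
MIXTRAL_ACTIVE_PARAMS = 12.9e9   # Mixtral 8x7B, ~2 of 8 experts active per token

mistral_ms_per_token = 17.0      # measured figure from this report

# If ms/token scales with active params, Mixtral should cost ~1.8x Mistral-7b.
scale = MIXTRAL_ACTIVE_PARAMS / MISTRAL_7B_PARAMS
expected_mixtral_ms = mistral_ms_per_token * scale
print(f"expected ~{expected_mixtral_ms:.0f} ms/token vs observed 80 ms/token")
```

By this estimate the expected figure is closer to 30ms/token, which makes the observed 80ms/token an even larger gap than the 2x cited above.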
Hey @timohear, thanks for reporting. I can definitely take a look some time this week. There are some differences in how Mixtral is implemented compared to Mistral that might be contributing to the perf gap, but more thorough benchmarking on our side to pinpoint the bottlenecks would be a good next step.