Mixtral nf4 performance 2x slower than expected
System Info
Latest Lorax version
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Compare mistral-7b nf4 perf to mixtral nf4 perf
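To make the comparison concrete, here is a minimal timing harness for measuring ms/token. The `fake_generate` stand-in is hypothetical; in a real run you would replace it with a client call against the lorax (or TGI) server for each model, and ideally subtract prefill time or keep the prompt short so the figure reflects decode latency.

```python
import time

def ms_per_token(generate, prompt, n_new_tokens):
    """Time a generate callable and return mean latency in ms per generated token."""
    start = time.perf_counter()
    generate(prompt, n_new_tokens)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / n_new_tokens

# Hypothetical stand-in for a real client call (e.g. POSTing to the
# server's generate endpoint); it sleeps to simulate a fixed 17 ms/token.
def fake_generate(prompt, n_new_tokens, latency_s=0.017):
    time.sleep(latency_s * n_new_tokens)

print(f"{ms_per_token(fake_generate, 'hello', 32):.1f} ms/token")
```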
Expected behavior
On a single a6000 using bitsandbytes nf4 quantization I'm seeing 17ms per token with Mistral-7b and 80ms per token with Mixtral.
My expectation is that Mixtral should run at roughly 2x the per-token cost of the 7b model (around 40ms per token), since only two experts are active per token. The current perf level seems to negate its advantage over a large non-MoE model.
Note that I'm also seeing this with TGI, so it's not a Lorax-specific issue.
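The arithmetic behind that expectation, assuming decode is memory-bandwidth bound and latency scales with active parameters (the parameter counts below are approximate):

```python
# Approximate parameter counts (assumptions, not from this report):
MISTRAL_7B_PARAMS = 7.2e9        # Mistral-7b total params
MIXTRAL_ACTIVE_PARAMS = 12.9e9   # Mixtral 8x7B, ~2 of 8 experts active per token

mistral_ms_per_token = 17.0      # measured figure from this report

# If ms/token scales with active params, Mixtral should cost ~1.8x Mistral-7b.
scale = MIXTRAL_ACTIVE_PARAMS / MISTRAL_7B_PARAMS
expected_mixtral_ms = mistral_ms_per_token * scale
print(f"expected ~{expected_mixtral_ms:.0f} ms/token vs observed 80 ms/token")
```

By this estimate the expected figure is closer to 30ms/token, which makes the observed 80ms/token an even larger gap than the 2x cited above.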
Hey @timohear, thanks for reporting. I can definitely take a look some time this week. There are some differences in how Mixtral is implemented compared to Mistral that might be contributing to the perf gap, but more thorough benchmarking on our side to pinpoint the bottlenecks would be a good next step.