Question about Mixtral-8x7B inference
Hello, a quick question about Mixtral-8x7B inference. I've read online that during inference it uses a router network to select just 2 of the 8 experts per token to produce the output, which is why it's fast. I also know there are many techniques for batching requests, such as continuous batching. Would the Hugging Face inference engine's batching functionality still work, given that multiple inputs might trigger different experts in an input-dependent way due to the nature of an SMoE model?
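To make concrete what I mean, here is a rough, simplified sketch of top-2 routing in a sparse MoE layer (not Mixtral's actual code; the class name, shapes, and expert modules are made up for illustration). The point is that routing happens per token, so tokens from different batched requests can land on different experts within the same forward pass:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Minimal sketch of a sparse MoE layer with top-2 routing (illustrative only)."""

    def __init__(self, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        # Each "expert" here is a stand-in for the model's real feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim),
                          nn.GELU(),
                          nn.Linear(4 * hidden_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim) -- tokens from *all* batched requests, flattened.
        logits = self.router(x)                            # (num_tokens, num_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)  # per-token top-2 experts
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # no token in this batch was routed to expert e
            # Gather every token routed to expert e, regardless of which request it came from.
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

# Example: 5 tokens (possibly from different requests) flow through one layer together;
# each token independently picks its own 2 experts.
layer = Top2MoELayer(hidden_dim=16)
y = layer(torch.randn(5, 16))
```

In this sketch the per-expert gather works regardless of which request each token came from, which is why I'm wondering whether batched inference handles this transparently in practice.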