
PyTorch library for cost-effective, fast and easy serving of MoE models.

Results: 24 MoE-Infinity issues, sorted by recently updated

- Release experts-parallel version
- Correct README
- Support Arctic and Grok
- Remove installation dependency
- Remove circular dependency issue

Is there an unquantized version that can run on multiple GPUs?

- [x] API design
- [x] Documentation for installation and PyPI
- [x] Performance table
- [x] Support Mixtral multi-GPU
- [ ] Load trace

# Pull Request: Local Server Beta for OpenAI-Compatible APIs

This PR introduces a beta version of a local server that provides OpenAI-compatible APIs, specifically `v1/chat/completions` and `v1/completions`. This initial version...
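
As an illustration only: a minimal client sketch for such an OpenAI-compatible endpoint might look like the following. The host, port, and model name are assumptions, not values taken from this PR.

```python
import requests

# Hypothetical local endpoint; the actual host and port depend on how the
# MoE-Infinity server is launched.
BASE_URL = "http://localhost:8000"

payload = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model id
    "messages": [
        {"role": "user", "content": "Explain mixture-of-experts in one sentence."}
    ],
    "max_tokens": 64,
}

# Standard OpenAI-compatible chat completion request.
resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```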

A Colab T4 instance has 12GB of system RAM and 16GB of GPU memory, while the quantized Mixtral checkpoint is a single 26GB file, so it cannot be loaded into memory when creating the custom format for offloading.

enhancement

I am using a vLLM server to deploy an MoE model. However, this model has a large number of experts while the number of activated experts is very small. So...

enhancement
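
The sparsity described in the issue above is the usual top-k routing pattern: each token activates only a handful of experts out of a much larger pool, which is what makes offloading the inactive experts attractive. A generic sketch of such a router (not MoE-Infinity or vLLM internals) under assumed shapes:

```python
import torch
import torch.nn.functional as F

def topk_route(hidden, gate_weight, k=2):
    """Select k experts per token from a large expert pool.

    hidden:      (tokens, d_model) activations
    gate_weight: (d_model, num_experts) router projection
    Only k of num_experts experts run per token, so most expert weights
    can stay offloaded most of the time.
    """
    logits = hidden @ gate_weight                      # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = torch.topk(probs, k, dim=-1)
    # Renormalize the selected experts' weights so they sum to 1 per token.
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return topk_idx, topk_probs

# Example: 128 experts in the model, only 2 activated per token.
idx, weights = topk_route(torch.randn(4, 1024), torch.randn(1024, 128), k=2)
```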

Hi. I'm new to this LLM world. I have a few questions regarding the engine. Does it support continuous batching? I'm asking because I'm trying to set a request per...

enhancement

## Description

Major changes for performance improvement.

## Motivation

- Support the latest Qwen3 MoE model
- Overlap hidden states gather with expert copy
- Reduce torch kernel launch overhead

##...
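
To make the "overlap hidden states gather with expert copy" item concrete, here is a hedged PyTorch sketch: the host-to-device copy of an expert's weights runs on a side CUDA stream while the hidden-state gather proceeds on the default stream. The tensor names and shapes are hypothetical, and this is not the PR's actual implementation.

```python
import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

# Hypothetical buffers: pinned host memory for one expert's weights, plus
# routing indices and hidden states already resident on the GPU.
expert_cpu = torch.randn(4096, 4096, pin_memory=True)
hidden = torch.randn(8192, 4096, device=device)
token_idx = torch.randint(0, 8192, (1024,), device=device)

# 1) Launch the host-to-device expert copy on a separate stream.
with torch.cuda.stream(copy_stream):
    expert_gpu = expert_cpu.to(device, non_blocking=True)

# 2) Meanwhile, gather the hidden states of the routed tokens on the
#    default stream; this runs concurrently with the copy above.
gathered = hidden.index_select(0, token_idx)

# 3) Make the default stream wait for the copy before using the weights.
torch.cuda.current_stream().wait_stream(copy_stream)
out = gathered @ expert_gpu.t()
```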