
PyTorch library for cost-effective, fast and easy serving of MoE models.

Results: 24 MoE-Infinity issues, sorted by recently updated

- Release experts-parallel version
- Correct README
- Support Arctic and Grok
- Remove installation dependency
- Remove circular dependency issue

Is there an unquantized version that can run on multiple GPUs?

- [x] API design
- [x] Documentation for installation and PyPI
- [x] Performance table
- [x] Support Mixtral multi-GPU
- [ ] Load trace

# Pull Request: Local Server Beta for OpenAI-Compatible APIs

This PR introduces a beta version of a local server that provides OpenAI-compatible APIs, specifically `v1/chat/completions` and `v1/completions`. This initial version...
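
As an illustration only: a minimal client sketch for such an OpenAI-compatible endpoint might look like the following. The host, port, and model name are assumptions, not values taken from this PR.

```python
import requests

# Hypothetical local endpoint; the actual host and port depend on how the
# MoE-Infinity server is launched.
BASE_URL = "http://localhost:8000"

payload = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model id
    "messages": [
        {"role": "user", "content": "Explain mixture-of-experts in one sentence."}
    ],
    "max_tokens": 64,
}

# Standard OpenAI-compatible chat completion request.
resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```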

A Colab T4 instance has 12GB of system RAM and 16GB of GPU memory, while the quantized Mixtral checkpoint is a single 26GB file, so it cannot be loaded into memory when creating the custom format for offloading.

enhancement

I am using a vLLM server to deploy an MoE model. However, this model has a large number of experts while the number of activated experts is very small. So...

enhancement
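
The sparsity described in the issue above is the usual top-k routing pattern: each token activates only a handful of experts out of a much larger pool, which is what makes offloading the inactive experts attractive. A generic sketch of such a router (not MoE-Infinity or vLLM internals) under assumed shapes:

```python
import torch
import torch.nn.functional as F

def topk_route(hidden, gate_weight, k=2):
    """Select k experts per token from a large expert pool.

    hidden:      (tokens, d_model) activations
    gate_weight: (d_model, num_experts) router projection
    Only k of num_experts experts run per token, so most expert weights
    can stay offloaded most of the time.
    """
    logits = hidden @ gate_weight                      # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = torch.topk(probs, k, dim=-1)
    # Renormalize the selected experts' weights so they sum to 1 per token.
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return topk_idx, topk_probs

# Example: 128 experts in the model, only 2 activated per token.
idx, weights = topk_route(torch.randn(4, 1024), torch.randn(1024, 128), k=2)
```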

Hi. I'm new to this LLM world. I have a few questions regarding the engine. Does it support continuous batching? I'm asking because I'm trying to set a request per...

enhancement

## Description

Major changes for performance improvement.

## Motivation

- Support the latest Qwen3 MoE model
- Overlap hidden states gather with expert copy
- Reduce torch kernel launch overhead

##...
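
To make the "overlap hidden states gather with expert copy" item concrete, here is a hedged PyTorch sketch: the host-to-device copy of an expert's weights runs on a side CUDA stream while the hidden-state gather proceeds on the default stream. The tensor names and shapes are hypothetical, and this is not the PR's actual implementation.

```python
import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

# Hypothetical buffers: pinned host memory for one expert's weights, plus
# routing indices and hidden states already resident on the GPU.
expert_cpu = torch.randn(4096, 4096, pin_memory=True)
hidden = torch.randn(8192, 4096, device=device)
token_idx = torch.randint(0, 8192, (1024,), device=device)

# 1) Launch the host-to-device expert copy on a separate stream.
with torch.cuda.stream(copy_stream):
    expert_gpu = expert_cpu.to(device, non_blocking=True)

# 2) Meanwhile, gather the hidden states of the routed tokens on the
#    default stream; this runs concurrently with the copy above.
gathered = hidden.index_select(0, token_idx)

# 3) Make the default stream wait for the copy before using the weights.
torch.cuda.current_stream().wait_stream(copy_stream)
out = gathered @ expert_gpu.t()
```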