
[Feature Request] Explanation of Parallel Techniques

Open BaideBear opened this issue 9 months ago • 3 comments

Prerequisites

  • [x] I have searched existing issues and reviewed documentation.

Problem Description

May I ask which parallel techniques you have implemented? When I set CUDA_VISIBLE_DEVICES=0,1,2,3, all four cards carry some load and are performing computation, yet your TODO list states that expert parallelism is still being implemented, so I don't quite understand the current behavior of the program. Here is the status of my compute cards:

```
(test) moeserve@test~/MoE-Infinity$ CUDA_VISIBLE_DEVICES=0,1,2,3 python examples/interface_example.py --model_name_or_path "/home/moeserve/.cache/modelscope/hub/models/AI-ModelScope/Mixtral-8x7B-Instruct-v0.1" --offload_dir ~/offload/
```

```
Every 1.0s: nvidia-smi          test-NF5468-A7-A0-R0-00: Wed Mar 19 10:45:27 2025

Wed Mar 19 10:45:27 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
| 31%   44C    P0            106W /  450W |   23974MiB /  49140MiB |     12%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:41:00.0 Off |                  Off |
| 30%   43C    P0            100W /  450W |   23138MiB /  49140MiB |     10%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off |   00000000:81:00.0 Off |                  Off |
| 31%   42C    P0             88W /  450W |   23158MiB /  49140MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        Off |   00000000:C1:00.0 Off |                  Off |
| 30%   43C    P0             82W /  450W |   23408MiB /  49140MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          3328026      C   python                               23790MiB |
|    1   N/A  N/A          3328026      C   python                               22954MiB |
|    2   N/A  N/A          3328026      C   python                               22974MiB |
|    3   N/A  N/A          3328026      C   python                               23224MiB |
+-----------------------------------------------------------------------------------------+
```
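
For reference, the per-GPU footprint can also be checked from inside the Python process with plain PyTorch (a generic snippet, not part of MoE-Infinity):

```python
import torch

# Print allocated/reserved memory for every visible GPU, as seen by PyTorch.
for i in range(torch.cuda.device_count()):
    alloc_mib = torch.cuda.memory_allocated(i) / 2**20
    reserved_mib = torch.cuda.memory_reserved(i) / 2**20
    print(f"cuda:{i}: allocated={alloc_mib:.0f} MiB, reserved={reserved_mib:.0f} MiB")
```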

Proposed Solution

I

Alternatives Considered

No response

Additional Context

No response

Importance

Nice to have

Usage Statistics (Optional)

No response

BaideBear · Mar 19 '25 02:03

I found this piece of code in the distributed module. Are there any other parallel techniques besides this?

    # Round-robin placement: expert i is assigned to GPU (i mod num_gpus).
    total_gpus = torch.cuda.device_count()
    for expert_id in expert_list:
        gpu_id = expert_id % total_gpus
        self.expert_dispatcher.enqueue_expert(
            layer_id, expert_id, gpu_id, False
        )

    # Wait for the enqueued expert work to complete.
    result = self.expert_dispatcher.wait_expert()
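
To make the effect of this loop concrete, here is a standalone sketch of the resulting expert-to-GPU mapping. The expert and GPU counts are assumptions for illustration (8 experts per MoE layer as in Mixtral-8x7B, 4 visible GPUs); it is not MoE-Infinity code.

```python
# Illustration only: round-robin mapping of experts to GPUs, assuming
# 8 experts per MoE layer (Mixtral-8x7B) and 4 visible GPUs.
num_experts = 8
num_gpus = 4

placement = {expert_id: expert_id % num_gpus for expert_id in range(num_experts)}
print(placement)
# {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}
```

Each GPU ends up hosting the same number of experts per layer, which matches the roughly equal memory usage and utilization across the four cards in the nvidia-smi output above.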

BaideBear · Mar 19 '25 03:03

Cross-node expert parallelism is implemented but not yet tested; the current behavior is `expert_gpu_id = expert_id % num_gpu`, as a default option similar to DeepSpeed.

drunkcoding · Mar 19 '25 13:03

Hello to both of you! I am trying to run model-parallel (2 GPUs) inference with Mixtral, but it seems the current implementation is not automatically moving the activation from GPU0 to GPU1 when required. Do you know a workaround for this?

I dug a bit into the code, and it seems that the pre-forward hook function on self.lm_head of the Mixtral model is moving the activation from GPU0 to GPU1, even when the lm_head weight is on GPU0. The hook seems to come from the C++ implementation, which I haven't had the chance to look at yet. Can you help me understand why this hook function is necessary?
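
For context, the usual purpose of such a pre-forward hook is to keep a submodule's inputs on the same device as its weights when the model is split across GPUs. Below is a minimal PyTorch sketch of that pattern; it only illustrates the idea and is not MoE-Infinity's actual C++ hook, and the registration on `lm_head` is an assumption for the example.

```python
import torch
import torch.nn as nn

def move_inputs_to_module_device(module: nn.Module, inputs):
    """Forward pre-hook: move incoming tensors onto the device of the
    module's own parameters before its forward pass runs."""
    device = next(module.parameters()).device
    return tuple(
        t.to(device) if isinstance(t, torch.Tensor) else t for t in inputs
    )

# Hypothetical usage with a Mixtral model split over cuda:0 and cuda:1:
# model.lm_head.register_forward_pre_hook(move_inputs_to_module_device)
```

Registering such a hook manually on the affected submodules is one possible Python-level workaround while the behavior of the built-in hook is being clarified.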

taehyunzzz · Mar 29 '25 16:03