
[Feature Request] Explanation of Parallel Techniques

Open BaideBear opened this issue 9 months ago • 3 comments

Prerequisites

  • [x] I have searched existing issues and reviewed documentation.

Problem Description

May I ask which parallel techniques you have implemented? When I set CUDA_VISIBLE_DEVICES=0,1,2,3, all four cards carry some load and are performing computation, yet your TODO list states that expert parallelism is still being implemented, so I don't quite understand the current behavior of the program. Here is the status of my compute cards:

```
(test) moeserve@test~/MoE-Infinity$ CUDA_VISIBLE_DEVICES=0,1,2,3 python examples/interface_example.py --model_name_or_path "/home/moeserve/.cache/modelscope/hub/models/AI-ModelScope/Mixtral-8x7B-Instruct-v0.1" --offload_dir ~/offload/
```

```
Every 1.0s: nvidia-smi          test-NF5468-A7-A0-R0-00: Wed Mar 19 10:45:27 2025

Wed Mar 19 10:45:27 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
| 31%   44C    P0            106W /  450W |   23974MiB /  49140MiB |     12%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:41:00.0 Off |                  Off |
| 30%   43C    P0            100W /  450W |   23138MiB /  49140MiB |     10%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off |   00000000:81:00.0 Off |                  Off |
| 31%   42C    P0             88W /  450W |   23158MiB /  49140MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        Off |   00000000:C1:00.0 Off |                  Off |
| 30%   43C    P0             82W /  450W |   23408MiB /  49140MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          3328026      C   python                               23790MiB |
|    1   N/A  N/A          3328026      C   python                               22954MiB |
|    2   N/A  N/A          3328026      C   python                               22974MiB |
|    3   N/A  N/A          3328026      C   python                               23224MiB |
+-----------------------------------------------------------------------------------------+
```
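
For reference, the per-GPU footprint can also be checked from inside the Python process with plain PyTorch (a generic snippet, not part of MoE-Infinity):

```python
import torch

# Print allocated/reserved memory for every visible GPU, as seen by PyTorch.
for i in range(torch.cuda.device_count()):
    alloc_mib = torch.cuda.memory_allocated(i) / 2**20
    reserved_mib = torch.cuda.memory_reserved(i) / 2**20
    print(f"cuda:{i}: allocated={alloc_mib:.0f} MiB, reserved={reserved_mib:.0f} MiB")
```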

Proposed Solution

I

Alternatives Considered

No response

Additional Context

No response

Importance

Nice to have

Usage Statistics (Optional)

No response

BaideBear · Mar 19 '25 02:03

I found this piece of code in the distributed module. Are there any other parallel techniques besides this?

    # Round-robin placement: expert i is assigned to GPU (i mod num_gpus).
    total_gpus = torch.cuda.device_count()
    for expert_id in expert_list:
        gpu_id = expert_id % total_gpus
        self.expert_dispatcher.enqueue_expert(
            layer_id, expert_id, gpu_id, False
        )

    # Wait for the enqueued expert work to complete.
    result = self.expert_dispatcher.wait_expert()
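
To make the effect of this loop concrete, here is a standalone sketch of the resulting expert-to-GPU mapping. The expert and GPU counts are assumptions for illustration (8 experts per MoE layer as in Mixtral-8x7B, 4 visible GPUs); it is not MoE-Infinity code.

```python
# Illustration only: round-robin mapping of experts to GPUs, assuming
# 8 experts per MoE layer (Mixtral-8x7B) and 4 visible GPUs.
num_experts = 8
num_gpus = 4

placement = {expert_id: expert_id % num_gpus for expert_id in range(num_experts)}
print(placement)
# {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}
```

Each GPU ends up hosting the same number of experts per layer, which matches the roughly equal memory usage and utilization across the four cards in the nvidia-smi output above.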

BaideBear · Mar 19 '25 03:03

Cross-node expert parallelism is implemented but not yet tested; the current behavior is `expert_gpu_id = expert_id % num_gpu`, as a default option similar to DeepSpeed.

drunkcoding · Mar 19 '25 13:03

Hello to both of you! I am trying to run model-parallel (2 GPUs) inference with Mixtral, but it seems the current implementation is not automatically moving the activation from GPU0 to GPU1 when required. Do you know a workaround for this?

I dug a bit into the code, and it seems that the pre-forward hook function on self.lm_head of the Mixtral model is moving the activation from GPU0 to GPU1, even when the lm_head weight is on GPU0. The hook seems to come from the C++ implementation, which I haven't had the chance to look at yet. Can you help me understand why this hook function is necessary?
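
For context, the usual purpose of such a pre-forward hook is to keep a submodule's inputs on the same device as its weights when the model is split across GPUs. Below is a minimal PyTorch sketch of that pattern; it only illustrates the idea and is not MoE-Infinity's actual C++ hook, and the registration on `lm_head` is an assumption for the example.

```python
import torch
import torch.nn as nn

def move_inputs_to_module_device(module: nn.Module, inputs):
    """Forward pre-hook: move incoming tensors onto the device of the
    module's own parameters before its forward pass runs."""
    device = next(module.parameters()).device
    return tuple(
        t.to(device) if isinstance(t, torch.Tensor) else t for t in inputs
    )

# Hypothetical usage with a Mixtral model split over cuda:0 and cuda:1:
# model.lm_head.register_forward_pre_hook(move_inputs_to_module_device)
```

Registering such a hook manually on the affected submodules is one possible Python-level workaround while the behavior of the built-in hook is being clarified.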

taehyunzzz · Mar 29 '25 16:03