[moe] optim: reduce memory consumption in fused_moe

Open ch-wan opened this issue 10 months ago • 6 comments

Motivation

This PR can partially address #3633.

Modifications

We reuse the memory of intermediate_cache1 to create intermediate_cache3.
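A minimal sketch of how this kind of reuse can look in PyTorch, not the PR's actual code. It assumes intermediate_cache1 is no longer read by the time intermediate_cache3 is written, so both can be views into one flat buffer. The helper name alloc_shared_caches and its parameters are hypothetical.

```python
import torch


def alloc_shared_caches(num_tokens, topk, n, k, dtype=torch.float16, device="cpu"):
    """Hypothetical helper: return two views that share one allocation.

    Assumption: the first cache is dead before the second one is written,
    so overlapping their storage is safe.
    """
    # One flat buffer, sized for whichever of the two caches is larger.
    shared = torch.empty(num_tokens * topk * max(n, k), dtype=dtype, device=device)
    # Both views start at offset 0 of the same buffer.
    cache1 = shared[: num_tokens * topk * n].view(num_tokens, topk, n)
    cache3 = shared[: num_tokens * topk * k].view(num_tokens, topk, k)
    return cache1, cache3


cache1, cache3 = alloc_shared_caches(num_tokens=4, topk=2, n=16, k=8)
# Same starting address: no second allocation was made for cache3.
print(cache1.data_ptr() == cache3.data_ptr())
```

Because the two views alias the same memory, writing to cache3 overwrites cache1, which is only acceptable if nothing reads cache1 afterwards.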

Here is the test script:

import torch
from sglang.srt.layers.moe.fused_moe_triton.fused_moe import fused_moe

N = 64 * 1024  # number of tokens
E = 8          # number of experts
H = 4096       # hidden size
I = 8192       # intermediate size

torch.manual_seed(0)

# Inputs and weights are scaled down to keep float16 activations small.
x = torch.randn((N, H), device="cuda", dtype=torch.float16) / 32
w1 = torch.randn((E, I * 2, H), device="cuda", dtype=torch.float16) / 32
w2 = torch.randn((E, H, I), device="cuda", dtype=torch.float16) / 32

gating_output = torch.randn((N, E), device="cuda", dtype=torch.float16)
topk = 2  # experts selected per token

x = fused_moe(x, w1, w2, gating_output, topk, True)

print(x)
# Peak GPU memory allocated by PyTorch, in MB.
print(torch.cuda.max_memory_allocated() // 1024 // 1024, "MB")

The output of the original implementation:

tensor([[-2.9869e-03,  3.7422e-03, -2.4395e-03,  ..., -2.0447e-03,
          8.6212e-03,  3.5362e-03],
        [-9.5520e-03,  6.5231e-03, -5.9586e-03,  ...,  1.5235e-04,
         -4.0359e-03,  5.0354e-03],
        [-5.5618e-03,  1.4296e-03, -6.3705e-03,  ..., -3.5400e-03,
         -4.6921e-03,  1.0918e-02],
        ...,
        [ 1.5354e-03,  7.7057e-03,  3.3035e-03,  ..., -1.1559e-03,
         -4.1962e-03, -1.9894e-03],
        [-9.8801e-03, -4.3716e-03,  8.8358e-04,  ...,  8.3847e-03,
         -8.6594e-04,  1.0101e-02],
        [-2.0733e-03,  9.3555e-04, -9.3162e-05,  ..., -1.1826e-03,
         -3.6907e-03, -4.7035e-03]], device='cuda:0', dtype=torch.float16)
9730 MB

With this PR, the output is identical and peak memory drops from 9730 MB to 8706 MB, a 10.5% reduction:

tensor([[-2.9869e-03,  3.7422e-03, -2.4395e-03,  ..., -2.0447e-03,
          8.6212e-03,  3.5362e-03],
        [-9.5520e-03,  6.5231e-03, -5.9586e-03,  ...,  1.5235e-04,
         -4.0359e-03,  5.0354e-03],
        [-5.5618e-03,  1.4296e-03, -6.3705e-03,  ..., -3.5400e-03,
         -4.6921e-03,  1.0918e-02],
        ...,
        [ 1.5354e-03,  7.7057e-03,  3.3035e-03,  ..., -1.1559e-03,
         -4.1962e-03, -1.9894e-03],
        [-9.8801e-03, -4.3716e-03,  8.8358e-04,  ...,  8.3847e-03,
         -8.6594e-04,  1.0101e-02],
        [-2.0733e-03,  9.3555e-04, -9.3162e-05,  ..., -1.1826e-03,
         -3.6907e-03, -4.7035e-03]], device='cuda:0', dtype=torch.float16)
8706 MB
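As a rough sanity check on these numbers, the 1024 MB difference matches the size of one float16 buffer of shape (N, topk, H). That shape for intermediate_cache3, and all N tokens being processed in a single chunk, are assumptions here and are not stated in the PR description.

```python
# Values taken from the test script above.
N, topk, H = 64 * 1024, 2, 4096
BYTES_PER_FP16 = 2

# Assumption: intermediate_cache3 has shape (N, topk, H) in float16.
cache3_mb = N * topk * H * BYTES_PER_FP16 // 1024 // 1024

# Peak memory reported before and after the change.
before_mb, after_mb = 9730, 8706
saved_mb = before_mb - after_mb
reduction_pct = 100 * saved_mb / before_mb

print(cache3_mb, saved_mb, round(reduction_pct, 1))  # 1024 1024 10.5
```

Under those assumptions, the saving is exactly the allocation that intermediate_cache3 no longer needs.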

Checklist

  • [ ] Format your code according to the Code Formatting with Pre-Commit.
  • [ ] Add unit tests as outlined in the Running Unit Tests.
  • [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
  • [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
  • [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

ch-wan • Feb 19 '25 09:02

This is important for deploying on services such as Railway or Fly.io, which use IPv6 for internal networking.

If you connect to Redis over their internal network (IPv6), data transfers are free. Connecting over the public network (IPv4) is possible for now but incurs data transfer fees.

ju-li • Dec 16 '24 02:12

Yeah, I made these changes in my forked repo. I will propose a fix for this issue here.

iagocavalcante • Dec 27 '24 18:12

Hi @iagocavalcante, just wanted to check whether you had a chance to propose a fix for this issue. Greatly appreciate your work, man!

ju-li • Feb 26 '25 20:02

If we are to implement this, we could pass the family as a URL parameter, like redis://127.0.0.1:6379?family=6, instead of creating a new environment variable.
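To illustrate the idea only, here is a minimal Python sketch of reading the family from the URL's query string. It is not the project's code: the helper name parse_redis_url, the default of IPv4, and the fallback port 6379 are all assumptions.

```python
from urllib.parse import parse_qs, urlsplit


def parse_redis_url(url, default_family=4):
    """Hypothetical helper: extract host, port and IP family from a Redis URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # parse_qs returns a list per key; take the first value if present.
    family = int(query.get("family", [default_family])[0])
    if family not in (4, 6):
        raise ValueError(f"family must be 4 or 6, got {family}")
    return {
        "host": parts.hostname,
        "port": parts.port or 6379,  # assumed default Redis port
        "family": family,
    }


print(parse_redis_url("redis://127.0.0.1:6379?family=6"))
```

The point of the design is that the query string is kept and interpreted rather than discarded, so no extra environment variable is needed.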

Philipinho • Feb 26 '25 20:02

@Philipinho yeah, that would be ideal too. The issue is that, right now, URL params are stripped out.

ju-li • Feb 26 '25 20:02

Sorry for the delay. I sent a PR with the suggested solution.

iagocavalcante • Feb 26 '25 22:02