sglang [moe] optim: reduce memory consumption in fused_moe

Motivation

This PR partially addresses #3633.

Modifications

We reuse the memory of intermediate_cache1 to create intermediate_cache3.
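
The idea can be illustrated with a minimal sketch. This is not the code from the PR: the helper name make_caches and the parameters m, n, and k are hypothetical, and it runs on CPU only so it can be tried without a GPU. It assumes that intermediate_cache1 has been fully consumed before anything is written to intermediate_cache3, so the two views can safely alias one allocation.

```python
import torch


def make_caches(m, topk, n, k, dtype=torch.float16, device="cpu"):
    """Carve two cache views out of a single flat allocation.

    m: number of tokens, topk: experts per token,
    n: per-token width of the first cache, k: per-token width of the third.
    All names here are illustrative, not taken from the sglang source.
    """
    # One buffer sized for the larger of the two caches.
    flat = torch.empty(m * topk * max(n, k), dtype=dtype, device=device)
    # Both views start at offset 0, so they share the same memory.
    cache1 = flat[: m * topk * n].view(m, topk, n)
    cache3 = flat[: m * topk * k].view(m, topk, k)
    return cache1, cache3


cache1, cache3 = make_caches(m=4, topk=2, n=16, k=8)
print(cache1.shape, cache3.shape)
print(cache1.data_ptr() == cache3.data_ptr())
```

Because both views alias the same storage, writing to cache3 overwrites the contents of cache1. The reuse is only valid when nothing reads cache1 after that point.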

Here is the test script:

import torch
from sglang.srt.layers.moe.fused_moe_triton.fused_moe import fused_moe

N = 64 * 1024  # number of tokens
E = 8          # number of experts
H = 4096       # hidden size
I = 8192       # intermediate size per expert

torch.manual_seed(0)

x = torch.randn((N, H), device="cuda", dtype=torch.float16) / 32
w1 = torch.randn((E, I * 2, H), device="cuda", dtype=torch.float16) / 32
w2 = torch.randn((E, H, I), device="cuda", dtype=torch.float16) / 32

gating_output = torch.randn((N, E), device="cuda", dtype=torch.float16)
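topk = 2  # experts routed per token

x = fused_moe(x, w1, w2, gating_output, topk, True)

print(x)
# max_memory_allocated() returns bytes; the two divisions convert to MiB.
print(torch.cuda.max_memory_allocated() // 1024 // 1024, "MB")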

The output of the original implementation:

tensor([[-2.9869e-03,  3.7422e-03, -2.4395e-03,  ..., -2.0447e-03,
          8.6212e-03,  3.5362e-03],
        [-9.5520e-03,  6.5231e-03, -5.9586e-03,  ...,  1.5235e-04,
         -4.0359e-03,  5.0354e-03],
        [-5.5618e-03,  1.4296e-03, -6.3705e-03,  ..., -3.5400e-03,
         -4.6921e-03,  1.0918e-02],
        ...,
        [ 1.5354e-03,  7.7057e-03,  3.3035e-03,  ..., -1.1559e-03,
         -4.1962e-03, -1.9894e-03],
        [-9.8801e-03, -4.3716e-03,  8.8358e-04,  ...,  8.3847e-03,
         -8.6594e-04,  1.0101e-02],
        [-2.0733e-03,  9.3555e-04, -9.3162e-05,  ..., -1.1826e-03,
         -3.6907e-03, -4.7035e-03]], device='cuda:0', dtype=torch.float16)
9730 MB

This PR reduces peak memory by 10.5% (from 9730 MB to 8706 MB), and the output is unchanged:

tensor([[-2.9869e-03,  3.7422e-03, -2.4395e-03,  ..., -2.0447e-03,
          8.6212e-03,  3.5362e-03],
        [-9.5520e-03,  6.5231e-03, -5.9586e-03,  ...,  1.5235e-04,
         -4.0359e-03,  5.0354e-03],
        [-5.5618e-03,  1.4296e-03, -6.3705e-03,  ..., -3.5400e-03,
         -4.6921e-03,  1.0918e-02],
        ...,
        [ 1.5354e-03,  7.7057e-03,  3.3035e-03,  ..., -1.1559e-03,
         -4.1962e-03, -1.9894e-03],
        [-9.8801e-03, -4.3716e-03,  8.8358e-04,  ...,  8.3847e-03,
         -8.6594e-04,  1.0101e-02],
        [-2.0733e-03,  9.3555e-04, -9.3162e-05,  ..., -1.1826e-03,
         -3.6907e-03, -4.7035e-03]], device='cuda:0', dtype=torch.float16)
8706 MB
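
As a sanity check, the figures above can be reproduced by arithmetic. The sketch below assumes that intermediate_cache3 has shape (tokens, topk, hidden) in float16 and that all tokens are processed in a single chunk. Both are assumptions about the implementation, not something stated in this PR.

```python
# Values copied from the test script and its printed results.
N, H, topk = 64 * 1024, 4096, 2
bytes_per_fp16 = 2
before_mb, after_mb = 9730, 8706

# Assumed size of intermediate_cache3: (tokens, topk, hidden) in float16.
cache3_mb = N * topk * H * bytes_per_fp16 // 1024 // 1024
saved_mb = before_mb - after_mb
reduction_pct = round(100 * saved_mb / before_mb, 1)

print(cache3_mb, saved_mb, reduction_pct)  # 1024 1024 10.5
```

Under these assumptions, the measured saving of 1024 MB matches the size of one intermediate_cache3 buffer.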

Checklist

[ ] Format your code according to the Code Formatting with Pre-Commit.
[ ] Add unit tests as outlined in the Running Unit Tests.
[ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
[ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
[ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
[ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

Feb 19 '25 09:02 ch-wan

This is important for deploying on services such as Railway or Fly.io, which use IPv6 for internal networking.

If you connect to Redis over their internal networking (IPv6), data transfers are free. Connecting over public networking (IPv4) is possible for now but incurs data transfer fees.

Dec 16 '24 02:12 ju-li

Yeah, I made these changes in my forked repo. I will propose a fix for this issue here.

Dec 27 '24 18:12 iagocavalcante

Hi @iagocavalcante, just wanted to check if you had a chance to propose a fix for this issue. Greatly appreciate your work, man!

Feb 26 '25 20:02 ju-li

If we are to implement this, we could pass the family as a param, like redis://127.0.0.1:6379?family=6, instead of creating a new environment variable.
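
A rough sketch of what reading that param could look like, written in Python with the standard library only. The function name parse_redis_url, the returned keys, and the default of 4 when family is absent are all assumptions for illustration. The project's actual Redis client and language may differ.

```python
from urllib.parse import parse_qs, urlsplit


def parse_redis_url(url):
    """Split a Redis URL into host, port, and IP family (hypothetical helper)."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # Assumption: fall back to IPv4 when no family param is given.
    family = int(query.get("family", ["4"])[0])
    if family not in (4, 6):
        raise ValueError(f"unsupported IP family: {family}")
    return {
        "host": parts.hostname,
        "port": parts.port or 6379,  # 6379 is the conventional Redis port
        "family": family,
    }


print(parse_redis_url("redis://127.0.0.1:6379?family=6"))
```

The point is only that the query string has to survive parsing so that family can be forwarded to the client's connection options.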

Feb 26 '25 20:02 Philipinho

@Philipinho yeah, that would be ideal too. The issue is that right now the params are stripped out.

Feb 26 '25 20:02 ju-li

Sorry for the delay. I sent a PR with the suggested solution.

Feb 26 '25 22:02 iagocavalcante
