[moe] optim: reduce memory consumption in fused_moe
Motivation
This PR can partially address #3633.
Modifications
We reuse the memory of intermediate_cache1 to create intermediate_cache3.
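The idea can be illustrated with a minimal sketch. It is not the code from this PR: the helper name `allocate_shared_caches` and the shapes are hypothetical, and it assumes that intermediate_cache1 is no longer read by the time intermediate_cache3 is written, so the two can be views over one allocation.

```python
import torch

def allocate_shared_caches(m, topk, n, h, dtype=torch.float16, device="cpu"):
    """Hypothetical helper: back two intermediates with a single flat buffer."""
    # Allocate once, sized for the larger of the two intermediates.
    flat = torch.empty(m * topk * max(n, h), dtype=dtype, device=device)
    # Both caches are views that start at offset 0 of the same storage.
    cache1 = flat[: m * topk * n].view(m, topk, n)
    cache3 = flat[: m * topk * h].view(m, topk, h)
    return cache1, cache3

cache1, cache3 = allocate_shared_caches(m=4, topk=2, n=16, h=8)
# True when both views begin at the same address, i.e. no second allocation.
shares_storage = cache1.data_ptr() == cache3.data_ptr()
print(tuple(cache1.shape), tuple(cache3.shape), shares_storage)
```

Because the two tensors alias the same memory, this only works if nothing reads cache1 after the first write to cache3.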
Here is the test script:
import torch
from sglang.srt.layers.moe.fused_moe_triton.fused_moe import fused_moe

N = 64 * 1024  # number of tokens
E = 8          # number of experts
H = 4096       # hidden size
I = 8192       # intermediate size

torch.manual_seed(0)
# Scale the inputs and weights down to keep float16 activations small.
x = torch.randn((N, H), device="cuda", dtype=torch.float16) / 32
w1 = torch.randn((E, I * 2, H), device="cuda", dtype=torch.float16) / 32
w2 = torch.randn((E, H, I), device="cuda", dtype=torch.float16) / 32
gating_output = torch.randn((N, E), device="cuda", dtype=torch.float16)
topk = 2

x = fused_moe(x, w1, w2, gating_output, topk, True)
print(x)
# Peak allocated GPU memory, which also includes the inputs and weights above.
print(torch.cuda.max_memory_allocated() // 1024 // 1024, "MB")
The output of the original implementation:
tensor([[-2.9869e-03, 3.7422e-03, -2.4395e-03, ..., -2.0447e-03,
8.6212e-03, 3.5362e-03],
[-9.5520e-03, 6.5231e-03, -5.9586e-03, ..., 1.5235e-04,
-4.0359e-03, 5.0354e-03],
[-5.5618e-03, 1.4296e-03, -6.3705e-03, ..., -3.5400e-03,
-4.6921e-03, 1.0918e-02],
...,
[ 1.5354e-03, 7.7057e-03, 3.3035e-03, ..., -1.1559e-03,
-4.1962e-03, -1.9894e-03],
[-9.8801e-03, -4.3716e-03, 8.8358e-04, ..., 8.3847e-03,
-8.6594e-04, 1.0101e-02],
[-2.0733e-03, 9.3555e-04, -9.3162e-05, ..., -1.1826e-03,
-3.6907e-03, -4.7035e-03]], device='cuda:0', dtype=torch.float16)
9730 MB
The output with this PR, which reduces peak memory by 10.5% (from 9730 MB to 8706 MB) while producing identical results:
tensor([[-2.9869e-03, 3.7422e-03, -2.4395e-03, ..., -2.0447e-03,
8.6212e-03, 3.5362e-03],
[-9.5520e-03, 6.5231e-03, -5.9586e-03, ..., 1.5235e-04,
-4.0359e-03, 5.0354e-03],
[-5.5618e-03, 1.4296e-03, -6.3705e-03, ..., -3.5400e-03,
-4.6921e-03, 1.0918e-02],
...,
[ 1.5354e-03, 7.7057e-03, 3.3035e-03, ..., -1.1559e-03,
-4.1962e-03, -1.9894e-03],
[-9.8801e-03, -4.3716e-03, 8.8358e-04, ..., 8.3847e-03,
-8.6594e-04, 1.0101e-02],
[-2.0733e-03, 9.3555e-04, -9.3162e-05, ..., -1.1826e-03,
-3.6907e-03, -4.7035e-03]], device='cuda:0', dtype=torch.float16)
8706 MB
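As a sanity check on these numbers, the short calculation below compares the measured saving with the size of one float16 tensor of shape (N, topk, H). The assumption that intermediate_cache3 has exactly this shape, meaning the whole batch is processed in a single chunk, is mine and is not stated in the PR.

```python
# Values from the test script and the two measurements above.
N, H, topk = 64 * 1024, 4096, 2
bytes_per_elem = 2  # float16

# Size of one (N, topk, H) float16 tensor in MiB (assumed shape of intermediate_cache3).
cache3_mib = N * topk * H * bytes_per_elem // (1024 * 1024)

before_mb, after_mb = 9730, 8706
saved_mb = before_mb - after_mb
reduction_pct = round(100 * saved_mb / before_mb, 1)

print(cache3_mib, saved_mb, reduction_pct)
```

Under that assumption the saving equals the size of a single such buffer, which is consistent with intermediate_cache3 no longer needing its own allocation.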
Checklist
- [ ] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
This is important for deploying on services such as Railway or Fly.io, which use IPv6 for internal networking.
If you connect to Redis over their internal networking (IPv6), data transfers are free. Connecting over public networking (IPv4) is possible for now but incurs data transfer fees.
Yeah, I made these changes on my fork. I will propose a fix for this issue here.
Hi @iagocavalcante, just wanted to check if you had a chance to propose a fix for this issue. Greatly appreciate your work, man!
If we are to implement this, we could pass the family as a URL parameter, like redis://127.0.0.1:6379?family=6, instead of creating a new environment variable.
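To make the suggestion concrete, here is a minimal Python sketch of reading a `family` query parameter from a Redis URL. The function name `parse_redis_url`, the returned dictionary, and the default of 4 when the parameter is absent are all assumptions for illustration, not this project's actual code or configuration format.

```python
from urllib.parse import urlsplit, parse_qs

def parse_redis_url(url):
    """Hypothetical helper: extract host, port and IP family from a Redis URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # Assumed default: IPv4 when no family parameter is given.
    family = int(query.get("family", ["4"])[0])
    if family not in (4, 6):
        raise ValueError(f"unsupported IP family: {family}")
    return {
        "host": parts.hostname,
        "port": parts.port or 6379,
        "family": family,
    }

options = parse_redis_url("redis://127.0.0.1:6379?family=6")
print(options)
```

The key point is that the query string is parsed and kept rather than dropped before the connection options are built.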
@Philipinho Yeah, that would be ideal too. The issue is that right now the params are stripped out.
Sorry for the delay. I have sent a PR with the suggested solution.