
Support for Blackwell?

Open ec-jt opened this issue 4 months ago • 20 comments

I was looking at building kernels; however, I get RuntimeError: CUDA error: mapping of buffer object failed:

Loading expert weights - 100.0% .. Syncing with other peers..
[rank4]: Traceback (most recent call last):
[rank4]:   File "/workspace/Tutel/llm_moe_tutel.py", line 433, in <module>
[rank4]:     sigp = torch.ops.tutel_ops.uncached_exchange(sigp[0], net.simple_all_gather(sigp[1]), world_rank)
[rank4]:   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
[rank4]:     return self._op(*args, **(kwargs or {}))
[rank4]: RuntimeError: CUDA error: mapping of buffer object failed
[rank4]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank4]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank4]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

ec-jt avatar Jul 28 '25 07:07 ec-jt

It should be compatible with Blackwell, but the strategies are not optimized for it.

How many cross-node machines do you use? And how many local GPUs are there per machine?

ghostplant avatar Jul 28 '25 07:07 ghostplant

Currently testing this on 8 local GPUs in one machine.

ec-jt avatar Jul 28 '25 08:07 ec-jt

The docker image hasn't been tested on Blackwell for a few months. I will dive into this within a day.

ghostplant avatar Jul 28 '25 08:07 ghostplant

Please include CUDA Compute Capability 12.0 (sm_120) 👍

ec-jt avatar Jul 28 '25 14:07 ec-jt

I don't have an sm_120 card; can you do me a favor and validate whether some images work well on that GPU?

ghostplant avatar Jul 29 '25 06:07 ghostplant

Sure, or alternatively you can just send the ops.

ec-jt avatar Jul 29 '25 11:07 ec-jt

We found the image works well on sm_100; the error you show seems to be caused by a failure due to a non-standard GPU-GPU connection.

Q1: May I know if your multi-5090 environment supports GPU-to-GPU P2P access? (A quick check sketch follows Case-B below.)

Q2: Are you using the official Tutel docker image, or a customized environment?

Q3: Can you check which of the two cases below works and which doesn't? (Please use the exact docker arguments without customization.)

(Case-A)

docker run -e LOCAL_SIZE=1 -e LAYER=1 -it --rm --ipc=host --net=host --shm-size=8g --ulimit memlock=-1 \
      --ulimit stack=67108864 --gpus=all -v /:/host -w /host$(pwd) \
      tutelgroup/deepseek-671b:a100x8-chat-20250723 \
        --try_path Danucore/Qwen3-235B-A22B-Instruct-2507-FP4

(Case-B)

docker run -e LOCAL_SIZE=2 -e LAYER=1 -it --rm --ipc=host --net=host --shm-size=8g --ulimit memlock=-1 \
      --ulimit stack=67108864 --gpus=all -v /:/host -w /host$(pwd) \
      tutelgroup/deepseek-671b:a100x8-chat-20250723 \
        --try_path Danucore/Qwen3-235B-A22B-Instruct-2507-FP4

ghostplant avatar Jul 29 '25 12:07 ghostplant

Is the repo NVFP4/Qwen3-235B-A22B-Instruct-2507-FP4?

ec-jt avatar Jul 29 '25 21:07 ec-jt

Yes, use the model path on your local machine.

ghostplant avatar Jul 30 '25 05:07 ghostplant

@ec-jt The error is due to your GPUs not having inter-GPU P2P copy support, so a slower path will be needed.

ghostplant avatar Jul 31 '25 20:07 ghostplant

I upgraded to triton==3.3.1 and built tutel from source. The examples work but are slow (e.g. --nproc_per_node=2 -m tutel.examples.helloworld); however, launching llm_moe_tutel.py errors at backend.hpp:139 on sm_120.

LOCAL_SIZE=1 LAYER=1 TORCHDYNAMO_VERBOSE=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL CUDA_VISIBLE_DEVICES=0,1 OP_LOADER=/opt/deepseek-tutel-accel/ops.b200 \
  python3 -m torch.distributed.run --nproc_per_node=1 llm_moe_tutel.py --buffer_size 32 \
  --prompt "Calculate the indefinite integral of 1/sin(x) + x" --serve --listen_port 8000 \
  --try_path /host/home/ubuntu/Qwen3-30B-A3B-Thinking-2507-FP4

[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
CRITICAL:root:Registering device global rank 0: data_rank = 0, model_rank = 0
[INFO] Discover the model from local path: /host/home/ubuntu/Qwen3-30B-A3B-Thinking-2507-FP4, chosen as the default model.
Loading shared weights - 0.0% .. Loading expert weights - 100.0% .. Synchronizing with other peers..
[CheckFail] /opt/deepseek-tutel-accel/Tutel/tutel/custom/backend.hpp:139
E0805 21:39:09.375000 2001 torch/distributed/elastic/multiprocessing/api.py:880] failed (exitcode: 1) local_rank: 0 (pid: 2067) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 940, in <module>
    main()
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 151, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 288, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
llm_moe_tutel.py FAILED

Sanity check:

python3 -m torch.distributed.run --nproc_per_node=2 -m tutel.examples.helloworld --batch_size=16

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
CRITICAL:root:Registering device global rank 0: data_rank = 0, model_rank = 0
CRITICAL:root:Registering device global rank 1: data_rank = 0, model_rank = 1
[Statistics] param count for MoE local_experts = 16785408, param count for MoE gate = 8192.

ExampleModel(
  (_moe_layer): MOELayer(
    Top-K(s) = ['k=2, noise=0.0'], Total-Experts = 4 [managed by 2 device(s)],
    (experts): FusedExpertsNetwork(model_dim=2048, hidden_size=2048, output_dim=2048, num_experts_per_device=2. has_fc1_bias=True, has_fc2_bias=True.)
    (gates): ModuleList(
      (0): LinearTopKGate(
        (wg): Linear(in_features=2048, out_features=4, bias=False)
      )
    )
  )
)
[Benchmark] world_size = 2, dtype = float32, model_dim = 2048, hidden_size = 2048, samples = 8192, num_local_experts = 2, topK = 2, a2a_ffn_overlap_degree = 1, parallel_type = adaptive:1, device = cuda:0
STEP-0: loss = 22.34400, step_time = 3.500774 sec, perf = 0.24 tflops.
STEP-1: loss = 12.98207, step_time = 0.149762 sec, perf = 5.51 tflops.
STEP-2: loss = 5.02243, step_time = 0.139273 sec, perf = 5.92 tflops.

ec-jt avatar Aug 05 '25 23:08 ec-jt

Can you run these 2 cases to show your device capability?

python3 -m tutel.examples.helloworld --batch_size=16
python3 -m torch.distributed.run --nproc_per_node=2 -m tutel.examples.bandwidth_test --size_mb=1
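
For reference, the bandwidth test essentially times collectives over a fixed-size buffer. A rough hand-rolled AllToAll probe (a sketch only, not the actual tutel.examples.bandwidth_test implementation) would look like this:

# alltoall_probe.py -- sketch only; launch with:
#   python3 -m torch.distributed.run --nproc_per_node=2 alltoall_probe.py
import os, time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

numel = 1 * 1024 * 1024 // 4                  # ~1 MB of float32 per rank
src = torch.randn(numel, device="cuda")
dst = torch.empty_like(src)

for _ in range(5):                            # warm-up iterations
    dist.all_to_all_single(dst, src)
torch.cuda.synchronize()

iters = 20
t0 = time.time()
for _ in range(iters):
    dist.all_to_all_single(dst, src)
torch.cuda.synchronize()
avg = (time.time() - t0) / iters

if dist.get_rank() == 0:
    print(f"AllToAll average bandwidth ~= {numel * 4 / 1e9 / avg:.3f} GB/s")
dist.destroy_process_group()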

ghostplant avatar Aug 06 '25 00:08 ghostplant

python3 -m torch.distributed.run --nproc_per_node=2 -m tutel.examples.bandwidth_test --size_mb=1


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


CRITICAL:root:Registering device global rank 1: data_rank = 0, model_rank = 1
CRITICAL:root:Registering device global rank 0: data_rank = 0, model_rank = 0
AllToAll average bandwidth across 2 node(s) = 2.7356 GB/s
AllReduce average bandwidth across 2 node(s) = 1.9515 GB/s
AllGather average bandwidth across 2 node(s) = 2.7832 GB/s
ReduceScatter average bandwidth across 2 node(s) = 2.7650 GB/s

python3 -m tutel.examples.helloworld --batch_size=16

CRITICAL:root:Registering device global rank 0: data_rank = 0, model_rank = 0
[Statistics] param count for MoE local_experts = 16785408, param count for MoE gate = 4096.

ExampleModel( (_moe_layer): MOELayer( Top-K(s) = ['k=2, noise=0.0'], Total-Experts = 2 [managed by 1 device(s)], (experts): FusedExpertsNetwork(model_dim=2048, hidden_size=2048, output_dim=2048, num_experts_per_device=2. has_fc1_bias=True, has_fc2_bias=True.) (gates): ModuleList( (0): LinearTopKGate( (wg): Linear(in_features=2048, out_features=2, bias=False) ) ) ) ) [Benchmark] world_size = 1, dtype = float32, model_dim = 2048, hidden_size = 2048, samples = 8192, num_local_experts = 2, topK = 2, a2a_ffn_overlap_degree = 1, parallel_type = adaptive:1, device = cuda:0 STEP-0: loss = 22.87473, step_time = 3.396942 sec, perf = 0.24 tflops. STEP-1: loss = 16.20272, step_time = 0.024317 sec, perf = 33.91 tflops. STEP-2: loss = 10.79772, step_time = 0.023094 sec, perf = 35.71 tflops. STEP-3: loss = 6.32665, step_time = 0.023494 sec, perf = 35.10 tflops. STEP-4: loss = 3.05635, step_time = 0.022852 sec, perf = 36.09 tflops. STEP-5: loss = 1.24283, step_time = 0.022680 sec, perf = 36.36 tflops. STEP-6: loss = 0.31294, step_time = 0.022701 sec, perf = 36.33 tflops. STEP-7: loss = 0.06158, step_time = 0.022778 sec, perf = 36.20 tflops. STEP-8: loss = 0.04472, step_time = 0.022695 sec, perf = 36.33 tflops. STEP-9: loss = 0.03571, step_time = 0.022604 sec, perf = 36.48 tflops. STEP-10: loss = 0.02992, step_time = 0.022766 sec, perf = 36.22 tflops. STEP-11: loss = 0.02584, step_time = 0.023172 sec, perf = 35.59 tflops. STEP-12: loss = 0.02278, step_time = 0.023402 sec, perf = 35.24 tflops. STEP-13: loss = 0.02040, step_time = 0.023740 sec, perf = 34.74 tflops. STEP-14: loss = 0.01848, step_time = 0.023984 sec, perf = 34.38 tflops. STEP-15: loss = 0.01691, step_time = 0.024098 sec, perf = 34.22 tflops. STEP-16: loss = 0.01559, step_time = 0.024076 sec, perf = 34.25 tflops. STEP-17: loss = 0.01447, step_time = 0.024091 sec, perf = 34.23 tflops. STEP-18: loss = 0.01350, step_time = 0.024116 sec, perf = 34.19 tflops. STEP-19: loss = 0.01266, step_time = 0.024456 sec, perf = 33.72 tflops. STEP-20: loss = 0.01192, step_time = 0.026313 sec, perf = 31.34 tflops. STEP-21: loss = 0.01126, step_time = 0.024196 sec, perf = 34.08 tflops. STEP-22: loss = 0.01067, step_time = 0.024114 sec, perf = 34.20 tflops. STEP-23: loss = 0.01014, step_time = 0.024089 sec, perf = 34.23 tflops. STEP-24: loss = 0.00967, step_time = 0.024272 sec, perf = 33.98 tflops. STEP-25: loss = 0.00923, step_time = 0.024568 sec, perf = 33.56 tflops. STEP-26: loss = 0.00884, step_time = 0.024664 sec, perf = 33.43 tflops. STEP-27: loss = 0.00848, step_time = 0.024675 sec, perf = 33.42 tflops. STEP-28: loss = 0.00814, step_time = 0.024523 sec, perf = 33.63 tflops. STEP-29: loss = 0.00784, step_time = 0.024438 sec, perf = 33.74 tflops. STEP-30: loss = 0.00755, step_time = 0.024473 sec, perf = 33.70 tflops. STEP-31: loss = 0.00729, step_time = 0.024506 sec, perf = 33.65 tflops. STEP-32: loss = 0.00704, step_time = 0.024546 sec, perf = 33.60 tflops. STEP-33: loss = 0.00681, step_time = 0.024406 sec, perf = 33.79 tflops. STEP-34: loss = 0.00660, step_time = 0.024549 sec, perf = 33.59 tflops. STEP-35: loss = 0.00639, step_time = 0.024592 sec, perf = 33.53 tflops. STEP-36: loss = 0.00620, step_time = 0.024596 sec, perf = 33.53 tflops. STEP-37: loss = 0.00603, step_time = 0.024624 sec, perf = 33.49 tflops. STEP-38: loss = 0.00586, step_time = 0.024639 sec, perf = 33.47 tflops. STEP-39: loss = 0.00570, step_time = 0.024602 sec, perf = 33.52 tflops. STEP-40: loss = 0.00555, step_time = 0.024574 sec, perf = 33.56 tflops. 
STEP-41: loss = 0.00541, step_time = 0.024585 sec, perf = 33.54 tflops. STEP-42: loss = 0.00527, step_time = 0.024565 sec, perf = 33.57 tflops. STEP-43: loss = 0.00514, step_time = 0.024600 sec, perf = 33.52 tflops. STEP-44: loss = 0.00502, step_time = 0.024598 sec, perf = 33.52 tflops. STEP-45: loss = 0.00490, step_time = 0.024606 sec, perf = 33.51 tflops. STEP-46: loss = 0.00479, step_time = 0.024518 sec, perf = 33.63 tflops. STEP-47: loss = 0.00468, step_time = 0.024608 sec, perf = 33.51 tflops. STEP-48: loss = 0.00458, step_time = 0.024602 sec, perf = 33.52 tflops. STEP-49: loss = 0.00448, step_time = 0.024596 sec, perf = 33.53 tflops. STEP-50: loss = 0.00439, step_time = 0.024562 sec, perf = 33.57 tflops. STEP-51: loss = 0.00430, step_time = 0.024612 sec, perf = 33.51 tflops. STEP-52: loss = 0.00422, step_time = 0.024663 sec, perf = 33.44 tflops. STEP-53: loss = 0.00413, step_time = 0.024673 sec, perf = 33.42 tflops. STEP-54: loss = 0.00405, step_time = 0.024590 sec, perf = 33.53 tflops. STEP-55: loss = 0.00398, step_time = 0.024599 sec, perf = 33.52 tflops. STEP-56: loss = 0.00390, step_time = 0.024604 sec, perf = 33.52 tflops. STEP-57: loss = 0.00383, step_time = 0.024593 sec, perf = 33.53 tflops. STEP-58: loss = 0.00377, step_time = 0.024599 sec, perf = 33.52 tflops. STEP-59: loss = 0.00370, step_time = 0.024573 sec, perf = 33.56 tflops. STEP-60: loss = 0.00364, step_time = 0.024606 sec, perf = 33.51 tflops. STEP-61: loss = 0.00358, step_time = 0.024594 sec, perf = 33.53 tflops. STEP-62: loss = 0.00352, step_time = 0.024605 sec, perf = 33.52 tflops. STEP-63: loss = 0.00346, step_time = 0.024477 sec, perf = 33.69 tflops. STEP-64: loss = 0.00340, step_time = 0.024618 sec, perf = 33.50 tflops. STEP-65: loss = 0.00335, step_time = 0.024646 sec, perf = 33.46 tflops. STEP-66: loss = 0.00330, step_time = 0.024636 sec, perf = 33.47 tflops. STEP-67: loss = 0.00325, step_time = 0.024618 sec, perf = 33.50 tflops. STEP-68: loss = 0.00320, step_time = 0.024629 sec, perf = 33.48 tflops. STEP-69: loss = 0.00315, step_time = 0.024645 sec, perf = 33.46 tflops. STEP-70: loss = 0.00310, step_time = 0.024644 sec, perf = 33.46 tflops. STEP-71: loss = 0.00306, step_time = 0.024632 sec, perf = 33.48 tflops. STEP-72: loss = 0.00302, step_time = 0.024667 sec, perf = 33.43 tflops. STEP-73: loss = 0.00297, step_time = 0.024645 sec, perf = 33.46 tflops. STEP-74: loss = 0.00293, step_time = 0.024606 sec, perf = 33.51 tflops. STEP-75: loss = 0.00289, step_time = 0.024629 sec, perf = 33.48 tflops. STEP-76: loss = 0.00285, step_time = 0.024657 sec, perf = 33.44 tflops. STEP-77: loss = 0.00282, step_time = 0.024620 sec, perf = 33.49 tflops. STEP-78: loss = 0.00278, step_time = 0.024643 sec, perf = 33.46 tflops. STEP-79: loss = 0.00274, step_time = 0.024649 sec, perf = 33.45 tflops. STEP-80: loss = 0.00271, step_time = 0.024534 sec, perf = 33.61 tflops. STEP-81: loss = 0.00268, step_time = 0.024625 sec, perf = 33.49 tflops. STEP-82: loss = 0.00264, step_time = 0.024631 sec, perf = 33.48 tflops. STEP-83: loss = 0.00261, step_time = 0.024659 sec, perf = 33.44 tflops. STEP-84: loss = 0.00258, step_time = 0.024621 sec, perf = 33.49 tflops. STEP-85: loss = 0.00255, step_time = 0.024623 sec, perf = 33.49 tflops. STEP-86: loss = 0.00252, step_time = 0.024656 sec, perf = 33.45 tflops. STEP-87: loss = 0.00249, step_time = 0.024681 sec, perf = 33.41 tflops. STEP-88: loss = 0.00246, step_time = 0.024646 sec, perf = 33.46 tflops. STEP-89: loss = 0.00243, step_time = 0.024759 sec, perf = 33.31 tflops. 
STEP-90: loss = 0.00240, step_time = 0.024654 sec, perf = 33.45 tflops. STEP-91: loss = 0.00238, step_time = 0.024627 sec, perf = 33.49 tflops. STEP-92: loss = 0.00235, step_time = 0.024651 sec, perf = 33.45 tflops. STEP-93: loss = 0.00233, step_time = 0.024665 sec, perf = 33.43 tflops. STEP-94: loss = 0.00230, step_time = 0.024654 sec, perf = 33.45 tflops. STEP-95: loss = 0.00228, step_time = 0.024655 sec, perf = 33.45 tflops. STEP-96: loss = 0.00225, step_time = 0.024624 sec, perf = 33.49 tflops. STEP-97: loss = 0.00223, step_time = 0.024537 sec, perf = 33.61 tflops. STEP-98: loss = 0.00221, step_time = 0.024658 sec, perf = 33.44 tflops. STEP-99: loss = 0.00218, step_time = 0.024640 sec, perf = 33.47 tflops.

[Summary] Average synchronized step_time = 0.024636399000883102 sec.

Performance on sm_120 is low in the examples, and I had to copy the missing kernels shown below (which are not in ops/cuda) for llm_moe to even reach backend.hpp:139.

cp /opt/deepseek-tutel-accel/ops.b200/qwen3_norm_rotary_kvcache2_bf16.mod /opt/deepseek-tutel-accel/Tutel/build/lib.linux-x86_64-cpython-312/tutel/ops/cuda/
cp /opt/deepseek-tutel-accel/ops.b200/gemv_nt_bf16xfp8_block_v2.mod /opt/deepseek-tutel-accel/Tutel/build/lib.linux-x86_64-cpython-312/tutel/ops/cuda/
cp /opt/deepseek-tutel-accel/ops.b200/sig_allreduce_bf16_u2.mod /opt/deepseek-tutel-accel/Tutel/build/lib.linux-x86_64-cpython-312/tutel/ops/cuda/
cp /opt/deepseek-tutel-accel/ops.b200/fmoe_f16xf4_phase_1.mod /opt/deepseek-tutel-accel/Tutel/build/lib.linux-x86_64-cpython-312/tutel/ops/cuda/
cp /opt/deepseek-tutel-accel/ops.b200/fmoe_f16xf4_phase_2.mod

ec-jt avatar Aug 06 '25 08:08 ec-jt

Thanks! The single GPU doesn't seem slow, which is expected. But cross-GPU communication in your environment has no P2P support, so its bandwidth becomes extremely slow. I suggest avoiding cross-GPU setups unless you are willing to accept the high GPU-GPU latency. I'll update the image to include sm_120 support, so that a single GPU can run Qwen3-30B-A3B-Thinking-2507-FP4 without problems. Finally, may I know what the command below outputs in your 5090 environment?

python3 -c 'import torch; print(torch.cuda.get_device_capability())'
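
A related check (my suggestion, not something Tutel requires): torch.cuda.get_arch_list() shows which CUDA architectures the installed PyTorch wheel was compiled for, which helps confirm whether sm_120 kernels are built in.

python3 -c 'import torch; print(torch.cuda.get_device_capability()); print(torch.cuda.get_arch_list())'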

ghostplant avatar Aug 06 '25 20:08 ghostplant

python3 -c 'import torch; print(torch.cuda.get_device_capability())'
(12, 0)

Upgrading to libnccl2=2.26.2-1+cuda12.8 and libnccl-dev=2.26.2-1+cuda12.8 fixed my P2P issues.

./alltoall_perf -g 3
# nThread 1 nGpus 3 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0       
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    33554400       2796200     float    none      -1    12811    2.62    1.75      0    12201    2.75    1.83    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.78975


ec-jt avatar Aug 07 '25 10:08 ec-jt

@ec-jt Thank you! Since your environment previously reported that GPU-to-GPU direct access is not available, my guess about the slow speed of any multi-GPU application was correct.

Unlike server-level GPUs (connected by NVLink/Infinity Fabric), which sustain GPU-GPU transfer speeds above 200 GB/s, your environment is around 3 GB/s, which is about 70x slower. So unless you really need huge models that cannot fit into a single GPU, I suggest running single-GPU models in most cases for speed.

We just uploaded a new image, version 20250808, on the Tutel front page. It resolves the backend.hpp error you mentioned for sm_120 (5090), and it also adds GPT-OSS-20B support (another good single-GPU model).

You can try the 2 suggested models below (gpt-oss & qwen3-30b-a3b); both are suitable for a single GPU with limited memory:

docker run -e LOCAL_SIZE=1 -it --rm --ipc=host --shm-size=8g -p 8000:8000 \
     --ulimit memlock=-1 --ulimit stack=67108864 -v /:/host -w /host$(pwd) \
     -v /usr/lib/x86_64-linux-gnu/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1 --privileged \
     tutelgroup/deepseek-671b:a100x8-chat-20250808 --serve=webui --listen_port 8000 \
       --try_path ./openai/gpt-oss-20b \
       --try_path ./NVFP4/Qwen3-30B-A3B-Thinking-2507-FP4 \
       --prompt "Calculate the indefinite integral of 1/sin(x) + x"

I can't try it on sm_120 myself, but I believe at least the above 2 models should work, while other models may not.

ghostplant avatar Aug 07 '25 22:08 ghostplant

Thanks; for it to work you just need to upgrade to triton 3.3.1 and then rebuild. I am looking to optimise with fused kernels and FP4 tensor cores: as you can see below, FP8 is currently faster than FP4, so there is a significant amount of performance to be gained from properly using the FP4 tensor cores. I see you use autort; could you explain how the .mod files are generated, e.g. gemv_nt_bf16xfp8_block_v2 / fmoe_f16xf4_phase_2?

FP8>> Decode TPS Benchmark (bsz = 1, output_ctx = 64, n_gpus = 1): 260.5692 tokens/sec
FP4>> Decode TPS Benchmark (bsz = 1, output_ctx = 64, n_gpus = 1): 242.5041 tokens/sec
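
(For clarity on the metric: decode TPS here is just generated tokens divided by wall-clock decode time. A generic measurement harness, as a sketch with a hypothetical generate_tokens() callable rather than llm_moe_tutel.py's internals:)

# tps_probe.py -- sketch; generate_tokens(n) is a hypothetical stand-in for the model's decode loop
import time
import torch

def decode_tps(generate_tokens, output_ctx=64):
    torch.cuda.synchronize()                  # exclude pending async work from the timing
    t0 = time.time()
    generate_tokens(output_ctx)               # produce output_ctx tokens
    torch.cuda.synchronize()
    return output_ctx / (time.time() - t0)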

ec-jt avatar Aug 07 '25 23:08 ec-jt

The FP8-vs-FP4 efficiency conclusion may not hold, since these are untuned versions for your sm_120 card. (FP4 may be faster than FP8 if properly tuned, which would also improve the end-to-end numbers.)

Some module files are produced by the public autort, such as qwen3_norm_rotary_kvcache2_bf16.mod, while the fp8/fp4/sorting-related modules are built with a patched autort that depends on a stack of patched environments such as triton. Because those heavy third-party dependencies are not feasible to pack into the autort installers, and because of their license issues, the latter modules currently cannot be produced with the public autort alone; instead, we can provide the PTX version of those .mod files rather than the CUDA fatbin, if you want.

ghostplant avatar Aug 08 '25 07:08 ghostplant

I can look at sm_120; is the format nvfp4 or mxfp4? Could you share the PTX for fmoe_f16xf4_phase_1_v2.mod, fmoe_f16xf4_phase_2_v2.mod, gemv_nt_bf16xf4_block, gemm_nt_bf16xf4_block and to_float4_block?

ec-jt avatar Aug 08 '25 09:08 ec-jt

Attachments: fmoe_f16xf4_phase_2.ptx.txt, fmoe_f16xf4_phase_1.ptx.txt, Makefile.txt

There are no gemv_nt_bf16xf4_block, gemm_nt_bf16xf4_block or to_float4_block modules, since fp4 is not a blockwise algorithm (unlike f16xfp8). On SM_120, only fmoe_f16xf4_phase_1.ptx and fmoe_f16xf4_phase_2.ptx are useful, while all other ops currently use standard bf16@bf16 directly.

ghostplant avatar Aug 11 '25 03:08 ghostplant