ec-jt

9 comments by ec-jt

Currently testing this on 8 local GPUs in a single machine.
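For context, that is just a single-node launch with the process count bumped to the local GPU count; the example module below is a stand-in, not necessarily the script under test:

```
# Hypothetical single-node launch across all 8 local GPUs; swap in the
# actual script/module being tested
python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld
```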

Please include CUDA compute capability 12.0 (sm120) 👍
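If the extension builds through PyTorch's cpp_extension machinery, a minimal sketch of targeting sm120 is the standard TORCH_CUDA_ARCH_LIST override (whether this project's setup honours it is an assumption):

```
# Assumption: the build goes through torch.utils.cpp_extension, which
# reads TORCH_CUDA_ARCH_LIST to choose target architectures
TORCH_CUDA_ARCH_LIST="12.0" python3 -m pip install -v .
```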

Sure, or alternatively I can just send the ops.

Is the repo NVFP4/Qwen3-235B-A22B-Instruct-2507-FP4?

Upgraded to triton==3.3.1 and built tutel from source; the examples work but are slow (e.g. --nproc_per_node=2 -m tutel.examples.helloworld). However, launching llm_moe_tutel.py errors at backend.hpp:139 with sm120. LOCAL_SIZE=1 LAYER=1 TORCHDYNAMO_VERBOSE=1 NCCL_DEBUG=INFO...
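Roughly the failing launch, reconstructed from the env vars above; the original command is truncated after NCCL_DEBUG=INFO and the script path is a guess:

```
# Approximate repro; the path to llm_moe_tutel.py is assumed, and the
# original env-var list is truncated after NCCL_DEBUG=INFO
LOCAL_SIZE=1 LAYER=1 TORCHDYNAMO_VERBOSE=1 NCCL_DEBUG=INFO \
  python3 -m torch.distributed.run --nproc_per_node=2 tutel/examples/llm_moe_tutel.py
```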

```
python3 -m torch.distributed.run --nproc_per_node=2 -m tutel.examples.bandwidth_test --size_mb=1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the...
```
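That banner is torchrun's default-thread-count warning; exporting OMP_NUM_THREADS up front silences it. A minimal sketch (the value 4 is arbitrary):

```
# Set OMP_NUM_THREADS explicitly to silence the torchrun warning;
# 4 is an arbitrary value, tune for the workload
OMP_NUM_THREADS=4 python3 -m torch.distributed.run --nproc_per_node=2 \
  -m tutel.examples.bandwidth_test --size_mb=1
```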

```
python3 -c 'import torch; print(torch.cuda.get_device_capability())'
(12, 0)
```

Upgrading to libnccl2=2.26.2-1+cuda12.8 libnccl-dev=2.26.2-1+cuda12.8 fixed my p2p issues.

```
./alltoall_perf -g 3
# nThread 1 nGpus 3 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes)...
```
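For anyone reproducing this: the apt pins are exactly the ones above, while the nccl-tests build steps are the standard upstream ones, so treat the whole thing as a sketch:

```
# Pin NCCL to the version that fixed p2p here
sudo apt-get install -y --allow-change-held-packages \
  libnccl2=2.26.2-1+cuda12.8 libnccl-dev=2.26.2-1+cuda12.8

# Standard nccl-tests build; binaries land in ./build/
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make
./build/alltoall_perf -g 3 -b 32M -e 32M
```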

Thanks. You just need to upgrade to Triton 3.3.1 and then rebuild for it to work. I am looking to optimise with fused kernels and FP4 tensor cores, as you can...
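A sketch of that upgrade-and-rebuild sequence, assuming a plain source checkout (the project's README may prescribe a different install command):

```
# Upgrade Triton first, then rebuild from source against it
python3 -m pip install --upgrade triton==3.3.1
cd tutel && python3 -m pip install -v .
```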

I can look at sm120. Is the format nvfp4 or mxfp4? Could you share any PTX for fmoe_f16xf4_phase_1_v2.mod, fmoe_f16xf4_phase_2_v2.mod, gemv_nt_bf16xf4_block, gemm_nt_bf16xf4_block, and to_float4_block?
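If those .mod files turn out to be fatbin/cubin containers, cuobjdump can usually dump the embedded PTX; that container format is a guess, and if the kernels are Triton-generated the PTX may instead sit in the local Triton cache:

```
# Assumption: .mod is a fatbin/cubin container that cuobjdump understands;
# for Triton-generated kernels, check ~/.triton/cache for .ptx files instead
cuobjdump --dump-ptx fmoe_f16xf4_phase_1_v2.mod
```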