ec-jt
Currently testing this on 8 GPUs in a single machine.
Please include support for CUDA Compute Capability 12.0 (sm120) 👍
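For context, a quick way to check whether the installed PyTorch wheel itself already targets sm120 (`torch.cuda.get_arch_list` is a standard call; expecting an `sm_120` entry is an assumption about Blackwell-enabled builds):

```
python3 -c 'import torch; print(torch.cuda.get_arch_list())'
# a Blackwell-enabled wheel should list 'sm_120' somewhere in the output
```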
Sure, or alternatively I can just send the ops.
Is the repo NVFP4/Qwen3-235B-A22B-Instruct-2507-FP4?
Upgraded to triton==3.3.1 and built tutel from source. The examples work fine but are slow, e.g. `--nproc_per_node=2 -m tutel.examples.helloworld`. However, launching llm_moe_tutel.py errors at backend.hpp:139 with sm120. Ran with `LOCAL_SIZE=1 LAYER=1 TORCHDYNAMO_VERBOSE=1 NCCL_DEBUG=INFO...`
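For anyone reproducing this, the upgrade/rebuild sequence was roughly the following; this is a sketch assuming the usual pip-from-git source install (check tutel's README for the exact command):

```
pip3 install triton==3.3.1
# rebuild tutel from source so the extension is recompiled against the new triton/toolchain
python3 -m pip install -v --upgrade --no-build-isolation git+https://github.com/microsoft/tutel@main
# sanity check on 2 GPUs
python3 -m torch.distributed.run --nproc_per_node=2 -m tutel.examples.helloworld
```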
```
python3 -m torch.distributed.run --nproc_per_node=2 -m tutel.examples.bandwidth_test --size_mb=1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the...
```
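That banner is just torch.distributed.run defaulting OMP_NUM_THREADS to 1 when it is unset; if the test looks CPU-bound you can pin it explicitly (the value 8 below is only an illustration, tune it for your machine):

```
OMP_NUM_THREADS=8 python3 -m torch.distributed.run --nproc_per_node=2 -m tutel.examples.bandwidth_test --size_mb=1
```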
```
python3 -c 'import torch; print(torch.cuda.get_device_capability())'
(12, 0)
```

Upgrading to libnccl2=2.26.2-1+cuda12.8 libnccl-dev=2.26.2-1+cuda12.8 fixed my p2p issues.

```
./alltoall_perf -g 3
# nThread 1 nGpus 3 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes)...
```
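For anyone hitting the same p2p issue, a quick sanity check that CUDA reports peer access between a GPU pair (`torch.cuda.can_device_access_peer` is a standard call; the device indices 0 and 1 are just an example):

```
python3 -c 'import torch; print(torch.cuda.can_device_access_peer(0, 1))'
# True => CUDA reports direct P2P between GPU 0 and GPU 1
```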
Thanks, you just need to upgrade to triton 3.3.1 and then rebuild for it to work. I am looking to optimise with fused kernels and FP4 tensor cores, as you can...
I can look at sm120. Is the format nvfp4 or mxfp4? Could you share any PTX for fmoe_f16xf4_phase_1_v2.mod, fmoe_f16xf4_phase_2_v2.mod, gemv_nt_bf16xf4_block, gemm_nt_bf16xf4_block, and to_float4_block?
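If it helps, PTX can usually be recovered from a compiled CUDA artifact with cuobjdump; the file name below is purely illustrative, point it at whichever binary or fatbin tutel actually emits for those kernels:

```
cuobjdump --dump-ptx fmoe_f16xf4_phase_1_v2.cubin
# also works on shared libraries with embedded fatbins, e.g.:
cuobjdump --dump-ptx libcustom_kernels.so
```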