I don't have an sm_120 card. Could you do me a favor and validate whether some images work well on that GPU?
We found the image works well on sm_100; the error you show seems to be related to a **failure due to a non-standard GPU-GPU connection**. Q1: **May I know if your...
Yes, the model path on your local machine.
@ec-jt The error is due to your GPUs not having inter-GPU P2P copy support, so a slower path is needed.
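If you want to double-check what the driver reports, here is a minimal probe (plain PyTorch, not tutel code; it assumes CUDA is available and you have at least 2 visible GPUs):

```
# Plain PyTorch probe (not tutel code); assumes CUDA and >= 2 visible GPUs.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        ok = torch.cuda.can_device_access_peer(i, j)
        print(f"GPU {i} -> GPU {j}: P2P {'supported' if ok else 'NOT supported'}")
```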
Can you run these 2 cases to show your device capability?

```
python3 -m tutel.examples.helloworld --batch_size=16
```

```
python3 -m torch.distributed.run --nproc_per_node=2 -m tutel.examples.bandwidth_test --size_mb=1
```
Thanks! A single GPU doesn't seem slow, which is expected. But cross-GPU communication in your environment doesn't have P2P support, so the communication bandwidth turns out to be extremely slow. So I suggest avoiding...
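For reference, here is a rough wall-clock sketch of what that measurement amounts to (plain PyTorch, not tutel's bandwidth_test; it assumes 2 visible CUDA GPUs):

```
# Rough wall-clock probe (not tutel's bandwidth_test); assumes 2 visible CUDA GPUs.
import time
import torch

size_mb = 256
x = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device="cuda:0")

_ = x.to("cuda:1")          # warm-up: initializes the cuda:1 context and copy path
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

t0 = time.perf_counter()
y = x.to("cuda:1")
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
dt = time.perf_counter() - t0

# Without P2P the copy is staged through host memory, so this number drops sharply.
print(f"GPU0 -> GPU1: {size_mb / dt / 1e3:.2f} GB/s")
```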
@ec-jt Thank you! Since your environment previously reported that GPU-to-GPU direct access is not feasible, my guess about **slow speed for any multi-GPU applications** was correct. Unlike server-level GPUs (empowered by...
The conclusion about FP8/FP4 efficiency may not hold, since these are untuned versions for your sm_120 card. (So FP4 may be faster than FP8 if properly tuned, and also...
[fmoe_f16xf4_phase_2.ptx.txt](https://github.com/user-attachments/files/21708890/fmoe_f16xf4_phase_2.ptx.txt)
[fmoe_f16xf4_phase_1.ptx.txt](https://github.com/user-attachments/files/21708891/fmoe_f16xf4_phase_1.ptx.txt)
[Makefile.txt](https://github.com/user-attachments/files/21708893/Makefile.txt)

There are no gemv_nt_bf16xf4_block, gemm_nt_bf16xf4_block, or to_float4_block kernels, since fp4 is not a blockwise algorithm (unlike f16xfp8). On SM_120, only fmoe_f16xf4_phase_1.ptx and fmoe_f16xf4_phase_2.ptx are useful, while all other ops...
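If you want to confirm which entry kernels a given PTX file actually contains (and which sm target it was built for), a small pure-Python sketch like this works on the attached files; the paths below are just the attachment names, so adjust them to wherever you saved the files:

```
# Pure-Python PTX inspection, no GPU required.
import re
import sys

ENTRY_RE = re.compile(r"\.visible\s+\.entry\s+([A-Za-z_$][\w$]*)")
TARGET_RE = re.compile(r"^\s*\.target\s+(\S+)", re.MULTILINE)

for path in sys.argv[1:]:
    with open(path) as f:
        ptx = f.read()
    target = TARGET_RE.search(ptx)
    print(f"{path}: .target = {target.group(1) if target else 'unknown'}")
    for name in ENTRY_RE.findall(ptx):
        print(f"  entry kernel: {name}")
```

Usage: `python3 list_ptx_entries.py fmoe_f16xf4_phase_1.ptx fmoe_f16xf4_phase_2.ptx`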
They are produced based on LLVM/NVVM, so there is no source code for them, only PTX. It is like Triton, which is also based on LLVM, and I don't think they have...
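As an illustration of the same workflow in Triton (a toy copy kernel, not one of the fMoE kernels; it assumes a CUDA GPU and a recent Triton release where launching a @triton.jit kernel returns the compiled handle), the only device artifacts you can inspect afterwards are the generated PTX/cubin/IR, not CUDA C source:

```
# Toy Triton kernel; assumes CUDA GPU and a recent Triton release where the
# launch returns the compiled handle.
import torch
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

x = torch.arange(1024, device="cuda", dtype=torch.float32)
y = torch.empty_like(x)
handle = copy_kernel[(4,)](x, y, x.numel(), BLOCK=256)

# Triton's LLVM-based backend only exposes generated artifacts (PTX, cubin, IR).
print(handle.asm["ptx"][:500])
```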