H-Jamieu comments

Results: 3 comments of H-Jamieu

About the theoretical value of the GPU

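Conclusion first: this is caused by the default setting in the official PyTorch distribution, `torch.backends.cuda.matmul.allow_tf32 = False`. Setting this variable to `True` solves the problem. `torch.set_float32_matmul_precision('high')` achieves the same thing, though according to the documentation that mode may use either TF32 or a bfloat16-based algorithm for the speedup. (Opinion) In other words, the roughly 80 TFLOPS of FP32 throughput that the 4090 shows in this microbenchmark is achieved with Tensor Core acceleration; computing on plain CUDA cores gives roughly 54 TFLOPS.

Reference 1: https://pytorch.org/docs/stable/notes/cuda.html
Reference 2: https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html

---------------- Outdated content ----------------

I ran into this problem too. My findings so far:

1. With the official `pip install torch` command, FP32 reaches only about 62.5% of the theoretical value on both Windows and Linux, and real training measurements agree with the benchmark score. The 3090/4090/4080/4060 cards I have on hand all behave the same; I tried CUDA versions 11.2 through 12.1 and PyTorch versions 1.13 through 2.1.
2. FP16 is close to the theoretical value.
3. NVIDIA's PyTorch build reaches the theoretical score.
4. I once happened to compile my own PyTorch build against CUDA 10.2, and with it the 3090 reached its rated FP32 throughput on native Windows.

Conjecture: the problem may be related to the wheels officially built by PyTorch.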
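For anyone who wants to check this on their own card, here is a minimal sketch of how the comparison could be timed. The helper names `matmul_tflops` and `benchmark_fp32_matmul` are made up for illustration and are not from the original benchmark; the sketch assumes a CUDA-capable GPU and counts an n x n matmul as 2 * n^3 floating-point operations.

```python
import time


def matmul_tflops(n: int, seconds: float) -> float:
    """TFLOPS for one (n x n) @ (n x n) matmul, counted as 2 * n**3 operations."""
    return 2 * n**3 / seconds / 1e12


def benchmark_fp32_matmul(n: int = 8192, allow_tf32: bool = False, repeats: int = 10) -> float:
    """Time an FP32 matmul on the current CUDA device and return the achieved TFLOPS."""
    import torch  # imported lazily so matmul_tflops stays usable without PyTorch

    # The flag discussed above: True lets FP32 matmuls run on Tensor Cores via TF32.
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32

    a = torch.randn(n, n, device="cuda", dtype=torch.float32)
    b = torch.randn(n, n, device="cuda", dtype=torch.float32)

    # Warm up so one-time setup and kernel selection are not included in the timing.
    for _ in range(3):
        a @ b
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    torch.cuda.synchronize()  # wait until all queued kernels have finished
    elapsed = (time.perf_counter() - start) / repeats

    return matmul_tflops(n, elapsed)
```

Calling `benchmark_fp32_matmul(8192, allow_tf32=False)` and then `benchmark_fp32_matmul(8192, allow_tf32=True)` on the same GPU should show whether the gap comes from the TF32 setting. Exact numbers will vary with clocks, drivers, and matrix size.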

About the theoretical value of the GPU

Experiments:

```
ENV: windows 11, python3.9
Pytorch version : 2.3.1+cu118
CUDA version : 11.8
GPU : NVIDIA GeForce RTX 4090

default
               n=128   n=512   n=2048  n=8192  n=16384
torch.float32  0.224   14.299  55.590...
```

FSDP2 with LoRA training throws dtype mismatch error

Same issue