clpeak
Half Precision not detected for RTX 3090
clpeak version: 1.1.2

Platform: NVIDIA CUDA
  Device: NVIDIA GeForce RTX 3090
    Driver version  : 525.89.02 (Linux x64)
    Compute units   : 82
    Clock frequency : 1725 MHz

    Global memory bandwidth (GBPS)
      float   : 816.91
      float2  : 841.68
      float4  : 856.31
      float8  : 785.62
      float16 : 844.80

    Single-precision compute (GFLOPS)
      float   : 35976.15
      float2  : 35279.88
      float4  : 35448.44
      float8  : 35229.30
      float16 : 34781.18

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 635.40
      double2  : 634.58
      double4  : 633.12
      double8  : 630.11
      double16 : 624.10

    Integer compute (GIOPS)
      int   : 19650.09
      int2  : 19531.53
      int4  : 19486.43
      int8  : 19548.59
      int16 : 19539.19

    Integer compute Fast 24bit (GIOPS)
      int   : 19452.70
      int2  : 18920.43
      int4  : 19145.33
      int8  : 19143.94
      int16 : 19075.51

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 9.96
      enqueueReadBuffer               : 10.48
      enqueueWriteBuffer non-blocking : 5.47
      enqueueReadBuffer non-blocking  : 5.55
      enqueueMapBuffer(for read)      : 10.76
        memcpy from mapped ptr        : 15.20
      enqueueUnmap(after write)       : 13.04
        memcpy to mapped ptr          : 15.20

    Kernel launch latency : 3.56 us
There is no native half-precision support on NVIDIA Ampere (except the A100) or Ada GPUs. Their half-precision performance is the same as their single-precision performance.
@moyang The RTX 3090 does have native FP16 support in its Tensor cores, per the GA102 whitepaper: https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
512 FP16 FMA per SM per clock (128 FP16 FMA per Tensor core); the RTX 3090 has 82 SMs and 328 Tensor cores.
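As a rough sanity check of those numbers, here is a back-of-the-envelope sketch (assumptions: 2 FLOPs per FMA, and the 1725 MHz clock reported by clpeak above, although the actual Tensor-core boost clock may differ):

```c
/* Back-of-the-envelope peak FP16 Tensor throughput implied by the figures
 * above. Assumptions: 2 FLOPs per FMA, and the 1725 MHz clock reported by
 * clpeak (the real Tensor-core boost clock may differ). */
#include <stdio.h>

int main(void) {
    const double sms             = 82.0;   /* SMs on the RTX 3090            */
    const double tensor_cores    = 328.0;  /* 4 Tensor cores per SM * 82 SMs */
    const double fp16_fma_per_tc = 128.0;  /* FP16 FMA per Tensor core/clock */
    const double clock_ghz       = 1.725;  /* from the clpeak output above   */

    double gflops = tensor_cores * fp16_fma_per_tc * 2.0 * clock_ghz;
    printf("Per-SM rate: %.0f Tensor cores * %.0f FMA = %.0f FP16 FMA/clock\n",
           tensor_cores / sms, fp16_fma_per_tc,
           (tensor_cores / sms) * fp16_fma_per_tc);
    printf("Implied peak FP16 Tensor throughput: ~%.0f GFLOPS (%.1f TFLOPS)\n",
           gflops, gflops / 1000.0);
    return 0;
}
```

Under those assumptions that works out to roughly 145 TFLOPS of FP16 through the Tensor cores, so the hardware capability is there; the question is whether the OpenCL driver exposes it.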
@BA8F0D39 This seems to be a problem with NVIDIA's OpenCL implementation. When an application such as clpeak queries the device's capabilities, the driver reports no half-precision support. I have observed the same issue with other benchmarks, such as SiSoftware Sandra.
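For context, the "No half precision support! Skipped" line above comes from an extension check: clpeak skips the half-precision kernels when the device does not advertise cl_khr_fp16. A minimal standalone check (a sketch, not clpeak's actual code) shows what the driver reports for the first OpenCL GPU:

```c
/* Minimal sketch (not clpeak's actual code): ask the first OpenCL GPU
 * whether it advertises the cl_khr_fp16 extension, which is what
 * benchmarks key off when deciding to run half-precision kernels. */
#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    char extensions[16384] = {0};

    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "No OpenCL GPU device found\n");
        return 1;
    }

    if (clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,
                        sizeof(extensions), extensions, NULL) != CL_SUCCESS) {
        fprintf(stderr, "clGetDeviceInfo(CL_DEVICE_EXTENSIONS) failed\n");
        return 1;
    }

    printf("cl_khr_fp16 advertised: %s\n",
           strstr(extensions, "cl_khr_fp16") ? "yes" : "no");
    return 0;
}
```

Build with e.g. `gcc fp16_check.c -lOpenCL`. On NVIDIA's OpenCL stack this typically prints "no", matching the skipped section in the output above: the extension simply is not exposed by the driver even where the hardware can execute FP16.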