                        Float FMA vs Integer DP4A & DPX Instructions
CUDA natively supports Fused Multiply-Accumulate (FMA) operations for every float type, including f16 and bf16. It also provides DP4A instructions for 8-bit integer dot-products with 32-bit accumulation, and umul24 instructions for 24-bit integer multiplication. Starting with Hopper, Dynamic Programming eXtensions (DPX) were added for combinatorial problems; they can be used to implement Algebraic Graph Theory algorithms via matrix multiplications over alternative semirings.
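To make the instruction mix concrete, here is a minimal sketch of a CUDA kernel issuing each of these scalar flavors; the kernel name and buffers are hypothetical and not taken from less_slow.cpp, and it assumes compilation for sm_61 or newer:

```cuda
#include <cuda_fp16.h>    // __half, __hfma, __float2half, __half2float
#include <cuda_runtime.h>

// Hypothetical probe kernel, not the repository's benchmark: each thread
// issues one instruction of every flavor discussed above.
// Requires sm_61+ for __dp4a and sm_53+ for half-precision FMA.
__global__ void scalar_op_probe(int const *a, int const *b, int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // bounds checks omitted
    int x = a[i], y = b[i];

    // f32 FMA: x * y + 1, rounded once, in a single instruction.
    float f = fmaf(__int2float_rn(x), __int2float_rn(y), 1.0f);

    // f16 FMA: the same shape, but on 16-bit operands.
    __half h = __hfma(__float2half(f), __float2half(f), __float2half(1.0f));

    // DP4A: dot product of four packed 8-bit lanes with a 32-bit accumulator.
    int dot = __dp4a(x, y, 0);

    // umul24: multiplies the lower 24 bits of each operand.
    unsigned m24 = __umul24((unsigned)x, (unsigned)y);

    out[i] = dot + (int)m24 + (int)f + (int)__half2float(h);
}
```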
How do those instructions stack up, and how much performance can we expect from recent State-of-the-Art GPUs like the Nvidia H200?
- `f64` FMA: 4.5 T
- `i64` FMA: 3.1 T
- `f32` FMA: 22 T
- `i32` FMA: 15.5 T ...so we should always prefer 32-bit ops
- `u8u32` DP4A: 39.3 T
- `u24u32` UMUL: 13.4 T ...not really better than `i32` FMA
- `f16` FMA on Volta: 12.2 T
- `bf16` FMA on Ampere: 12.2 T
- DPX for Floyd-Warshall algorithm with `u16` and `u32` on Hopper: 11 T
- DPX for Needleman-Wunsch algorithm with `i16` and `i32` on Hopper: 11 T
- DPX for Smith-Waterman algorithm with `i32` on Hopper: 27 T
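For orientation, the DPX entries above boil down to single fused min/max intrinsics that CUDA 12 exposes, such as `__viaddmin_u32` (min of a sum and a third operand) and `__viaddmax_s32_relu` (max of a sum, a third operand, and zero). Below is a minimal sketch under those assumptions; the kernel name and memory layout are hypothetical and not part of less_slow.cpp:

```cuda
#include <cuda_runtime.h>

// Hypothetical DPX probe (CUDA 12+; hardware-accelerated on Hopper, emulated
// in software on older architectures). Each thread performs one relaxation /
// cell update from the dynamic-programming recurrences mentioned above.
__global__ void dpx_probe(unsigned const *d, int const *s,
                          unsigned *d_out, int *s_out) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x; // bounds checks omitted

    // Floyd-Warshall relaxation: min(d[i][k] + d[k][j], d[i][j]) in one instruction.
    d_out[i] = __viaddmin_u32(d[3 * i], d[3 * i + 1], d[3 * i + 2]);

    // Smith-Waterman cell update: max(H_diag + substitution, gap_score, 0) in one instruction.
    s_out[i] = __viaddmax_s32_relu(s[3 * i], s[3 * i + 1], s[3 * i + 2]);
}
```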
Check the code and inline comments for more details!