Query Regarding Fast-Math Optimization Support on AMD GPUs
Hello,
In the CUDA environment, compiling a program with nvcc using the --use_fast_math flag leverages the multi-function units (MUFUs) of NVIDIA GPUs. This approach often yields slightly less accurate but faster-executing functions.
I'm interested in whether there is a similar feature when programming with HIP on AMD GPUs. I noticed the HIP documentation mentions floating-point intrinsics as fast approximate functions. However, it isn't explicitly stated whether these intrinsics are supported on AMD GPUs.
In my own tests, I observed a difference in results when a CUDA program was run on an NVIDIA GPU with and without intrinsic functions, which suggests the intrinsics are working as expected. However, when I ported the program to HIP and ran it on an AMD GPU, using (or not using) the intrinsic functions did not affect the output, which was consistently the precise result.
Based on this, I'm led to assume that fast-math optimizations might not be supported when programming with HIP on AMD GPUs.
Could you please confirm if my assumption is correct? If it is, are there any planned updates or workarounds to enable similar optimizations on AMD GPUs?
Thank you for your time, and I'm looking forward to your response.
You can use -ffast-math for HIP https://clang.llvm.org/docs/UsersManual.html#controlling-floating-point-behavior
Hi @yxsamliu , thanks for your reply.
From the doc, I guess this optimization flag only affects the CPU/host code, not the GPU/device code?
The option is applied to both CPU and GPU code.
Thanks! I checked the listed options, and it seems there is no optimization for math functions like expf, which are also optimized on NVIDIA devices. (Info here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#intrinsic-functions). Do you have any idea whether AMD GPUs have such support? Thank you so much!
clang will link some specific device libraries when -ffast-math is specified, which define some control variables affecting how these functions are implemented, e.g. https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/amd-stg-open/ocml/src/expF_base.h#L45
Ah, I see!
But I applied the fast-math option when compiling the above-mentioned program with hipcc. It was supposed to produce a different answer with fast math enabled, but the output did not change. I was using many math functions inside. I can show you the code tomorrow if you want, since I don't have it at hand right now. Thanks!
Hi @yxsamliu , Here is my code:
#include "hip/hip_runtime.h"
/* This is an automatically generated test. Do not modify. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

__global__
void compute(float comp, int var_1) {
    for (int i = 0; i < var_1; ++i) {
        if (comp <= sinhf(powf(atan2f(atan2f(ceilf(-0.0f), powf(-0.0f, sinhf(ldexpf(asinf(sqrtf(sinf(log10f(atan2f(expf(ldexpf(coshf(sinhf(tanhf(powf(fmodf(ldexpf(sinhf(-1.0233E21f), 2), acosf(+1.4791E-37f)), log10f(log10f(+1.9760E-29f)))))), 2)), fabsf(sinf(logf(log10f(logf(acosf(ldexpf(+1.2971E8f, 2)))))))))))), 2)))), asinf(-1.7706E35f)), +0.0f))) {
            comp += acosf(asinf(ceilf(asinf(-1.6714E-37f))));
        }
    }
    uint32_t hex_val;
    memcpy(&hex_val, &comp, sizeof(float));
    // printf("0x%x\n", hex_val);
    printf("%.17g\n", comp);
}

float* initPointer(float v) {
    float *ret = (float*) malloc(sizeof(float) * 10);
    for (int i = 0; i < 10; ++i)
        ret[i] = v;
    return ret;
}

int main(int argc, char** argv) {
    /* Program variables */
    float tmp_1 = atof(argv[1]);
    int tmp_2 = atoi(argv[2]);

    compute<<<1,1>>>(tmp_1, tmp_2);
    hipError_t err = hipDeviceSynchronize();
    if (err != hipSuccess) {
        fprintf(stderr, "Error: %s\n", hipGetErrorString(err));
        return 1;
    }
    return 0;
}
The corresponding CUDA program produces 1.5707963705062866 without --use_fast_math and -1.504299955062057e-37 with --use_fast_math. However, with or without -ffast-math we get 1.5707963705062866 on AMD GPUs. So I doubt whether -ffast-math took effect.
It would be good to dump the assembly code to check whether the optimization is applied. However, when I tried roc-obj-ls ./test.hip.exe | roc-obj-extract as indicated here, it gives me errors:
Error: '1 host-x86_64-unknown-linux file' is not recognized as a supported code object URI.
Error: '1 hipv4-amdgcn-amd-amdhsa--gfx908 file' is not recognized as a supported code object URI.
Any help is appreciated! Thanks!
You can use -save-temps when compiling your program. The assembly can be found in the *-gfx908.s file.
Hi @yxsamliu ,
Thanks for your reply. I just checked the intermediate code and found there is only one line of difference with and without -ffast-math:
< v_cmp_nge_f32_e32 vcc, s4, v36
---
> v_cmp_lt_f32_e32 vcc, s4, v36
Does that mean -ffast-math didn't apply any optimizations to the math functions? Thanks!
Your floating-point computation operates entirely on floating-point constants. clang will constant-fold them at compile time, and compile-time evaluation is always fully accurate. That's why you cannot see the difference.
Try replacing those fp constants with kernel arguments to keep clang from evaluating them at compile time.
Hi @yxsamliu, I've changed my computation code as
if (comp <= sinhf(powf(atan2f(atan2f(ceilf(comp), powf(var_1, sinhf(ldexpf(asinf(sqrtf(sinf(log10f(atan2f(expf(ldexpf(coshf(sinhf(tanhf(powf(fmodf(ldexpf(sinhf(-1.0233E21f), 2), acosf(+1.4791E-37f)), log10f(log10f(comp)))))), 2)), fabsf(sinf(logf(log10f(logf(acosf(ldexpf(+1.2971E8f, 2)))))))))))), 2)))), asinf(-1.7706E35f)), +0.0f))) {
Still, I didn't observe any differences in the .s files.
I was running the programs on an MI100. Could that affect which math APIs are used? I mean, does that mean the MI100 doesn't provide multi-function mathematical units like NVIDIA GPUs do? Thanks!
The expression you mentioned can be simplified to powrf(x, 0), which the compiler can optimize to 1; this does not depend on -ffast-math. https://godbolt.org/z/sWEYdbKcn
Try changing that +0.0f to a kernel argument.
Also, your final result mostly depends on acosf(asinf(ceilf(asinf(-1.6714E-37f)))); the -1.6714E-37f should be changed to a kernel argument too. Otherwise you may not see any difference due to -ffast-math.
Hi @yxsamliu, Thank you for your patient and considerate response. I've modified the codes based on your suggestions and now I'm noticing the differences!
Presently, I'm in the process of evaluating the numerical computation units of NVIDIA and AMD GPUs.
In the examples I've experimented with, it appears that nvcc's --use_fast_math and hipcc's -ffast-math intermittently yield inaccurate or incorrect results. I'm still investigating this, but according to the documentation, nvcc's --use_fast_math includes the following options: --ftz=true --prec-div=false --prec-sqrt=false --fmad=true. It also interprets some single-precision mathematical functions as intrinsic functions. Conversely, hipcc's -ffast-math consists of -fno-honor-infinities -fno-honor-nans -fapprox-func -fno-math-errno -ffinite-math-only -fassociative-math -freciprocal-math -fno-signed-zeros -fno-trapping-math -fno-rounding-math -ffp-contract=fast.
Considering these options, it appears that nvcc's optimizations take advantage of NVIDIA's unique hardware units—MUFU—to expedite mathematical computations. In contrast, the optimizations of hipcc with an AMD backend seem to lean more towards software-level adaptations.
I realize this is a complex topic, and you may not have all the answers. Nonetheless, based on your expertise, would you say my interpretation of the differences aligns with your understanding? I truly appreciate any insights you might have!
@b-sumner Do you know whether it is possible to let sin(float) emit ISA like __ocml_native_sin_f32 under certain options? Or is there always some additional computation even with the most relaxed options? Thanks.
@yxsamliu it is possible, but of course it would not match the cuda results on nvidia. Will that not be a problem?
@b-sumner I saw the ISA for sin(float) and __sinf are different with -ffast-math -O3: https://godbolt.org/z/4vYdc6f86. What could cause this? Thanks.
@yxsamliu __sinf is a Cuda/HIP function that is implemented with a call to the native sin function, while sin is implemented with a call to the regular OCML sin function. That could presumably be changed using the __FAST_MATH__ macro.
@b-sumner You mean modifying the HIP headers to implement sin(float) as __sinf when __FAST_MATH__ is defined, since clang predefines the macro __FAST_MATH__ for -ffast-math? That sounds like a viable solution. I saw Cuda-clang does a similar thing.
@yxsamliu exactly. However, I'd like to note that implementing this could break existing applications that are somehow dependent on the higher accuracy.
@b-sumner I am thinking we could probably introduce a new option -f[no-]hip-use-native to control whether native math functions are used. We can make it default off and let -ffast-math imply it on, while still allowing -fno-hip-use-native to override it. If regressions happen, users can add -fno-hip-use-native to get the previous behaviour back.
Sounds reasonable to me. We'll need to be sure to document this change.
Hi @yxsamliu and @b-sumner,
I sincerely appreciate your prior discussion and responses. IMHO, implementing the changes discussed could greatly enhance the experience of NVIDIA GPU users, because aligning with the CUDA documentation would remove the need to individually tweak each math function to achieve similar results with HIP on NVIDIA GPUs.
In the case of AMD users, it might be useful for them to understand the potential impact of this optimization. As of now, I'm conducting tests to evaluate the performance and numerical consistency for both GPUs. My current findings indicate that enabling --use_fast_math on CUDA generally results in a performance boost, with correct outcomes in approximately 80% of cases.
Testing on the AMD backend is ongoing, but significant performance improvements may be less likely, assuming there's no dedicated hardware support on AMD GPUs.
Thank you for considering these thoughts.
https://reviews.llvm.org/D154790 https://reviews.llvm.org/D154797
We enabled native math functions with -ffast-math. They can be disabled with -fno-gpu-approx-transcendentals.
Hi @yxsamliu , Thank you so much! To use this function, shall I install the latest ROCm?
It will take some time to get to a future release of ROCm.
Sure! I'd like to know when it is available! Thanks!
@xinyi-li7 Can you please test with latest ROCm 6.1.0 (HIP 6.1)? If resolved, please close ticket. Thanks!
@xinyi-li7 -ffast-math is now available in the latest ROCm 6.1.1 Thanks!