
Query Regarding Fast-Math Optimization Support on AMD GPUs

Open xinyi-li7 opened this issue 1 year ago • 28 comments

Hello,

In the CUDA environment, compiling a program with nvcc using the --use_fast_math flag leverages the multi-function units (MUFU) of NVIDIA GPUs. This often yields slightly less accurate but faster implementations of certain math functions.

I'm interested in whether there is a similar feature when programming with HIP on AMD GPUs. I noticed the HIP documentation describes floating-point intrinsics as fast approximate functions, but it doesn't explicitly state whether these intrinsics are supported on AMD GPUs.

In my own tests, I observed a difference in results when running a CUDA program on an NVIDIA GPU with and without the intrinsic functions, which suggests the intrinsics work as expected there. However, when I ported the program to HIP and ran it on an AMD GPU, using or not using the intrinsic functions made no difference: the output was consistently fully precise.
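To illustrate, the kind of comparison I ran looks roughly like this (a sketch, not my exact test code):

__global__ void compare(float x, float* out) {
    out[0] = expf(x);    // precise library function
    out[1] = __expf(x);  // fast intrinsic; on NVIDIA this maps to the MUFU hardware
}

On the NVIDIA GPU the two outputs differ slightly; on the AMD GPU they were identical.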

Based on this, I'm led to assume that fast-math optimizations might not be supported when programming with HIP on AMD GPUs.

Could you please confirm if my assumption is correct? If it is, are there any planned updates or workarounds to enable similar optimizations on AMD GPUs?

Thank you for your time, and I'm looking forward to your response.

xinyi-li7 avatar Jun 28 '23 23:06 xinyi-li7

You can use -ffast-math for HIP https://clang.llvm.org/docs/UsersManual.html#controlling-floating-point-behavior
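For example:

hipcc -ffast-math test.hip -o test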

yxsamliu avatar Jun 29 '23 01:06 yxsamliu

Hi @yxsamliu, thanks for your reply.

From the doc, I'd guess this optimization flag only affects the CPU/host code, not the GPU/device code?

xinyi-li7 avatar Jun 29 '23 01:06 xinyi-li7

The option is applied to both CPU and GPU code.

yxsamliu avatar Jun 29 '23 01:06 yxsamliu

Thanks! I checked the listed options; none of them seems to cover math functions like expf, which are optimized on NVIDIA devices (info here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#intrinsic-functions). Do you have any idea whether AMD GPUs have such support? Thank you so much!

xinyi-li7 avatar Jun 29 '23 02:06 xinyi-li7

clang will link some specific device libraries when -ffast-math is specified, which define control variables affecting how these functions are implemented, e.g. https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/amd-stg-open/ocml/src/expF_base.h#L45
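Conceptually (the names below are simplified for illustration, not the actual library symbols), the pattern is:

// Link-time constant; its value comes from the control device library
// that clang selects based on -ffast-math.
extern const bool __unsafe_math_opt;

float expf_impl(float x) {
    if (__unsafe_math_opt)
        return native_exp(x);   // fast, less accurate path
    return precise_exp(x);      // default, precise path
}

Since the constant becomes known at link time, the branch folds away and only one implementation survives in the final code object.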

yxsamliu avatar Jun 29 '23 02:06 yxsamliu

Ah, I see!

But I applied the fast-math option when compiling the abovementioned program with hipcc. It should produce a different answer with fast math enabled, yet I see no change.

The code uses many math functions. I can show you the code tomorrow if you want, since I don't have it at hand right now. Thanks!

xinyi-li7 avatar Jun 29 '23 02:06 xinyi-li7

Hi @yxsamliu , Here is my code:

#include "hip/hip_runtime.h"

/* This is an automatically generated test. Do not modify */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

__global__
void compute(float comp, int var_1) {
  for (int i = 0; i < var_1; ++i) {
    if (comp <= sinhf(powf(atan2f(atan2f(ceilf(-0.0f), powf(-0.0f, sinhf(ldexpf(asinf(sqrtf(sinf(log10f(atan2f(expf(ldexpf(coshf(sinhf(tanhf(powf(fmodf(ldexpf(sinhf(-1.0233E21f), 2), acosf(+1.4791E-37f)), log10f(log10f(+1.9760E-29f)))))), 2)), fabsf(sinf(logf(log10f(logf(acosf(ldexpf(+1.2971E8f, 2)))))))))))), 2)))), asinf(-1.7706E35f)), +0.0f))) {
      comp += acosf(asinf(ceilf(asinf(-1.6714E-37f))));
    }
  }
  uint32_t hex_val;
  memcpy(&hex_val, &comp, sizeof(float));
  // printf("0x%x\n", hex_val);
  printf("%.17g\n", comp);
}

float* initPointer(float v) {
  float *ret = (float*) malloc(sizeof(float)*10);
  for(int i=0; i < 10; ++i)
    ret[i] = v;
  return ret;
}

int main(int argc, char** argv) {
/* Program variables */

  float tmp_1 = atof(argv[1]);
  int tmp_2 = atoi(argv[2]);

  compute<<<1,1>>>(tmp_1, tmp_2);
  hipError_t err = hipDeviceSynchronize();  // single synchronize with error check
  if (err != hipSuccess) {
    fprintf(stderr, "Error: %s\n", hipGetErrorString(err));
    return 1;
  }
  return 0;
}

The corresponding CUDA program produces 1.5707963705062866 without --use_fast_math and -1.504299955062057e-37 with --use_fast_math. However, with or without -ffast-math we get 1.5707963705062866 on AMD GPUs. So I doubt whether -ffast-math took effect.

It would be good to dump the assembly code to check whether an optimization is applied. However, when I tried roc-obj-ls ./test.hip.exe | roc-obj-extract as indicated here, it gives me this error:

Error: '1       host-x86_64-unknown-linux                                           file' is not recognized as a supported code object URI.
Error: '1       hipv4-amdgcn-amd-amdhsa--gfx908                                     file' is not recognized as a supported code object URI.

Any help is appreciated! Thanks!

xinyi-li7 avatar Jun 30 '23 01:06 xinyi-li7

You can use -save-temps when compiling your program. The assembly can be found in the *-gfx908.s file.
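For example, something like:

hipcc -save-temps --offload-arch=gfx908 -ffast-math test.hip -o test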

yxsamliu avatar Jun 30 '23 15:06 yxsamliu

Hi @yxsamliu, thanks for your reply. I just checked the intermediate code and found only a one-line difference with vs. without -ffast-math:

< 	v_cmp_nge_f32_e32 vcc, s4, v36
---
> 	v_cmp_lt_f32_e32 vcc, s4, v36

Does that mean -ffast-math didn't apply any optimizations to the math functions? Thanks!

xinyi-li7 avatar Jul 01 '23 00:07 xinyi-li7

Your floating-point computation operates entirely on floating-point constants. clang will optimize these and compute the results at compile time, where evaluation is always fully accurate. That's why you don't see any difference.

Try passing those FP constants as kernel arguments instead, so clang cannot fold them at compile time.
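For example (a minimal illustration):

__global__ void k_const(float* out) {
    out[0] = expf(1.5f);  // operand is a constant: clang folds this at
                          // compile time with full accuracy
}

__global__ void k_arg(float x, float* out) {
    out[0] = expf(x);     // operand is a kernel argument: evaluated on
                          // the GPU, so fast-math can change the result
}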

yxsamliu avatar Jul 01 '23 02:07 yxsamliu

Hi @yxsamliu, I've changed my computation code to:

  if (comp <= sinhf(powf(atan2f(atan2f(ceilf(comp), powf(var_1, sinhf(ldexpf(asinf(sqrtf(sinf(log10f(atan2f(expf(ldexpf(coshf(sinhf(tanhf(powf(fmodf(ldexpf(sinhf(-1.0233E21f), 2), acosf(+1.4791E-37f)), log10f(log10f(comp)))))), 2)), fabsf(sinf(logf(log10f(logf(acosf(ldexpf(+1.2971E8f, 2)))))))))))), 2)))), asinf(-1.7706E35f)), +0.0f))) {

Still, I didn't observe any differences in the .s files.

I was running the programs on an MI100. Could that affect which math APIs are usable? I mean, does the MI100 lack the multi-function mathematical units that NVIDIA GPUs provide? Thanks!

xinyi-li7 avatar Jul 04 '23 01:07 xinyi-li7

The expression you mentioned can be simplified to powrf(x, 0), which the compiler can optimize to 1; that does not depend on -ffast-math. https://godbolt.org/z/sWEYdbKcn

Try changing that +0.0f to a kernel arg as well.

yxsamliu avatar Jul 04 '23 13:07 yxsamliu

Also, your final result mostly depends on acosf(asinf(ceilf(asinf(-1.6714E-37f)))); the -1.6714E-37f should be changed to a kernel argument too. Otherwise you may not see a difference from -ffast-math.

yxsamliu avatar Jul 04 '23 13:07 yxsamliu

Hi @yxsamliu, thank you for your patient and considerate responses. I've modified the code based on your suggestions, and now I'm seeing the differences!

Presently, I'm in the process of evaluating the numerical computation units of NVIDIA and AMD GPUs.

In the examples I've experimented with, it appears that nvcc's --use_fast_math and hipcc's -ffast-math intermittently yield inaccurate or incorrect results. I'm still investigating this, but according to the documentation, nvcc's --use_fast_math implies the options --ftz=true --prec-div=false --prec-sqrt=false --fmad=true, and it also maps some single-precision mathematical functions to intrinsic functions. Conversely, hipcc's -ffast-math expands to -fno-honor-infinities -fno-honor-nans -fapprox-func -fno-math-errno -ffinite-math-only -fassociative-math -freciprocal-math -fno-signed-zeros -fno-trapping-math -fno-rounding-math -ffp-contract=fast.

Considering these options, it appears that nvcc's optimizations take advantage of NVIDIA's dedicated hardware units (MUFU) to expedite mathematical computations, while hipcc's optimizations for the AMD backend seem to lean more towards software-level adaptations.

I realize this is a complex topic, and you may not have all the answers. Nonetheless, based on your expertise, would you say my interpretation of the differences aligns with your understanding? I truly appreciate any insights you might have!

xinyi-li7 avatar Jul 05 '23 21:07 xinyi-li7

@b-sumner Do you know whether it is possible to have sin(float) emit ISA like __ocml_native_sin_f32 under certain options? Or are there always some additional computations, even with the most relaxed options? Thanks.

yxsamliu avatar Jul 06 '23 16:07 yxsamliu

@yxsamliu it is possible, but of course it would not match the CUDA results on NVIDIA. Will that not be a problem?

b-sumner avatar Jul 06 '23 16:07 b-sumner

@b-sumner I see that the ISA for sin(float) and __sinf differ with -ffast-math -O3: https://godbolt.org/z/4vYdc6f86. What could cause this? Thanks.

yxsamliu avatar Jul 06 '23 16:07 yxsamliu

@yxsamliu __sinf is a CUDA/HIP function that is implemented with a call to the native sin function, while sin is implemented with a call to the regular OCML sin function. That could presumably be changed using the __FAST_MATH__ macro.

b-sumner avatar Jul 06 '23 16:07 b-sumner

@b-sumner You mean modifying the HIP headers to implement sin(float) as __sinf when __FAST_MATH__ is defined, since clang predefines the macro __FAST_MATH__ for -ffast-math? That sounds like a viable solution. I see CUDA clang does a similar thing.
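Roughly like this in the headers (just a sketch of the idea, not the final implementation):

#if defined(__FAST_MATH__)
// Route the standard name to the fast intrinsic, which lowers to
// the native hardware approximation.
__device__ static inline float sinf(float x) { return __sinf(x); }
#endif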

yxsamliu avatar Jul 06 '23 18:07 yxsamliu

@yxsamliu exactly. However, I'd note that implementing this could break existing applications that somehow depend on the higher accuracy.

b-sumner avatar Jul 06 '23 19:07 b-sumner

@b-sumner I am thinking we could probably introduce a new option, -f[no-]hip-use-native, to control whether native implementations are used for math functions. We can make it off by default and have -ffast-math imply it on, while allowing it to be overridden by -fno-hip-use-native. If regressions happen, users can add -fno-hip-use-native to get the previous behaviour back.

yxsamliu avatar Jul 06 '23 20:07 yxsamliu

Sounds reasonable to me. We'll need to be sure to document this change.

b-sumner avatar Jul 06 '23 20:07 b-sumner

Hi @yxsamliu and @b-sumner,

I sincerely appreciate your discussion and responses. IMHO, implementing the changes discussed could greatly improve the experience of users coming from NVIDIA GPUs, because aligning with the CUDA documentation would remove the need to tweak each math function individually in order to achieve similar results with HIP.

In the case of AMD users, it might be useful for them to understand the potential impact of this optimization. As of now, I'm conducting tests to evaluate the performance and numerical consistency for both GPUs. My current findings indicate that enabling --use_fast_math on CUDA generally results in a performance boost, with correct outcomes in approximately 80% of cases.

Testing for the AMD backend is ongoing, but significant performance improvements may be less likely, assuming there's no equivalent hardware support on AMD GPUs.

Thank you for considering these thoughts.

xinyi-li7 avatar Jul 06 '23 21:07 xinyi-li7

https://reviews.llvm.org/D154790 https://reviews.llvm.org/D154797

With these, we enabled native math functions under -ffast-math. This can be disabled with -fno-gpu-approx-transcendentals.
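Usage would be along these lines:

hipcc -ffast-math test.hip -o test
    # sinf etc. now lower to the native approximations
hipcc -ffast-math -fno-gpu-approx-transcendentals test.hip -o test
    # keep -ffast-math but retain the precise OCML implementations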

yxsamliu avatar Jul 25 '23 17:07 yxsamliu

Hi @yxsamliu, thank you so much! To use this feature, shall I install the latest ROCm?

xinyi-li7 avatar Jul 25 '23 21:07 xinyi-li7

It will take some time to get to a future release of ROCm.

yxsamliu avatar Jul 25 '23 22:07 yxsamliu

Sure! I'd like to know when it is available! Thanks!

xinyi-li7 avatar Jul 25 '23 23:07 xinyi-li7

@xinyi-li7 Can you please test with the latest ROCm 6.1.0 (HIP 6.1)? If it's resolved, please close the ticket. Thanks!

ppanchad-amd avatar Apr 25 '24 18:04 ppanchad-amd

@xinyi-li7 -ffast-math is now available in the latest ROCm 6.1.1. Thanks!

ppanchad-amd avatar May 31 '24 14:05 ppanchad-amd