HIP icon indicating copy to clipboard operation
HIP copied to clipboard

hipcc optimization issue on MI100, gfx908

Open ramin-raeisi opened this issue 2 years ago • 36 comments

Hello *,

hipcc compiler seems to be messing up the execution results of a specific add kernel on MI100, gfx908 card.

The results are the same on my RX Vega 56, using -O0 and -O3, and on MI100, using -O0, nevertheless, results are incorrect in the case of using -O3 on MI100.

Please refer to https://github.com/ramin-raeisi/HIP-Optimization-Bug for a ready-to-verify piece of code.

I do appreciate your help and support, I am afraid this should be a bug of the compiler.

Ramin

ramin-raeisi avatar Apr 16 '22 14:04 ramin-raeisi

Hello @ramin-raeisi ,

Thank you for brining this in to notice. Usually the ideal place to post problems is this repo : https://github.com/RadeonOpenCompute/ROCm/issues

Let me have a deeper look in to the problem you are facing and get back to you.

ROCmSupport avatar Apr 22 '22 04:04 ROCmSupport

@ROCmSupport ,

I look forward to having your assessment result.

ramin-raeisi avatar Apr 25 '22 05:04 ramin-raeisi

@ramin-raeisi What are the expected results ? The outputs from gfx900 using O0 and O3 also seem different. Thanks.

zjin-lcf avatar Apr 26 '22 21:04 zjin-lcf

@ROCmSupport, @zjin-lcf

I noticed a small issue in https://github.com/ramin-raeisi/HIP-Optimization-Bug and fixed it, however, it had nothing to do with the optimization problem.

I had also forgotten to mention that I had a similar issue with gfx900 as well and I had to temporarily disable the optimization for the G1_add function using [[clang::optnone]]. Having the optnone for the G1_add function, all the results would be the same for both gfx900 and 908 and equal to the expected values coming from the CUDA version.

Please observe the following branches:

  • https://github.com/ramin-raeisi/HIP-Optimization-Bug/tree/verified-cuda-hip
  • https://github.com/ramin-raeisi/HIP-Optimization-Bug/tree/optnone

ramin-raeisi avatar Apr 27 '22 10:04 ramin-raeisi

There are in-line assembly codes in the example for efficiency. I wonder if they may be converted to higher level HIP codes for portability.

zjin-lcf avatar Apr 29 '22 12:04 zjin-lcf

Would you please try the following patch to the g1_add.cu? diff --git a/g1_add.cu b/g1_add.cu index 7922282..f740027 100644 --- a/g1_add.cu +++ b/g1_add.cu @@ -675,7 +675,7 @@ DEVICE Fq Fq_sub_nvidia(Fq a, Fq b) { asm volatile("v_subb_co_u32 %0, vcc, %0, %1, vcc" : "+v"(a.val[9]): "v"(b.val[9]): "vcc"); asm volatile("v_subb_co_u32 %0, vcc, %0, %1, vcc" : "+v"(a.val[10]): "v"(b.val[10]): "vcc"); asm volatile("v_subb_co_u32 %0, vcc, %0, %1, vcc" : "+v"(a.val[11]): "v"(b.val[11]): "vcc");

  • asm volatile("v_sub_co_u32 %0, %0, %1" : "+v"(a.val[11]): "v"(0));
  • asm volatile("v_sub_co_u32 %0, %0, %1" : "+v"(a.val[11]): "v"(0): "vcc"); return a; }

changpeng avatar May 03 '22 18:05 changpeng

@changpeng , Would you please provide the patch file, the text here has some issues I think?

ramin-raeisi avatar May 03 '22 19:05 ramin-raeisi

Note sure how to attach a file here. But for "DEVICE Fq Fq_sub_nvidia(Fq a, Fq b)", you just need to change the last line:

minus: asm volatile("v_sub_co_u32 %0, %0, %1" : "+v"(a.val[11]): "v"(0)); plus: asm volatile("v_sub_co_u32 %0, %0, %1" : "+v"(a.val[11]): "v"(0): "vcc");

changpeng avatar May 03 '22 19:05 changpeng

@changpeng , I did the same and verified against -O3, assertion, on which branch did you test?

ramin-raeisi avatar May 03 '22 19:05 ramin-raeisi

https://github.com/ramin-raeisi/HIP-Optimization-Bug

changpeng avatar May 03 '22 19:05 changpeng

You can also try [[clang::optnone]] to Fq_sub_nvidia

changpeng avatar May 03 '22 19:05 changpeng

I checked both, got assertion for both, what hipcc version have you checked with and what did you get? No assertion? And with -O3?

ramin-raeisi avatar May 03 '22 19:05 ramin-raeisi

With -O3. I tried with rocm 5.1 and later I think. What assert did you get (surprise that disable opt for a function will lead to an assert)

changpeng avatar May 03 '22 19:05 changpeng

HIP version: 5.0.13601-ded05588
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.0.2/llvm/bin
x: 3219994337, 3268783046, 4031727028, 2691389613, 2832830248, 685174683, 2324737265, 2966079159, 2749396165, 3350077799, 2693838205, 102950736, 
y: 1921116284, 4154027525, 4131808175, 3732338784, 2112707696, 878390733, 3194244456, 3030363306, 3771752044, 2941988557, 962708640, 231501946, 
z: 1125492406, 112054852, 4266465428, 1421064550, 1600350728, 64230541, 1680272999, 3376539102, 1417617147, 2686512101, 1145475024, 431544102, 

g1_add: g1_add.cu:1229: int main(): Assertion `flattenResult[i]==cudaExpected[i]' failed.
Aborted (core dumped)

I meant incorrect expected value

ramin-raeisi avatar May 03 '22 19:05 ramin-raeisi

I am trying to get a rocm 5.0 to re-verify. Will get back to you soon.

changpeng avatar May 03 '22 19:05 changpeng

I am using the same compiler but can pass with -O3,

AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /home/chfang/tmp/opt/rocm-5.0.2/llvm/bin

x: 695163763, 838428337, 867136025, 3916970060, 1083605276, 2882035772, 3603006931, 2269309842, 422274527, 1169772790, 1990394245, 416975321, y: 1229022948, 3366429108, 670218974, 1658335027, 392632874, 1379067484, 798160530, 3656524164, 3793686573, 2144155088, 2721370348, 298035558, z: 413203031, 3318893592, 1282426328, 1145762026, 1542369093, 485346739, 1679000480, 4026228341, 2371190916, 3558967189, 3094593878, 414846589,

No error

changpeng avatar May 03 '22 19:05 changpeng

Just by changing that single line or having [[clang::optnone]] for Fq_sub_nvidia? on the main branch of https://github.com/ramin-raeisi/HIP-Optimization-Bug? Weird, Let me try on my another machine as well

ramin-raeisi avatar May 03 '22 20:05 ramin-raeisi

Please verify if your latest commit is d29000b84410a19448b709ccd18c3b09de1706a0

ramin-raeisi avatar May 03 '22 20:05 ramin-raeisi

commit d29000b84410a19448b709ccd18c3b09de1706a0 (HEAD -> main, origin/verified-cuda-hip, origin/main, origin/HEAD) Author: Ramin Raeise [email protected] Date: Wed Apr 27 14:06:55 2022 +0430

changpeng avatar May 03 '22 20:05 changpeng

Having your patch done on my Radeon RX Vega 56 machine, it passes! But not on MI100 machine, however I used to have multiple hip versions there, I think I need to go for a clean test and see. I assume you have tested on a MI100 machine, right?

ramin-raeisi avatar May 03 '22 20:05 ramin-raeisi

Right. I tested on MI100 and also a vega20 (gfx906) system

changpeng avatar May 03 '22 20:05 changpeng

I will do a clean test and inform, that was a great catch @changpeng , appreciate that.

ramin-raeisi avatar May 03 '22 20:05 ramin-raeisi

@changpeng It is failing on my MI100 machine

ramin-raeisi avatar May 04 '22 18:05 ramin-raeisi

@changpeng Another witnessed behavior, hipcc would hang if I add [[clang::optnone]] for all of the assembly functions, this was another weird behavior, could you please verify this on your side?

ramin-raeisi avatar May 04 '22 18:05 ramin-raeisi

@ramin-raeisi : it also hangs on my side. Could not understand it still fails on you MI100 system. is it possible to update your rocm to 5.1 and retry?

changpeng avatar May 04 '22 19:05 changpeng

Are you sure you are not doing the change on Fr_sub_nvidia(Fr a, Fr b), @ramin-raeisi ?

changpeng avatar May 04 '22 19:05 changpeng

Are you sure you are not doing the change on Fr_sub_nvidia(Fr a, Fr b), @ramin-raeisi ?

Yes, both, adding"vcc", and [[clang::optnone]] tested. It was ok on my Radeon RX Vega 56, but not on my MI100.

ramin-raeisi avatar May 04 '22 20:05 ramin-raeisi

Fr_sub_nvidia or Fq_sub_nvidia? We need to do for the later Fq_sub_nvidia

changpeng avatar May 04 '22 20:05 changpeng

Yes, for Fq_sub_nvidia

ramin-raeisi avatar May 04 '22 20:05 ramin-raeisi

@changpeng Do you want to debug and troubleshoot this on my MI100 machine?

ramin-raeisi avatar May 04 '22 20:05 ramin-raeisi