HIP hipcc optimization issue on MI100, gfx908

hipcc optimization issue on MI100, gfx908

Open ramin-raeisi opened this issue 2 years ago • 36 comments

Hello *,

hipcc compiler seems to be messing up the execution results of a specific add kernel on MI100, gfx908 card.

The results are the same on my RX Vega 56, using -O0 and -O3, and on MI100, using -O0, nevertheless, results are incorrect in the case of using -O3 on MI100.

Please refer to https://github.com/ramin-raeisi/HIP-Optimization-Bug for a ready-to-verify piece of code.

I do appreciate your help and support, I am afraid this should be a bug of the compiler.

Ramin

Apr 16 '22 14:04 ramin-raeisi

Hello @ramin-raeisi ,

Thank you for brining this in to notice. Usually the ideal place to post problems is this repo : https://github.com/RadeonOpenCompute/ROCm/issues

Let me have a deeper look in to the problem you are facing and get back to you.

Apr 22 '22 04:04 ROCmSupport

@ROCmSupport ,

I look forward to having your assessment result.

Apr 25 '22 05:04 ramin-raeisi

@ramin-raeisi What are the expected results ? The outputs from gfx900 using O0 and O3 also seem different. Thanks.

Apr 26 '22 21:04 zjin-lcf

@ROCmSupport, @zjin-lcf

I noticed a small issue in https://github.com/ramin-raeisi/HIP-Optimization-Bug and fixed it, however, it had nothing to do with the optimization problem.

I had also forgotten to mention that I had a similar issue with gfx900 as well and I had to temporarily disable the optimization for the G1_add function using [[clang::optnone]]. Having the optnone for the G1_add function, all the results would be the same for both gfx900 and 908 and equal to the expected values coming from the CUDA version.

Please observe the following branches:

https://github.com/ramin-raeisi/HIP-Optimization-Bug/tree/verified-cuda-hip
https://github.com/ramin-raeisi/HIP-Optimization-Bug/tree/optnone

Apr 27 '22 10:04 ramin-raeisi

There are in-line assembly codes in the example for efficiency. I wonder if they may be converted to higher level HIP codes for portability.

Apr 29 '22 12:04 zjin-lcf

Would you please try the following patch to the g1_add.cu? diff --git a/g1_add.cu b/g1_add.cu index 7922282..f740027 100644 --- a/g1_add.cu +++ b/g1_add.cu @@ -675,7 +675,7 @@ DEVICE Fq Fq_sub_nvidia(Fq a, Fq b) { asm volatile("v_subb_co_u32 %0, vcc, %0, %1, vcc" : "+v"(a.val[9]): "v"(b.val[9]): "vcc"); asm volatile("v_subb_co_u32 %0, vcc, %0, %1, vcc" : "+v"(a.val[10]): "v"(b.val[10]): "vcc"); asm volatile("v_subb_co_u32 %0, vcc, %0, %1, vcc" : "+v"(a.val[11]): "v"(b.val[11]): "vcc");

asm volatile("v_sub_co_u32 %0, %0, %1" : "+v"(a.val[11]): "v"(0));

asm volatile("v_sub_co_u32 %0, %0, %1" : "+v"(a.val[11]): "v"(0): "vcc"); return a; }

May 03 '22 18:05 changpeng

@changpeng , Would you please provide the patch file, the text here has some issues I think?

May 03 '22 19:05 ramin-raeisi

Note sure how to attach a file here. But for "DEVICE Fq Fq_sub_nvidia(Fq a, Fq b)", you just need to change the last line:

minus: asm volatile("v_sub_co_u32 %0, %0, %1" : "+v"(a.val[11]): "v"(0)); plus: asm volatile("v_sub_co_u32 %0, %0, %1" : "+v"(a.val[11]): "v"(0): "vcc");

May 03 '22 19:05 changpeng

@changpeng , I did the same and verified against -O3, assertion, on which branch did you test?

May 03 '22 19:05 ramin-raeisi

https://github.com/ramin-raeisi/HIP-Optimization-Bug

May 03 '22 19:05 changpeng

You can also try [[clang::optnone]] to Fq_sub_nvidia

May 03 '22 19:05 changpeng

I checked both, got assertion for both, what hipcc version have you checked with and what did you get? No assertion? And with -O3?

May 03 '22 19:05 ramin-raeisi

With -O3. I tried with rocm 5.1 and later I think. What assert did you get (surprise that disable opt for a function will lead to an assert)

May 03 '22 19:05 changpeng

HIP version: 5.0.13601-ded05588
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.0.2/llvm/bin

x: 3219994337, 3268783046, 4031727028, 2691389613, 2832830248, 685174683, 2324737265, 2966079159, 2749396165, 3350077799, 2693838205, 102950736, 
y: 1921116284, 4154027525, 4131808175, 3732338784, 2112707696, 878390733, 3194244456, 3030363306, 3771752044, 2941988557, 962708640, 231501946, 
z: 1125492406, 112054852, 4266465428, 1421064550, 1600350728, 64230541, 1680272999, 3376539102, 1417617147, 2686512101, 1145475024, 431544102, 

g1_add: g1_add.cu:1229: int main(): Assertion `flattenResult[i]==cudaExpected[i]' failed.
Aborted (core dumped)

I meant incorrect expected value

May 03 '22 19:05 ramin-raeisi

I am trying to get a rocm 5.0 to re-verify. Will get back to you soon.

May 03 '22 19:05 changpeng

I am using the same compiler but can pass with -O3,

AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /home/chfang/tmp/opt/rocm-5.0.2/llvm/bin

x: 695163763, 838428337, 867136025, 3916970060, 1083605276, 2882035772, 3603006931, 2269309842, 422274527, 1169772790, 1990394245, 416975321, y: 1229022948, 3366429108, 670218974, 1658335027, 392632874, 1379067484, 798160530, 3656524164, 3793686573, 2144155088, 2721370348, 298035558, z: 413203031, 3318893592, 1282426328, 1145762026, 1542369093, 485346739, 1679000480, 4026228341, 2371190916, 3558967189, 3094593878, 414846589,

No error

May 03 '22 19:05 changpeng

Just by changing that single line or having [[clang::optnone]] for Fq_sub_nvidia? on the main branch of https://github.com/ramin-raeisi/HIP-Optimization-Bug? Weird, Let me try on my another machine as well

May 03 '22 20:05 ramin-raeisi

Please verify if your latest commit is d29000b84410a19448b709ccd18c3b09de1706a0

May 03 '22 20:05 ramin-raeisi

commit d29000b84410a19448b709ccd18c3b09de1706a0 (HEAD -> main, origin/verified-cuda-hip, origin/main, origin/HEAD) Author: Ramin Raeise [email protected] Date: Wed Apr 27 14:06:55 2022 +0430

May 03 '22 20:05 changpeng

Having your patch done on my Radeon RX Vega 56 machine, it passes! But not on MI100 machine, however I used to have multiple hip versions there, I think I need to go for a clean test and see. I assume you have tested on a MI100 machine, right?

May 03 '22 20:05 ramin-raeisi

Right. I tested on MI100 and also a vega20 (gfx906) system

May 03 '22 20:05 changpeng

I will do a clean test and inform, that was a great catch @changpeng , appreciate that.

May 03 '22 20:05 ramin-raeisi

@changpeng It is failing on my MI100 machine

May 04 '22 18:05 ramin-raeisi

@changpeng Another witnessed behavior, hipcc would hang if I add [[clang::optnone]] for all of the assembly functions, this was another weird behavior, could you please verify this on your side?

May 04 '22 18:05 ramin-raeisi

@ramin-raeisi : it also hangs on my side. Could not understand it still fails on you MI100 system. is it possible to update your rocm to 5.1 and retry?

May 04 '22 19:05 changpeng

Are you sure you are not doing the change on Fr_sub_nvidia(Fr a, Fr b), @ramin-raeisi ?

May 04 '22 19:05 changpeng

Are you sure you are not doing the change on Fr_sub_nvidia(Fr a, Fr b), @ramin-raeisi ?

Yes, both, adding"vcc", and [[clang::optnone]] tested. It was ok on my Radeon RX Vega 56, but not on my MI100.

May 04 '22 20:05 ramin-raeisi

Fr_sub_nvidia or Fq_sub_nvidia? We need to do for the later Fq_sub_nvidia

May 04 '22 20:05 changpeng

Yes, for Fq_sub_nvidia

May 04 '22 20:05 ramin-raeisi

@changpeng Do you want to debug and troubleshoot this on my MI100 machine?

May 04 '22 20:05 ramin-raeisi

HIP HIP copied to clipboard

hipcc optimization issue on MI100, gfx908

HIP
HIP copied to clipboard