HIP
HIP copied to clipboard
hipcc optimization issue on MI100, gfx908
Hello *,
hipcc compiler seems to be messing up the execution results of a specific add kernel on MI100, gfx908 card.
The results are the same on my RX Vega 56, using -O0 and -O3, and on MI100, using -O0, nevertheless, results are incorrect in the case of using -O3 on MI100.
Please refer to https://github.com/ramin-raeisi/HIP-Optimization-Bug for a ready-to-verify piece of code.
I do appreciate your help and support, I am afraid this should be a bug of the compiler.
Ramin
Hello @ramin-raeisi ,
Thank you for brining this in to notice. Usually the ideal place to post problems is this repo : https://github.com/RadeonOpenCompute/ROCm/issues
Let me have a deeper look in to the problem you are facing and get back to you.
@ROCmSupport ,
I look forward to having your assessment result.
@ramin-raeisi What are the expected results ? The outputs from gfx900 using O0 and O3 also seem different. Thanks.
@ROCmSupport, @zjin-lcf
I noticed a small issue in https://github.com/ramin-raeisi/HIP-Optimization-Bug and fixed it, however, it had nothing to do with the optimization problem.
I had also forgotten to mention that I had a similar issue with gfx900 as well and I had to temporarily disable the optimization for the G1_add function using [[clang::optnone]]. Having the optnone for the G1_add function, all the results would be the same for both gfx900 and 908 and equal to the expected values coming from the CUDA version.
Please observe the following branches:
- https://github.com/ramin-raeisi/HIP-Optimization-Bug/tree/verified-cuda-hip
- https://github.com/ramin-raeisi/HIP-Optimization-Bug/tree/optnone
There are in-line assembly codes in the example for efficiency. I wonder if they may be converted to higher level HIP codes for portability.
Would you please try the following patch to the g1_add.cu? diff --git a/g1_add.cu b/g1_add.cu index 7922282..f740027 100644 --- a/g1_add.cu +++ b/g1_add.cu @@ -675,7 +675,7 @@ DEVICE Fq Fq_sub_nvidia(Fq a, Fq b) { asm volatile("v_subb_co_u32 %0, vcc, %0, %1, vcc" : "+v"(a.val[9]): "v"(b.val[9]): "vcc"); asm volatile("v_subb_co_u32 %0, vcc, %0, %1, vcc" : "+v"(a.val[10]): "v"(b.val[10]): "vcc"); asm volatile("v_subb_co_u32 %0, vcc, %0, %1, vcc" : "+v"(a.val[11]): "v"(b.val[11]): "vcc");
- asm volatile("v_sub_co_u32 %0, %0, %1" : "+v"(a.val[11]): "v"(0));
- asm volatile("v_sub_co_u32 %0, %0, %1" : "+v"(a.val[11]): "v"(0): "vcc"); return a; }
@changpeng , Would you please provide the patch file, the text here has some issues I think?
Note sure how to attach a file here. But for "DEVICE Fq Fq_sub_nvidia(Fq a, Fq b)", you just need to change the last line:
minus: asm volatile("v_sub_co_u32 %0, %0, %1" : "+v"(a.val[11]): "v"(0)); plus: asm volatile("v_sub_co_u32 %0, %0, %1" : "+v"(a.val[11]): "v"(0): "vcc");
@changpeng , I did the same and verified against -O3, assertion, on which branch did you test?
https://github.com/ramin-raeisi/HIP-Optimization-Bug
You can also try [[clang::optnone]] to Fq_sub_nvidia
I checked both, got assertion for both, what hipcc version have you checked with and what did you get? No assertion? And with -O3?
With -O3. I tried with rocm 5.1 and later I think. What assert did you get (surprise that disable opt for a function will lead to an assert)
HIP version: 5.0.13601-ded05588
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.0.2/llvm/bin
x: 3219994337, 3268783046, 4031727028, 2691389613, 2832830248, 685174683, 2324737265, 2966079159, 2749396165, 3350077799, 2693838205, 102950736,
y: 1921116284, 4154027525, 4131808175, 3732338784, 2112707696, 878390733, 3194244456, 3030363306, 3771752044, 2941988557, 962708640, 231501946,
z: 1125492406, 112054852, 4266465428, 1421064550, 1600350728, 64230541, 1680272999, 3376539102, 1417617147, 2686512101, 1145475024, 431544102,
g1_add: g1_add.cu:1229: int main(): Assertion `flattenResult[i]==cudaExpected[i]' failed.
Aborted (core dumped)
I meant incorrect expected value
I am trying to get a rocm 5.0 to re-verify. Will get back to you soon.
I am using the same compiler but can pass with -O3,
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.0.2 22065 030a405a181176f1a7749819092f4ef8ea5f0758) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /home/chfang/tmp/opt/rocm-5.0.2/llvm/bin
x: 695163763, 838428337, 867136025, 3916970060, 1083605276, 2882035772, 3603006931, 2269309842, 422274527, 1169772790, 1990394245, 416975321, y: 1229022948, 3366429108, 670218974, 1658335027, 392632874, 1379067484, 798160530, 3656524164, 3793686573, 2144155088, 2721370348, 298035558, z: 413203031, 3318893592, 1282426328, 1145762026, 1542369093, 485346739, 1679000480, 4026228341, 2371190916, 3558967189, 3094593878, 414846589,
No error
Just by changing that single line or having [[clang::optnone]] for Fq_sub_nvidia? on the main branch of https://github.com/ramin-raeisi/HIP-Optimization-Bug? Weird, Let me try on my another machine as well
Please verify if your latest commit is d29000b84410a19448b709ccd18c3b09de1706a0
commit d29000b84410a19448b709ccd18c3b09de1706a0 (HEAD -> main, origin/verified-cuda-hip, origin/main, origin/HEAD) Author: Ramin Raeise [email protected] Date: Wed Apr 27 14:06:55 2022 +0430
Having your patch done on my Radeon RX Vega 56 machine, it passes! But not on MI100 machine, however I used to have multiple hip versions there, I think I need to go for a clean test and see. I assume you have tested on a MI100 machine, right?
Right. I tested on MI100 and also a vega20 (gfx906) system
I will do a clean test and inform, that was a great catch @changpeng , appreciate that.
@changpeng It is failing on my MI100 machine
@changpeng Another witnessed behavior, hipcc would hang if I add [[clang::optnone]] for all of the assembly functions, this was another weird behavior, could you please verify this on your side?
@ramin-raeisi : it also hangs on my side. Could not understand it still fails on you MI100 system. is it possible to update your rocm to 5.1 and retry?
Are you sure you are not doing the change on Fr_sub_nvidia(Fr a, Fr b), @ramin-raeisi ?
Are you sure you are not doing the change on Fr_sub_nvidia(Fr a, Fr b), @ramin-raeisi ?
Yes, both, adding"vcc", and [[clang::optnone]] tested. It was ok on my Radeon RX Vega 56, but not on my MI100.
Fr_sub_nvidia or Fq_sub_nvidia? We need to do for the later Fq_sub_nvidia
Yes, for Fq_sub_nvidia
@changpeng Do you want to debug and troubleshoot this on my MI100 machine?