hcc icon indicating copy to clipboard operation
hcc copied to clipboard

Memory access fault by GPU node-2 on Fiji gfx803 (with test cases and ISA)

Open ex-rzr opened this issue 6 years ago • 9 comments

This issue is related to https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/52 https://github.com/ROCmSoftwarePlatform/rocPRIM/issues/15

HCC 1.3.18482 (ROCm 2.0)

Memory access fault by GPU node-2 (Agent handle: 0x7f3e4aa1b090) on address 0x1e04405000. Reason: Page not present or supervisor privilege.

https://github.com/ROCmSoftwarePlatform/rocPRIM/tree/for_hcc_issue_1024

This branch has the workaround disabled and only one active test for simplicity (https://github.com/ROCmSoftwarePlatform/rocPRIM/commit/e9753c4ca92eb3d8b88a4bdb0e81ce763c86cea5).

git clone -b for_hcc_issue_1024 https://github.com/ROCmSoftwarePlatform/rocPRIM.git
cd rocPRIM
mkdir build && cd build
CXX=hcc cmake ..
make test_hip_device_scan && test/rocprim/test_hip_device_scan

https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/for_hcc_issue_1024/rocprim/include/rocprim/device/detail/device_scan_lookback.hpp#L279

Both branches of if statement (if(flat_block_id == 0)) can be executed for a thread block with flat_block_id == 0. I checked it by disabling some parts of the code and writing intermediate values to buffers, and indeed, both if-branch and else-branch are executed leading to incorrect memory access.

So this "workaround" separates two branches of the if statement. syncthreads and the second if are not required for the algorithm. This change makes the kernel work correctly, however, without any guarantee (I mean it's possible that for other types and scan op, for example, the workaround won't help).

ISA for this test: isa_memory_access_fault.zip

test_hip_device_scan-gfx803-workaround.isa and test_hip_device_scan-gfx900-workaround.isa, correct results:

	s_barrier                                                  // 0000000035A0: BF8A0000 
	v_cmp_ne_u32_e32 vcc, 0, v4                                // 0000000035A4: 7D9A0880 
	s_and_b64 vcc, exec, vcc                                   // 0000000035A8: 86EA6A7E 
	s_cbranch_vccz BB1_40                                      // 0000000035AC: BF860015 

test_hip_device_scan-gfx900.isa (without workaround), correct results:

	v_cmp_ne_u32_e32 vcc, 0, v5                                // 00000000361C: 7D9A0A80 
	s_and_b64 vcc, exec, vcc                                   // 000000003620: 86EA6A7E 

	... no vcc modification

	v_add_u32_e32 v28, -1, v28                                 // 000000003758: 683838C1 
	v_add_u32_e32 v26, v27, v26                                // 00000000375C: 6834351B 
	v_lshrrev_b32_e32 v27, 6, v0                               // 000000003760: 20360086 
	v_cmp_eq_u32_e64 s[6:7], v28, v0                           // 000000003764: D0CA0006 0002011C 
	s_cbranch_vccz BB1_64                                      // 00000000376C: BF860144 

test_hip_device_scan-gfx803.isa (without workaround), incorrect results:

	... no v_cmp_ne_u32_e32 vcc, 0, v5 equivalent

	v_add_u32_e32 v28, vcc, -1, v28                            // 00000000375C: 323838C1 
	s_and_b64 vcc, exec, vcc                                   // 000000003760: 86EA6A7E <-- totally unrelated value?
	v_cmp_eq_u32_e64 s[6:7], v28, v0                           // 000000003764: D0CA0006 0002011C 
	s_cbranch_vccz BB1_64                                      // 00000000376C: BF860149 

Here v4 and v5 are unsinged int flat_block_id, i.e. the variable in the if.

Another kernel with a similar problem:

tabulate-gfx900.isa, correct results:

	v_cmp_ne_u32_e32 vcc, s6, v3                               // 000000002358: 7D9A0606 
	s_addc_u32 s2, s13, s3                                     // 00000000235C: 8202030D 
	v_mov_b32_e32 v1, 0                                        // 000000002360: 7E020280 
	s_and_b64 vcc, exec, vcc                                   // 000000002364: 86EA6A7E 
	v_add_u32_e32 v3, s0, v4                                   // 000000002368: 68060800 
	s_cbranch_vccz BB1_2                                       // 00000000236C: BF86000A 

tabulate-gfx803.isa, incorrect results:

	v_add_u32_e32 v3, vcc, s0, v1                              // 000000002340: 32060200 
	s_and_b64 vcc, exec, vcc                                   // 000000002344: 86EA6A7E 
	v_mov_b32_e32 v1, 0                                        // 000000002348: 7E020280 
	s_cbranch_vccz BB1_2                                       // 00000000234C: BF86000A 

In both cases, for gfx803, the compiler overwrites vcc register in v_add, so v_cmp_* vcc is completely removed (I suppose it was there just like for gfx900), s_cbranch_vccz proceeds with the ruined value of vcc.

ex-rzr avatar Feb 05 '19 08:02 ex-rzr

@ex-rzr could you point back to the specific source lines which are being miscompiled and describe the changes you made to avoid the problem. Anything you can do to help us quickly zero in on the the problem without having to try to understand the entire application and rocPRIM library will save some time.

b-sumner avatar Feb 05 '19 14:02 b-sumner

@b-sumner I've created a branch with the workaround disabled and added some instructions. I hope, it's more clear now.

ex-rzr avatar Feb 05 '19 15:02 ex-rzr

This issue is not fixed in ROCm 2.3.

But good news: I managed to create a small test case without external dependencies: https://gist.github.com/ex-rzr/98e429dad13aa17ba1e67fc7145ba827

hipcc vcc.cpp -o vcc && ./vcc

The last 10 items are a "canary" area, they must be 0 (not modified by the kernel). Like this (MI25, gfx900):

...
995	995
996	996
997	997
998	998
999	999
1000	0
1001	0
1002	0
1003	0
1004	0
1005	0
1006	0
1007	0
1008	0
1009	0

However, on Fiji (gfx803)

...
995	995
996	996
997	997
998	998
999	999
1000	1000
1001	1001
1002	1002
1003	1003
1004	1004
1005	1005
1006	1006
1007	1007
1008	1008
1009	1009

(So it's just a good luck that there is no "Memory access fault").

The kernel looks weird and stupid, but the problem with this issue is that any change in the kernel, types, etc. may lead to correct instructions.

Here are ISAs: vcc.zip

Again, v_add* modifies vcc, s_cbranch_vcc* uses this irrelevant value (v_cmp* is missing):

	v_add_u32_e32 v3, vcc, v3, v0                              // 000000001140: 32060103 
	v_mov_b32_e32 v2, s1                                       // 000000001144: 7E040201 
	v_add_u32_e64 v1, s[0:1], s0, v1                           // 000000001148: D1190001 00020200 
	s_and_b64 vcc, exec, vcc                                   // 000000001150: 86EA6A7E 
	v_addc_u32_e64 v2, s[0:1], 0, v2, s[0:1]                   // 000000001154: D11C0002 00020480 
	s_cbranch_vccnz BB0_2                                      // 00000000115C: BF870007 

gfx900 has correct instructions v_cmp_eq_u32_e32 vcc -> s_cbranch_vccnz.

	v_cmp_eq_u32_e32 vcc, s8, v3                               // 0000000011E4: 7D940608 
	s_and_b64 vcc, exec, vcc                                   // 0000000011E8: 86EA6A7E 
	v_add_u32_e32 v3, v4, v0                                   // 0000000011EC: 68060104 
	s_cbranch_vccnz BB0_2                                      // 0000000011F0: BF870007 

ex-rzr avatar Apr 19 '19 11:04 ex-rzr

Another example (a test case from rocPRIM):

gfx803:

	v_add_u32_e32 v1, vcc, s1, v1                              // 00000000111C: 32020201 
	s_and_b64 vcc, exec, vcc                                   // 000000001120: 86EA6A7E 
	s_cbranch_vccnz BB0_2                                      // 000000001124: BF870073

gfx900:

	v_cmp_eq_u32_e32 vcc, s8, v1                               // 0000000011B8: 7D940208 
	s_and_b64 vcc, exec, vcc                                   // 0000000011BC: 86EA6A7E 
	v_add_u32_e32 v1, s2, v2                                   // 0000000011C0: 68020402 
	s_cbranch_vccnz BB0_2                                      // 0000000011C4: BF870069

test_hip_counting_iterator.zip

ex-rzr avatar Apr 19 '19 11:04 ex-rzr

Thank you for the helpful new standalone code. And let me apologize for the delay on getting to this.

b-sumner avatar Apr 19 '19 15:04 b-sumner

https://github.com/llvm-mirror/llvm/commit/aa2f57df85af2056eafbef94380664a55570fa83 should fix this particular case

arsenm avatar May 09 '19 10:05 arsenm

@b-sumner, can you comment on when HCC might pickup this fix?

skeelyamd avatar Jun 05 '19 06:06 skeelyamd

On Jun 4, 2019, at 8:56 PM, Sean Keely [email protected] wrote:

@b-sumner https://github.com/b-sumner or @arsenm https://github.com/arsenm: It looks like the compiler fix is in. Can we close this issue?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/RadeonOpenCompute/hcc/issues/1024?email_source=notifications&email_token=AABBYY7YL3Y3Y334CSZLANTPY4FK5A5CNFSM4GUK4ZJ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW6JCMA#issuecomment-498897200, or mute the thread https://github.com/notifications/unsubscribe-auth/AABBYY75EX4NKQO2VKFGS5TPY4FK5ANCNFSM4GUK4ZJQ.

I don’t have the option of closing it

arsenm avatar Jun 05 '19 13:06 arsenm

Currently the fix is only available in HCC's development branch (clang_tot_upgrade), you'll have to build the compiler from source to get it.

scchan avatar Jun 05 '19 15:06 scchan