ROCm-OpenCL-Runtime
Darktable OpenCL compute results are wrong with ROCm (compared to amdgpu-pro and others)
Over roughly the last year of ROCm releases, an OpenCL compute problem has been introduced that affects one of the OpenCL kernels used in Darktable. The developers have shown that the compute results obtained with ROCm differ from those obtained with the AMDGPU-PRO OpenCL stack. The AMDGPU-PRO results match those obtained with other OpenCL compute hardware (e.g. NVIDIA), which seems to confirm that the problem is specific to ROCm. Also, as discussed in the GitHub issues below, very old ROCm releases (from mid to late 2018) did not exhibit this issue.
Here are more extensive details and tests: https://github.com/darktable-org/darktable/issues/3756#
An issue was opened against ROCm a year ago, here: https://github.com/RadeonOpenCompute/ROCm/issues/704
However, nobody from the ROCm team looked into it. I am opening this issue against the OpenCL runtime in the hope that the ROCm experts will take notice and provide guidance on what to test to narrow down this problem and hopefully reach a resolution. Currently, anyone doing photo editing with Darktable on AMD GPU hardware on a modern Linux distribution is at a severe disadvantage compared to users with non-AMD hardware.
@JasonTTang or @kzhuravl can we get help from AMD to fix this issue?
I believe I saw that the issue disappears when "-O0" or "-cl-opt-disable" is added to the BuildProgram options. Is that correct? Anyway, what would be ideal here, to quickly focus attention on the problem, is to:
- identify the kernel (or kernels?) producing unexpected results
- write a minimal standalone app that launches the problem kernel with known input data, compares the output against the known-good result, and reports pass or fail
I.e. a unit test for the kernel (or each kernel?). Ideally the developer would already have such a test for each kernel used by the application.
With such a test (and without any expertise in this application area or of this application) we can quickly evaluate the problem, and check that our fix actually works.
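For illustration only, here is a rough sketch of such a single-kernel pass/fail test built around a made-up `scale_add` kernel (this is not darktable code; the kernel, tolerance, and `CHECK` macro are hypothetical, and error handling is trimmed to keep the sketch short):

```c
/* Sketch: run known input through one kernel, compare against a host-computed
 * reference, print pass/fail. Build options come from the command line. */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <math.h>
#include <stdio.h>

#define CHECK(e) do { if ((e) != CL_SUCCESS) { \
    fprintf(stderr, "OpenCL error %d\n", (int)(e)); return 1; } } while (0)

static const char *src =
    "__kernel void scale_add(__global const float *in, __global float *out,\n"
    "                        const float a, const float b)\n"
    "{\n"
    "    const int i = get_global_id(0);\n"
    "    out[i] = a * in[i] + b;\n"
    "}\n";

int main(int argc, char **argv)
{
    /* pass build options on the command line, e.g. "-O0" or "-cl-opt-disable" */
    const char *build_opts = (argc > 1) ? argv[1] : "";
    enum { N = 256 };
    float in[N], expected[N], out[N];
    const float a = 2.0f, b = 1.0f;
    for (int i = 0; i < N; i++) {
        in[i] = (float)i * 0.01f;      /* known input data */
        expected[i] = a * in[i] + b;   /* known-good reference result */
    }

    cl_int err;
    cl_platform_id plat;
    cl_device_id dev;
    CHECK(clGetPlatformIDs(1, &plat, NULL));
    CHECK(clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL));
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);      CHECK(err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);           CHECK(err);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);  CHECK(err);
    CHECK(clBuildProgram(prog, 1, &dev, build_opts, NULL, NULL));
    cl_kernel k = clCreateKernel(prog, "scale_add", &err);                  CHECK(err);

    cl_mem din = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(in), in, &err);                      CHECK(err);
    cl_mem dout = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), NULL, &err); CHECK(err);
    CHECK(clSetKernelArg(k, 0, sizeof(din), &din));
    CHECK(clSetKernelArg(k, 1, sizeof(dout), &dout));
    CHECK(clSetKernelArg(k, 2, sizeof(a), &a));
    CHECK(clSetKernelArg(k, 3, sizeof(b), &b));

    size_t gws = N;
    CHECK(clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, NULL));
    CHECK(clEnqueueReadBuffer(q, dout, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL));

    int bad = 0;
    for (int i = 0; i < N; i++)
        if (fabsf(out[i] - expected[i]) > 1e-5f)
            bad++;
    printf("Different values: %d\nResult: %s\n", bad, bad ? "failed" : "passed");
    return bad ? 1 : 0;
}
```

The same skeleton applies to the real kernel: fill the inputs with known data, launch once, compare against a reference computed on the host (or captured from a known-good runtime), and print pass or fail.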
Thanks for jumping in. I believe some extensive analysis was done that could perhaps be repurposed the way you need it. Asking RvRijsselt to join this discussion.
@b-sumner Yes the issue disappears when optimization is disabled.
Please see the test kernel and results with/without optimization and the difference in results here:
https://github.com/darktable-org/darktable/issues/3756#issuecomment-582244839
Test kernel https://github.com/RvRijsselt/dt-test_locallaplaciancl (huge thanks to RvRijsselt for the effort)
AMDGPU-PRO and early ROCm versions (prior to 1.6, if I recall correctly) did not exhibit this issue.
Hoping this will allow tracking down the regression.
@b-sumner just to bring to the AMD team's attention the extensive testing done with the same OpenCL test kernel linked above: with Intel/NVIDIA the results are the same as with ROCm with optimization disabled, and when optimization is enabled the ROCm OpenCL results are wrong. Clearly this kernel is triggering a bug in rocm-opencl. This has been the status quo with AMD for more than a year now.
Hi @arigit I was away at a meeting last week...sorry for the delay. I thank you and RvRijsselt for putting this test together. However, I think I may not have explained properly what I was wanting. It looks to me like dt_local_laplacian_cl will launch 20 kernels or more. Offhand, I'm guessing that laplacian_assemble is the problematic one. Would it be possible for you to verify that my guess is correct and chop down the test code to just launch that kernel (or whatever the problematic one is) a single time and dump out the results and indicate the incorrect pixels?
Thanks for the quick response. This is all RvRijsselt's work; let me pass the feedback along to the darktable team.
A new program has been added to test only the laplacian_assemble kernel.
Run with `./testsinglekernel -O0` for a passing test, or `-O1` for a failing one (or run both side by side with `pr -T -m <(./testsinglekernel -O0) <(./testsinglekernel -O1)`). :)
Example output and the output images (post_actual_output_1.pfm, left -O0 and right -O1): OutputWithAMDVega56Rocm.txt
If you look at the switch statement in the kernel:
- The first 53 pixels fall in case 0 and match exactly.
- Pixels 54-67 fall in case 1: small error.
- Pixels 68-80 fall in case 2: big negative numbers.
- Pixels 81-96 fall in case 3: big negative numbers.
- Pixels 97-128 fall in case 4: big negative numbers.
- Pixels 129-256 fall in case 4 and match again.
The laplacian() function can be simplified to `return read_imagef(fine, sampleri, (int2)(0,0)).x;` and the final output to `pixel.x = l0;`, and you will still see that these pixels are different. It almost seems like the input images are in an incorrect order...
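For readers following along, here is an illustrative sketch of the reduction being described (this is not darktable's actual laplacian_assemble; the sampler definition, argument names, and kernel signature are assumptions):

```c
/* Illustrative OpenCL C sketch of the reduced kernel described above. */
const sampler_t sampleri = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

/* The full laplacian() upsamples the coarse level and subtracts it from the
 * fine level; the reduced version just returns one sample from 'fine'. */
float laplacian(read_only image2d_t coarse, read_only image2d_t fine)
{
    return read_imagef(fine, sampleri, (int2)(0, 0)).x;
}

kernel void laplacian_assemble_reduced(
    read_only image2d_t input, read_only image2d_t coarse0,
    read_only image2d_t fine0, write_only image2d_t out)
{
    const int x = get_global_id(0), y = get_global_id(1);
    float4 pixel = read_imagef(input, sampleri, (int2)(x, y));
    const float l0 = laplacian(coarse0, fine0);
    /* ...and the final output is reduced to just the l0 term */
    pixel.x = l0;
    write_imagef(out, (int2)(x, y), pixel);
}
```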
Thanks @RvRijsselt !
Here's the output of another run on different hardware (RX 560) and a different ROCm version; it fails in exactly the same manner and place as reported above when optimization is on:
OutputWithAMDRX560ROCM.log.txt
This is the test code: https://github.com/RvRijsselt/dt-test_locallaplaciancl/blob/master/testsinglekernel.c
@b-sumner please advise whether this is enough input for the AMD team to identify the problem. We are hoping for a fix in the next ROCm release.
I have added another test program with an even more minimal example. See my updated repo.
It confirms that the optimization (-O1 at least) causes the switch cases to read the wrong input image.
For another test I changed the switch cases into separate if statements. Then:
- `./testlocallaplaciancl -O0` gives `Result: passed`, as expected
- `./testlocallaplaciancl -O1` gives `Result: passed`, so this suddenly works
- `./testlocallaplaciancl -O2` gives `Result: failed`; it looks like the image reads in the if statements are optimized away
Thank you @RvRijsselt. testlocallaplaciancl is failing for me above -O0. But I see that it is launching multiple kernels. Would it be possible to cut this down even further? It looks like there is a start to this in testeasierkernel.c but it is currently failing even at -O0.
Sorry, I forgot to add a pass/fail check for the easier kernel test. (Update: I just added it.)
These are the executables that are created, the second one does what you need:
- testlocallaplaciancl: the full program that shows pass/fail
- testsinglekernel: executes only a single pass of the laplacian_assemble kernel and shows pass/fail
- testeasierkernel: executes a smaller kernel and outputs pass/fail
I used the easier kernel mainly to debug which images were read. Switch case 0 should read the image with pixel value 0.01, case 1 should read the value 0.11, etc. When executing it with `-O1`, switch cases 0, 1 and 2 all read the same image 0, and cases 3 and 4 read only image 2. It looks like the optimization step causes the wrong image to be read.
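To make the failure mode concrete, here is a rough paraphrase of that pattern in OpenCL C (illustrative only; the real code is testeasierkernel.c in the linked repo, and the argument names and index calculation here are made up):

```c
/* Each case should read from a different input image; with -O1 on the affected
 * ROCm versions, cases 0-2 all returned the image 0 value and cases 3-4 the
 * image 2 value, as described above. */
kernel void select_image(
    read_only  image2d_t img0,   /* host fills with 0.01f */
    read_only  image2d_t img1,   /* host fills with 0.11f */
    read_only  image2d_t img2,   /* host fills with 0.21f */
    read_only  image2d_t img3,   /* host fills with 0.31f */
    read_only  image2d_t img4,   /* host fills with 0.41f */
    write_only image2d_t out,
    const int width)
{
    const sampler_t s = CLK_NORMALIZED_COORDS_FALSE |
                        CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    const int x = get_global_id(0);
    const int hi = 5 * x / width;    /* which image this pixel should sample */
    float v = 0.0f;
    switch (hi) {
        case 0: v = read_imagef(img0, s, (int2)(0, 0)).x; break;
        case 1: v = read_imagef(img1, s, (int2)(0, 0)).x; break;
        case 2: v = read_imagef(img2, s, (int2)(0, 0)).x; break;
        case 3: v = read_imagef(img3, s, (int2)(0, 0)).x; break;
        case 4: v = read_imagef(img4, s, (int2)(0, 0)).x; break;
    }
    write_imagef(out, (int2)(x, 0), (float4)(v, 0.0f, 0.0f, 0.0f));
}
```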
Thanks for all the help. This will allow us to focus immediately on the problem code.
@b-sumner was the AMD team able to find the root cause of the issue? Can we expect this to be addressed soon? The darktable team is on hold because of this. We (AMD GPU + darktable users) are really keen on seeing it fixed; it's been some time.
Sorry, not yet. I'll see if we can get this looked at soon.
@b-sumner Did anything happen in the last two months? We are still waiting for a fix...
We have been able to spend some time on this, but unfortunately not enough yet to root cause and come up with a fix.
We don't have a fix yet, but here is an additional compile option that worked for us: `-Wb,-simplifycfg-sink-common=0`
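In case it helps anyone applying this from their own host code rather than through an environment variable, here is a minimal sketch of appending the option to the `clBuildProgram` options string (the `build_with_workaround` helper and its arguments are placeholders, not darktable code):

```c
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <stdio.h>

/* Build an already-created program with the suggested workaround appended to
 * whatever options the application normally uses. 'base_opts' may be "" or the
 * application's usual option string; 'program' and 'device' are assumed to
 * have been created by the caller. */
static cl_int build_with_workaround(cl_program program, cl_device_id device,
                                    const char *base_opts)
{
    char opts[1024];
    snprintf(opts, sizeof(opts), "%s -Wb,-simplifycfg-sink-common=0", base_opts);
    cl_int err = clBuildProgram(program, 1, &device, opts, NULL, NULL);
    if (err != CL_SUCCESS)
        fprintf(stderr, "clBuildProgram failed: %d\n", (int)err);
    return err;
}
```

Darktable users can instead set the option through the AMD_OCL_BUILD_OPTIONS_APPEND environment variable, as shown a few comments below.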
Hi, I am having similar issues, as posted in #115. For me the problems seem to occur when using the `<<=` operator. I will try the `-Wb,-simplifycfg-sink-common=0` workaround and report whether it helps in my case as well. I think my issue can also serve as a good isolated test case, as it is a relatively small single kernel.
@b-sumner using `AMD_OCL_BUILD_OPTIONS_APPEND="-Wb,-simplifycfg-sink-common=0" darktable` works, but it is just a workaround. Will there be a real fix available anytime soon?
The work is in progress, but not complete. It won't be released "soon", but I will update this issue once upstream LLVM has it.
Any updates on this? Which upstream LLVM release is expected to have the fix?
What is the current state of this bug? We currently operate several MI100 on ROCm 4.3.0 for scientific users, and would like to be sure that the results are correct.
I apologize for not getting back to update this. I believe the fix was in ROCm 3.10 and onwards.
The test program runs fine now with the latest drivers. Thank you for the fix. This issue can be closed.
Darktable local laplacian test
Device: gfx900:xnack-
Hardware version: OpenCL 2.0
Software version: 3513.0 (HSA1.1,LC)
OpenCL C version: OpenCL C 2.0
Build options: -O2
Output data:
-4.23 -2.96 -1.72 -0.49 00.73 01.93 03.12 04.32 05.54 06.82
08.94 10.17 11.34 12.48 13.59 14.68 15.74 16.79 17.85 18.96
20.11 21.23 22.33 23.43 24.53 25.65 26.75 27.82 28.90 30.01
30.39 31.49 32.54 33.57 34.58 35.59 36.59 37.60 38.63 39.74
39.91 41.08 42.22 43.31 44.37 45.40 46.40 47.39 48.41 49.47
49.59 50.68 51.72 52.74 53.74 54.74 55.74 56.76 57.84 58.99
59.32 60.47 61.54 62.56 63.55 64.53 65.49 66.45 67.45 68.52
69.20 70.30 71.37 72.41 73.45 74.50 75.56 76.60 77.66 78.76
79.90 81.02 82.09 83.12 84.12 85.13 86.14 87.18 88.26 89.41
91.97 93.23 94.46 95.67 96.86 98.02 99.18 100.33 101.51 102.73
Different values: 0
Result: passed
Thanks @RvRijsselt for the confirmation, and thanks AMD for the fix. Unfortunately it came a few years late for some of us, but better late than never :)