
Darktable opencl compute results are wrong with rocm (compared to amdgpu-pro and others)

arigit opened this issue Jan 18 '20 • 23 comments

Over roughly the last year of ROCm releases, an OpenCL compute problem has been introduced that affects one of the OpenCL kernels used in Darktable. The devs have shown that the compute results obtained with ROCm differ from those obtained with the AMDGPU-PRO OpenCL stack. The AMDGPU-PRO results match those obtained with other OpenCL compute hardware (e.g. NVIDIA), which seems to confirm that the problem is specific to ROCm. Also, as discussed in the GitHub issues below, very old ROCm releases (from mid or late 2018) did not exhibit this issue.

Here are more extensive details and tests: https://github.com/darktable-org/darktable/issues/3756#

An issue was opened with rocm one year ago, here: https://github.com/RadeonOpenCompute/ROCm/issues/704

However, nobody from the ROCm team has looked into it. I am hereby opening this issue against the OpenCL runtime in the hope that the ROCm experts will take notice and provide guidance on what to test to narrow down the problem and hopefully lead to a resolution. Currently, anyone with AMD GPU hardware doing photo editing with Darktable on a modern Linux distribution is at a severe disadvantage compared to users of non-AMD hardware.

arigit avatar Jan 18 '20 18:01 arigit

@JasonTTang or @kzhuravl can we get help from AMD to fix that issue?

cryptomilk avatar Jan 26 '20 14:01 cryptomilk

I believe I saw that the issue disappears when "-O0" or "-cl-opt-disable" is added to the BuildProgram options. Is that correct? Anyway, what would be ideal here, to quickly focus attention on the problem, is:

  • identify the one (or more?) kernels producing unexpected results
  • write a minimal standalone app which launches the problem kernel with known input data whose expected result is known, and
  • compares the expected result to the standalone app's result and reports pass or fail

I.e. a unit test for the kernel (or each kernel?). Ideally the developer would already have such a test for each kernel used by the application.

With such a test (and without any expertise in this application area or of this application) we can quickly evaluate the problem, and check that our fix actually works.
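
For illustration, here is a minimal sketch of what such a standalone check could look like, written against the plain OpenCL C API. The kernel, data and expected values are placeholders, not the actual Darktable code, and the build options are passed on the command line so the same binary can be run with and without optimization:

    /* Minimal standalone kernel unit test sketch (placeholder kernel, not the
     * actual Darktable one).  Build with e.g.: gcc test.c -lOpenCL */
    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>
    #include <math.h>
    #include <stdio.h>

    static const char *src =
        "__kernel void scale(__global const float *in, __global float *out)\n"
        "{ int i = get_global_id(0); out[i] = 2.0f * in[i]; }\n";

    int main(int argc, char **argv)
    {
        const char *opts = (argc > 1) ? argv[1] : "";  /* e.g. "-O0" or "-cl-opt-disable" */
        float in[256], expected[256], out[256];
        for (int i = 0; i < 256; ++i) { in[i] = (float)i; expected[i] = 2.0f * in[i]; }

        cl_platform_id plat; cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        if (clBuildProgram(prog, 1, &dev, opts, NULL, NULL) != CL_SUCCESS) {
            fprintf(stderr, "build failed\n");
            return 2;
        }
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        cl_mem bin  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     sizeof(in), in, NULL);
        cl_mem bout = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &bin);
        clSetKernelArg(k, 1, sizeof(cl_mem), &bout);

        size_t gws = 256;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, bout, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL);

        /* Compare against the known-good result and report pass/fail. */
        int bad = 0;
        for (int i = 0; i < 256; ++i)
            if (fabsf(out[i] - expected[i]) > 1e-5f) ++bad;
        printf("Different values: %d\nResult: %s\n", bad, bad ? "failed" : "passed");
        return bad ? 1 : 0;
    }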

b-sumner avatar Jan 27 '20 15:01 b-sumner

Thanks for jumping in. I believe some extensive analysis was done that could perhaps be repurposed the way you need it. Asking @RvRijsselt to join this discussion.

arigit avatar Jan 27 '20 18:01 arigit

@b-sumner Yes the issue disappears when optimization is disabled.

Please see the test kernel and results with/without optimization and the difference in results here:

https://github.com/darktable-org/darktable/issues/3756#issuecomment-582244839

Test kernel: https://github.com/RvRijsselt/dt-test_locallaplaciancl (huge thanks to RvRijsselt for the effort).

AMDGPU-PRO and early ROCm versions (prior to 1.6, if I recall correctly) did not exhibit this issue.

Hoping this will allow tracking down the regression.

arigit avatar Feb 05 '20 05:02 arigit

@b-sumner just to bring to the AMD team's attention the extensive testing done with the same OpenCL test kernel linked above: it shows that on Intel/NVIDIA the results match those of ROCm with optimization disabled, and that when optimization is enabled the ROCm OpenCL results are wrong. Clearly this kernel is triggering a bug in rocm-opencl. This has been the status quo with AMD for more than a year now.

arigit avatar Feb 11 '20 22:02 arigit

Hi @arigit I was away at a meeting last week... sorry for the delay. I thank you and RvRijsselt for putting this test together. However, I think I may not have explained properly what I wanted. It looks to me like dt_local_laplacian_cl will launch 20 kernels or more. Offhand, I'm guessing that laplacian_assemble is the problematic one. Would it be possible for you to verify that my guess is correct and chop the test code down to launching just that kernel (or whichever one is problematic) a single time, dump out the results, and indicate the incorrect pixels?

b-sumner avatar Feb 11 '20 23:02 b-sumner

Thanks for the quick response - this is all RvRijsselt work; let me pass along the feedback to the darktable team

arigit avatar Feb 12 '20 03:02 arigit

A new program has been added to test only the laplacian_assemble kernel. Run it with ./testsinglekernel -O0 for a passing test, or -O1 for a failing one (or run both side by side with pr -T -m <(./testsinglekernel -O0) <(./testsinglekernel -O1)).

Example output and the output images (post_actual_output_1.pfm, left -O0 and right -O1): OutputWithAMDVega56Rocm.txt


If you look at the switch statement in the kernel:

  • The first 53 pixels fall in case 0 and match exactly.
  • Pixels 54-67 fall in case 1: small error.
  • Pixels 68-80 fall in case 2: big negative numbers.
  • Pixels 81-96 fall in case 3: big negative numbers.
  • Pixels 97-128 fall in case 4: big negative numbers.
  • Pixels 129-256 fall in case 4 and match again.

The laplacian() function can be simplified to return read_imagef(fine, sampleri, (int2)(0,0)).x; and the final output to pixel.x = l0; and you will still see that these pixels are different. It almost seems like the input images are in an incorrect order...
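
For reference, the pattern under discussion looks roughly like the sketch below. The image names, level count and selection formula are illustrative placeholders, not the actual darktable kernel, but the structure (a per-pixel switch choosing which image to read) is the same:

    /* Illustrative OpenCL C sketch: a per-pixel switch that selects which
     * input image to read.  Names and the selection formula are placeholders. */
    constant sampler_t sampleri = CLK_NORMALIZED_COORDS_FALSE |
                                  CLK_ADDRESS_CLAMP_TO_EDGE   |
                                  CLK_FILTER_NEAREST;

    kernel void assemble_sketch(
        read_only  image2d_t coarse,   /* its value decides which case a pixel falls in */
        read_only  image2d_t buf0,
        read_only  image2d_t buf1,
        read_only  image2d_t buf2,
        read_only  image2d_t buf3,
        read_only  image2d_t buf4,
        write_only image2d_t out)
    {
      const int x = get_global_id(0), y = get_global_id(1);
      const float g = read_imagef(coarse, sampleri, (int2)(x, y)).x;
      const int hi = clamp((int)(g * 5.0f), 0, 4);
      float v;
      switch (hi)                      /* each case should read a different image */
      {
        case 0:  v = read_imagef(buf0, sampleri, (int2)(x, y)).x; break;
        case 1:  v = read_imagef(buf1, sampleri, (int2)(x, y)).x; break;
        case 2:  v = read_imagef(buf2, sampleri, (int2)(x, y)).x; break;
        case 3:  v = read_imagef(buf3, sampleri, (int2)(x, y)).x; break;
        default: v = read_imagef(buf4, sampleri, (int2)(x, y)).x; break;
      }
      write_imagef(out, (int2)(x, y), (float4)(v, 0.0f, 0.0f, 0.0f));
    }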

RvRijsselt avatar Feb 14 '20 00:02 RvRijsselt

Thanks @RvRijsselt !

Here's another run's output on different hardware (an RX 560) and a different ROCm version; it fails in exactly the same manner, at exactly the same place, as reported above when optimization is on:

OutputWithAMDRX560ROCM.log.txt

This is the test code: https://github.com/RvRijsselt/dt-test_locallaplaciancl/blob/master/testsinglekernel.c

@b-sumner please advise whether this is enough input for the AMD team to identify the problem; hoping we can get a fix into the next ROCm release.

arigit avatar Feb 15 '20 20:02 arigit

I have added another test program with an even more minimal example. See my updated repo.

It confirms that the optimization (-O1 at least) causes the switch cases to read the wrong input image.

For another test I changed the switch cases into separate if statements; then:

  • ./testlocallaplaciancl -O0 gives Result: passed, as expected
  • ./testlocallaplaciancl -O1 gives Result: passed (this suddenly works)
  • ./testlocallaplaciancl -O2 gives Result: failed (it looks like the image reads in the if statements are optimized away)
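
For reference, that variant would look roughly like this when written against the illustrative kernel sketched earlier in this thread (placeholder names again; only the selection logic changes):

    /* Same selection as the switch in the sketch above, unrolled into
     * independent if statements (illustrative only). */
    float v = 0.0f;
    if (hi == 0) v = read_imagef(buf0, sampleri, (int2)(x, y)).x;
    if (hi == 1) v = read_imagef(buf1, sampleri, (int2)(x, y)).x;
    if (hi == 2) v = read_imagef(buf2, sampleri, (int2)(x, y)).x;
    if (hi == 3) v = read_imagef(buf3, sampleri, (int2)(x, y)).x;
    if (hi >= 4) v = read_imagef(buf4, sampleri, (int2)(x, y)).x;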

RvRijsselt avatar Feb 17 '20 22:02 RvRijsselt

Thank you @RvRijsselt. testlocallaplaciancl is failing for me above -O0. But I see that it is launching multiple kernels. Would it be possible to cut this down even further? It looks like there is a start to this in testeasierkernel.c but it is currently failing even at -O0.

b-sumner avatar Feb 17 '20 22:02 b-sumner

Sorry, I forgot to add a pass/fail check for the easier kernel test. (Update: I just added it.)

These are the executables that are created; the second one does what you need:

  • testlocallaplaciancl: the full program that shows pass/fail
  • testsinglekernel: executes only a single pass of the laplacian_assemble kernel and shows pass/fail
  • testeasierkernel: executes a smaller kernel and outputs pass/fail

I used the easier kernel mainly to debug which images were read: switch case 0 should read the image with pixel value 0.01, case 1 should read value 0.11, etc. When executing it with -O1, switch cases 0, 1 and 2 all read the same image 0, and cases 3 and 4 read only image 2. It looks like the optimization step causes the wrong image to be read.
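
As a sketch of how a pass/fail check can exploit those marker values (function, variable names and the tolerance here are hypothetical, not the actual test code):

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical host-side check for the "easier" kernel described above:
     * case k is expected to read the image filled with the marker value
     * 0.01 + 0.10*k, so reading the wrong image shows up as a wrong marker. */
    static int check_markers(const float *output, const int *expected_case, int n)
    {
        int failures = 0;
        for (int i = 0; i < n; ++i) {
            const float expected = 0.01f + 0.10f * (float)expected_case[i];
            if (fabsf(output[i] - expected) > 1e-6f) {
                printf("pixel %d: expected %.2f (image %d), got %.2f\n",
                       i, expected, expected_case[i], output[i]);
                ++failures;
            }
        }
        printf("Different values: %d\nResult: %s\n", failures, failures ? "failed" : "passed");
        return failures;
    }

Any pixel whose marker does not match the case it should have taken points directly at the image that was actually read.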

RvRijsselt avatar Feb 17 '20 23:02 RvRijsselt

Thanks for all the help. This will allow us to focus immediately on the problem code.

b-sumner avatar Feb 18 '20 00:02 b-sumner

@b-sumner was the AMD team able to find the root cause of the issue? Can we expect this to be addressed sometime soon? The darktable team is on hold on this. We (AMD GPU + darktable users) are really keen on seeing this addressed; it's been some time.

arigit avatar Mar 19 '20 14:03 arigit

Sorry, not yet. I'll see if we can get this looked at soon.

b-sumner avatar Mar 19 '20 17:03 b-sumner

@b-sumner Did anything happen in the last two months? We are still waiting for a fix...

cryptomilk avatar May 13 '20 15:05 cryptomilk

We have been able to spend some time on this, but unfortunately not enough yet to find the root cause and come up with a fix.

b-sumner avatar May 13 '20 16:05 b-sumner

We don't have a fix yet, but here is an additional compile option that worked for us: -Wb,-simplifycfg-sink-common=0
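
For applications that build their OpenCL programs themselves, one way to apply it would be to append the option to the string passed to clBuildProgram. A sketch, where the first option is just a placeholder and program/device are the objects the application has already created:

    /* Sketch: append the workaround to whatever build options the application
     * already passes (the first option here is only a placeholder). */
    const char *build_opts = "-cl-fast-relaxed-math -Wb,-simplifycfg-sink-common=0";
    cl_int err = clBuildProgram(program, 1, &device, build_opts, NULL, NULL);

It can also be injected without rebuilding the application via the AMD_OCL_BUILD_OPTIONS_APPEND environment variable, as shown further down in this thread.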

b-sumner avatar May 13 '20 23:05 b-sumner

Hi, I am having similar issues, as posted in #115. For me the problems seem to occur when using the <<= operator. I will try the -Wb,-simplifycfg-sink-common=0 workaround and report whether it helps in my case as well.

I think my issues can also be used as a good isolated test case as it is a relatively small single kernel.

Dantali0n avatar May 17 '20 07:05 Dantali0n

@b-sumner using:

AMD_OCL_BUILD_OPTIONS_APPEND="-Wb,-simplifycfg-sink-common=0" darktable

works, but it is just a workaround. Will there be a real fix available anytime soon?

cryptomilk avatar Jul 17 '20 06:07 cryptomilk

The work is in progress but not complete. It won't be released "soon", but I will update this thread when upstream LLVM has it.

b-sumner avatar Jul 17 '20 14:07 b-sumner

Any updates on this? Which upstream LLVM release is expected to have the fix?

Sturmflut avatar Jan 30 '21 09:01 Sturmflut

What is the current state of this bug? We currently operate several MI100 on ROCm 4.3.0 for scientific users, and would like to be sure that the results are correct.

Sturmflut avatar Sep 01 '21 19:09 Sturmflut

I apologize for not getting back to update this. I believe the fix was in ROCm 3.10 and onwards.

b-sumner avatar Mar 27 '23 21:03 b-sumner

The test program runs fine now with the latest drivers. Thank you for the fix. This issue can be closed.

Darktable local laplacian test
  Device: gfx900:xnack-
  Hardware version: OpenCL 2.0
  Software version: 3513.0 (HSA1.1,LC)
  OpenCL C version: OpenCL C 2.0
  Build options: -O2
Output data:
-4.23  -2.96  -1.72  -0.49  00.73  01.93  03.12  04.32  05.54  06.82
08.94  10.17  11.34  12.48  13.59  14.68  15.74  16.79  17.85  18.96
20.11  21.23  22.33  23.43  24.53  25.65  26.75  27.82  28.90  30.01
30.39  31.49  32.54  33.57  34.58  35.59  36.59  37.60  38.63  39.74
39.91  41.08  42.22  43.31  44.37  45.40  46.40  47.39  48.41  49.47
49.59  50.68  51.72  52.74  53.74  54.74  55.74  56.76  57.84  58.99
59.32  60.47  61.54  62.56  63.55  64.53  65.49  66.45  67.45  68.52
69.20  70.30  71.37  72.41  73.45  74.50  75.56  76.60  77.66  78.76
79.90  81.02  82.09  83.12  84.12  85.13  86.14  87.18  88.26  89.41
91.97  93.23  94.46  95.67  96.86  98.02  99.18  100.33  101.51  102.73
Different values: 0
Result: passed

RvRijsselt avatar Dec 11 '23 23:12 RvRijsselt

Thanks @RvRijsselt for the confirmation, and thanks AMD for the fix. Unfortunately it came a few years too late for some of us, but better late than never :)

arigit avatar Dec 14 '23 00:12 arigit