ROCm-OpenCL-Runtime
Darktable OpenCL compute results are wrong with ROCm (compared to amdgpu-pro and others)
Over roughly the last year of ROCm releases, an OpenCL compute problem has been introduced that affects one of the OpenCL kernels used in Darktable. The developers have shown that the compute results obtained with ROCm differ from those obtained with the AMDGPU-PRO OpenCL stack. The AMDGPU-PRO results match those obtained with other OpenCL compute hardware (e.g. NVIDIA), which seems to confirm that the problem is specific to ROCm. Also, as discussed in the GitHub issues below, very old ROCm releases (from mid to late 2018) did not exhibit this issue.
Here are more extensive details and tests: https://github.com/darktable-org/darktable/issues/3756#
An issue was opened against ROCm a year ago, here: https://github.com/RadeonOpenCompute/ROCm/issues/704
However, nobody from the ROCm team looked into it. I am opening this issue against the OpenCL runtime in the hope that the ROCm experts will take notice and provide guidance on what to test to narrow down this problem and hopefully reach a resolution. Currently, anyone doing photo editing with Darktable on AMD GPU hardware on a modern Linux distribution is at a severe disadvantage compared to users with non-AMD hardware.
@JasonTTang or @kzhuravl can we get help from AMD to fix this issue?
I believe I saw that the issue disappears when "-O0" or "-cl-opt-disable" is added to the BuildProgram options. Is that correct? Anyway, what would be ideal here, to quickly focus attention on the problem, is to:
- identify the kernel (or kernels?) producing unexpected results
- write a minimal standalone app that launches the problem kernel with known input data, compares the output against the known-good result, and reports pass or fail
I.e. a unit test for the kernel (or each kernel?). Ideally the developer would already have such a test for each kernel used by the application.
With such a test (and without any expertise in this application area or of this application) we can quickly evaluate the problem, and check that our fix actually works.
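For illustration only, here is a rough sketch of such a single-kernel pass/fail test built around a made-up `scale_add` kernel (this is not darktable code; the kernel, tolerance, and `CHECK` macro are hypothetical, and error handling is trimmed to keep the sketch short):

```c
/* Sketch: run known input through one kernel, compare against a host-computed
 * reference, print pass/fail. Build options come from the command line. */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <math.h>
#include <stdio.h>

#define CHECK(e) do { if ((e) != CL_SUCCESS) { \
    fprintf(stderr, "OpenCL error %d\n", (int)(e)); return 1; } } while (0)

static const char *src =
    "__kernel void scale_add(__global const float *in, __global float *out,\n"
    "                        const float a, const float b)\n"
    "{\n"
    "    const int i = get_global_id(0);\n"
    "    out[i] = a * in[i] + b;\n"
    "}\n";

int main(int argc, char **argv)
{
    /* pass build options on the command line, e.g. "-O0" or "-cl-opt-disable" */
    const char *build_opts = (argc > 1) ? argv[1] : "";
    enum { N = 256 };
    float in[N], expected[N], out[N];
    const float a = 2.0f, b = 1.0f;
    for (int i = 0; i < N; i++) {
        in[i] = (float)i * 0.01f;      /* known input data */
        expected[i] = a * in[i] + b;   /* known-good reference result */
    }

    cl_int err;
    cl_platform_id plat;
    cl_device_id dev;
    CHECK(clGetPlatformIDs(1, &plat, NULL));
    CHECK(clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL));
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);      CHECK(err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);           CHECK(err);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);  CHECK(err);
    CHECK(clBuildProgram(prog, 1, &dev, build_opts, NULL, NULL));
    cl_kernel k = clCreateKernel(prog, "scale_add", &err);                  CHECK(err);

    cl_mem din = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(in), in, &err);                      CHECK(err);
    cl_mem dout = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), NULL, &err); CHECK(err);
    CHECK(clSetKernelArg(k, 0, sizeof(din), &din));
    CHECK(clSetKernelArg(k, 1, sizeof(dout), &dout));
    CHECK(clSetKernelArg(k, 2, sizeof(a), &a));
    CHECK(clSetKernelArg(k, 3, sizeof(b), &b));

    size_t gws = N;
    CHECK(clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, NULL));
    CHECK(clEnqueueReadBuffer(q, dout, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL));

    int bad = 0;
    for (int i = 0; i < N; i++)
        if (fabsf(out[i] - expected[i]) > 1e-5f)
            bad++;
    printf("Different values: %d\nResult: %s\n", bad, bad ? "failed" : "passed");
    return bad ? 1 : 0;
}
```

The same skeleton applies to the real kernel: fill the inputs with known data, launch once, compare against a reference computed on the host (or captured from a known-good runtime), and print pass or fail.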
Thanks for jumping in. I believe some extensive analysis was done that could perhaps be repurposed the way you need it. Asking RvRijsselt to join this discussion.
@b-sumner Yes the issue disappears when optimization is disabled.
Please see the test kernel and results with/without optimization and the difference in results here:
https://github.com/darktable-org/darktable/issues/3756#issuecomment-582244839
Test kernel https://github.com/RvRijsselt/dt-test_locallaplaciancl (huge thanks to RvRijsselt for the effort)
AMDGPU-PRO and early ROCm versions (prior to 1.6, if I recall correctly) did not exhibit this issue.
Hoping this will allow tracking down the regression.
@b-sumner just to bring to the AMD team's attention the extensive testing done with the same OpenCL test kernel linked above: with Intel/NVIDIA the results are the same as with ROCm with optimization disabled, and when optimization is enabled the ROCm OpenCL results are wrong. Clearly this kernel is triggering a bug in rocm-opencl. This has been the status quo with AMD for more than a year now.
Hi @arigit I was away at a meeting last week...sorry for the delay. I thank you and RvRijsselt for putting this test together. However, I think I may not have explained properly what I was wanting. It looks to me like dt_local_laplacian_cl will launch 20 kernels or more. Offhand, I'm guessing that laplacian_assemble is the problematic one. Would it be possible for you to verify that my guess is correct and chop down the test code to just launch that kernel (or whatever the problematic one is) a single time and dump out the results and indicate the incorrect pixels?
Thanks for the quick response. This is all RvRijsselt's work; let me pass the feedback along to the darktable team.
A new program has been added to test only the laplacian_assemble kernel.
Run with `./testsinglekernel -O0` for a passing test, or `-O1` for a failing one (or run both side by side with `pr -T -m <(./testsinglekernel -O0) <(./testsinglekernel -O1)`). :)
Example output and the output images (post_actual_output_1.pfm, left -O0 and right -O1): OutputWithAMDVega56Rocm.txt
If you look at the switch statement in the kernel:
- The first 53 pixels fall in case 0 and match exactly.
- Pixels 54-67 fall in case 1: small error.
- Pixels 68-80 fall in case 2: big negative numbers.
- Pixels 81-96 fall in case 3: big negative numbers.
- Pixels 97-128 fall in case 4: big negative numbers.
- Pixels 129-256 fall in case 4 and match again.
The laplacian() function can be simplified to `return read_imagef(fine, sampleri, (int2)(0,0)).x;` and the final output to `pixel.x = l0;`, and you will still see that these pixels are different. It almost seems like the input images are in an incorrect order...
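For readers following along, here is an illustrative sketch of the reduction being described (this is not darktable's actual laplacian_assemble; the sampler definition, argument names, and kernel signature are assumptions):

```c
/* Illustrative OpenCL C sketch of the reduced kernel described above. */
const sampler_t sampleri = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

/* The full laplacian() upsamples the coarse level and subtracts it from the
 * fine level; the reduced version just returns one sample from 'fine'. */
float laplacian(read_only image2d_t coarse, read_only image2d_t fine)
{
    return read_imagef(fine, sampleri, (int2)(0, 0)).x;
}

kernel void laplacian_assemble_reduced(
    read_only image2d_t input, read_only image2d_t coarse0,
    read_only image2d_t fine0, write_only image2d_t out)
{
    const int x = get_global_id(0), y = get_global_id(1);
    float4 pixel = read_imagef(input, sampleri, (int2)(x, y));
    const float l0 = laplacian(coarse0, fine0);
    /* ...and the final output is reduced to just the l0 term */
    pixel.x = l0;
    write_imagef(out, (int2)(x, y), pixel);
}
```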
Thanks @RvRijsselt !
Here's the output of another run on different hardware (RX 560) and a different ROCm version; it fails in exactly the same manner and place as reported above when optimization is on:
OutputWithAMDRX560ROCM.log.txt
This is the test code: https://github.com/RvRijsselt/dt-test_locallaplaciancl/blob/master/testsinglekernel.c
@b-sumner please advise whether this is enough input for the AMD team to identify the problem. We are hoping for a fix in the next ROCm release.
I have added another test program with an even more minimal example. See my updated repo.
It confirms that the optimization (-O1 at least) causes the switch cases to read the wrong input image.
For another test I changed the switch cases into separate if statements. Then:
- `./testlocallaplaciancl -O0` gives `Result: passed`, as expected
- `./testlocallaplaciancl -O1` gives `Result: passed`, so this suddenly works
- `./testlocallaplaciancl -O2` gives `Result: failed`; it looks like the image reads in the if statements are optimized away
Thank you @RvRijsselt. testlocallaplaciancl is failing for me above -O0. But I see that it is launching multiple kernels. Would it be possible to cut this down even further? It looks like there is a start to this in testeasierkernel.c but it is currently failing even at -O0.
Sorry, I forgot to add a pass/fail check for the easier kernel test. (Update: I just added it.)
These are the executables that are created, the second one does what you need:
- testlocallaplaciancl: the full program that shows pass/fail
- testsinglekernel: executes only a single pass of the laplacian_assemble kernel and shows pass/fail
- testeasierkernel: executes a smaller kernel and outputs pass/fail
I used the easier kernel mainly to debug which images were read. Switch case 0 should read the image with pixel value 0.01, case 1 should read the value 0.11, etc. When executing it with `-O1`, switch cases 0, 1 and 2 all read the same image 0, and cases 3 and 4 read only image 2. It looks like the optimization step causes the wrong image to be read.
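To make the failure mode concrete, here is a rough paraphrase of that pattern in OpenCL C (illustrative only; the real code is testeasierkernel.c in the linked repo, and the argument names and index calculation here are made up):

```c
/* Each case should read from a different input image; with -O1 on the affected
 * ROCm versions, cases 0-2 all returned the image 0 value and cases 3-4 the
 * image 2 value, as described above. */
kernel void select_image(
    read_only  image2d_t img0,   /* host fills with 0.01f */
    read_only  image2d_t img1,   /* host fills with 0.11f */
    read_only  image2d_t img2,   /* host fills with 0.21f */
    read_only  image2d_t img3,   /* host fills with 0.31f */
    read_only  image2d_t img4,   /* host fills with 0.41f */
    write_only image2d_t out,
    const int width)
{
    const sampler_t s = CLK_NORMALIZED_COORDS_FALSE |
                        CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    const int x = get_global_id(0);
    const int hi = 5 * x / width;    /* which image this pixel should sample */
    float v = 0.0f;
    switch (hi) {
        case 0: v = read_imagef(img0, s, (int2)(0, 0)).x; break;
        case 1: v = read_imagef(img1, s, (int2)(0, 0)).x; break;
        case 2: v = read_imagef(img2, s, (int2)(0, 0)).x; break;
        case 3: v = read_imagef(img3, s, (int2)(0, 0)).x; break;
        case 4: v = read_imagef(img4, s, (int2)(0, 0)).x; break;
    }
    write_imagef(out, (int2)(x, 0), (float4)(v, 0.0f, 0.0f, 0.0f));
}
```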
Thanks for all the help. This will allow us to focus immediately on the problem code.
@b-sumner was the AMD team able to find the root cause of the issue? Can we expect this to be addressed soon? The darktable team is on hold because of this. We (AMD GPU + darktable users) are really keen on seeing it fixed; it's been some time.
Sorry, not yet. I'll see if we can get this looked at soon.
@b-sumner Did anything happen in the last two months? We are still waiting for a fix...
We have been able to spend some time on this, but unfortunately not enough yet to root cause and come up with a fix.
We don't have a fix yet, but here is an additional compile option that worked for us: `-Wb,-simplifycfg-sink-common=0`
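In case it helps anyone applying this from their own host code rather than through an environment variable, here is a minimal sketch of appending the option to the `clBuildProgram` options string (the `build_with_workaround` helper and its arguments are placeholders, not darktable code):

```c
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <stdio.h>

/* Build an already-created program with the suggested workaround appended to
 * whatever options the application normally uses. 'base_opts' may be "" or the
 * application's usual option string; 'program' and 'device' are assumed to
 * have been created by the caller. */
static cl_int build_with_workaround(cl_program program, cl_device_id device,
                                    const char *base_opts)
{
    char opts[1024];
    snprintf(opts, sizeof(opts), "%s -Wb,-simplifycfg-sink-common=0", base_opts);
    cl_int err = clBuildProgram(program, 1, &device, opts, NULL, NULL);
    if (err != CL_SUCCESS)
        fprintf(stderr, "clBuildProgram failed: %d\n", (int)err);
    return err;
}
```

Darktable users can instead set the option through the AMD_OCL_BUILD_OPTIONS_APPEND environment variable, as shown a few comments below.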
Hi, I am having similar issues, as posted in #115. For me the problems seem to occur when using the `<<=` operator. I will try the `-Wb,-simplifycfg-sink-common=0` workaround and report whether it helps in my case as well. I think my issue can also serve as a good isolated test case, as it is a relatively small single kernel.
@b-sumner using `AMD_OCL_BUILD_OPTIONS_APPEND="-Wb,-simplifycfg-sink-common=0" darktable` works, but it is just a workaround. Will there be a real fix available anytime soon?
The work is in progress, but not complete. It won't be released "soon", but I will update this issue once upstream LLVM has it.
Any updates on this? Which upstream LLVM release is expected to have the fix?
What is the current state of this bug? We currently operate several MI100 on ROCm 4.3.0 for scientific users, and would like to be sure that the results are correct.
I apologize for not getting back to update this. I believe the fix was in ROCm 3.10 and onwards.
The test program runs fine now with the latest drivers. Thank you for the fix. This issue can be closed.
Darktable local laplacian test
Device: gfx900:xnack-
Hardware version: OpenCL 2.0
Software version: 3513.0 (HSA1.1,LC)
OpenCL C version: OpenCL C 2.0
Build options: -O2
Output data:
-4.23 -2.96 -1.72 -0.49 00.73 01.93 03.12 04.32 05.54 06.82
08.94 10.17 11.34 12.48 13.59 14.68 15.74 16.79 17.85 18.96
20.11 21.23 22.33 23.43 24.53 25.65 26.75 27.82 28.90 30.01
30.39 31.49 32.54 33.57 34.58 35.59 36.59 37.60 38.63 39.74
39.91 41.08 42.22 43.31 44.37 45.40 46.40 47.39 48.41 49.47
49.59 50.68 51.72 52.74 53.74 54.74 55.74 56.76 57.84 58.99
59.32 60.47 61.54 62.56 63.55 64.53 65.49 66.45 67.45 68.52
69.20 70.30 71.37 72.41 73.45 74.50 75.56 76.60 77.66 78.76
79.90 81.02 82.09 83.12 84.12 85.13 86.14 87.18 88.26 89.41
91.97 93.23 94.46 95.67 96.86 98.02 99.18 100.33 101.51 102.73
Different values: 0
Result: passed
Thanks @RvRijsselt for the confirmation, and thanks AMD for the fix. Unfortunately it came a few years late for some of us, but better late than never :)