stripes with opencl and denoise
Describe the bug
The issue is similar to #15589 and #16378.
Using darktable master (compiled from source, Debian/testing, rocm) I get a repeated pixel band on images. I have narrowed it down to the denoise (profiled) module (see screenshot).
The band changes appearance when I zoom in/out, but remains on the right side.
Using e.g. RCD instead of LMMSE for demosaicing either increases the band size or moves it to a corner (similar to #16378).
Steps to reproduce
This is not image-specific, but I can upload the image and the xmp if requested (screenshot shows enabled modules).
Expected behavior
No response
Logfile | Screenshot | Screencast
Commit
No response
Where did you obtain darktable from?
self compiled
darktable version
4.9.0+78~g77474ec716
What OS are you using?
Linux
What is the version of your OS?
Debian testing
Describe your system?
Integrated GPU. ROCM log attached.
Are you using OpenCL GPU in darktable?
Yes
If yes, what is the GPU card and driver?
AMD Ryzen 5 5600H with Radeon Graphics (Vega 7, maybe?)
Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip
No response
Oh, unfortunately AMD is notorious for unstable drivers on certain devices, and the supported models seem to change ...
To investigate this issue we would at least need a log taken with the '-d pipe -d opencl' options.
Please also run 'clinfo' and share its output.
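For reference, the requested diagnostics can be collected like this (assuming darktable and clinfo are installed; the log file names are arbitrary):

```shell
# Start darktable with pixelpipe and OpenCL tracing enabled,
# capturing both stdout and stderr to a file:
darktable -d pipe -d opencl > darktable-debug.log 2>&1

# Dump the full OpenCL platform/device report:
clinfo > clinfo.log 2>&1
```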
Thanks for getting back to me so quickly. Attaching both the -d pipe -d opencl log and the clinfo output.
ROCm does not really support integrated GPUs. The drivers report only 512 MB of RAM, and report it as dedicated memory instead of shared, which is incorrect. Karol pushed an update to Mesa that fixes the incorrect memory-size reporting, but the memory is still reported as dedicated.
Also, your dt resources are set to "small"; I think you should use "default".
@gi-man: I get the bug at all resource settings; "small" is a remnant from experimentation (it was a suggestion in a similar issue).
So, if I understand correctly: this kind of integrated GPU is useless for darktable's purposes, and it just happens to be misreported? Or can I still get some benefit out of it?
So, if I understand correctly: this kind of integrated GPU is useless for the purposes of Darktable
I think you will have to wait until (hopefully) AMD fixes the driver :-)
Anyway, thanks for the logs; some details pointed to #17203
@tpapp how did you install the GPU driver in Ubuntu? Are you using the amdgpu-install package?
Just telling: this constant flow of incoming suspected dt issues related to one OpenCL driver vendor is outstanding and driving me mad.
- The chances of any dt developer, including me, investigating AMD OpenCL issues - leaving out rusticl, as we have good support there - are diminishing.
- Maybe we should disable that driver by default and mark it as "use at your own risk" in the preferences?
- Or is it not the vendor but the distributions?
@jenshannoschwalm sorry if I have triggered your AMD OpenCL pain-point again with my already closed issue and this post. I'm actually really grateful for all your amazing OpenCL subsystem contributions and support in darktable :bow:
<offtopic>The sole purpose of my post above was to find out more about the AMD driver / compatibility situation, as I'm planning to buy a laptop with an AMD Ryzen 7 8845HS CPU with an integrated AMD GPU. If apps such as darktable cannot take advantage of the GPU processing, this would be a deal-breaker for me. But judging from another issue, rusticl already seems to work well with those GPUs.</offtopic>
sorry if I have triggered your AMD OpenCL pain-point again ...
You didn't !
Unfortunately there is no core dev using (1) AMD hardware, not even to speak of (2) rolling-release distros or (3) not-up-to-date ones based on Ubuntu 22.xx. So we just get "vague" reports.
In case (2) there is almost never a chance to reproduce, as it's not clear what people did in detail.
Also, we couldn't find a "trigger" over the last year or so. Before that it was mostly a problem of handling NaNs; AMD drivers just don't seem to care. Another trigger seemed to be the interpolator for read_imagef.
I have an idea, as you seem to be active on pixls too. Would you be able and interested in doing a sort of review of AMD OpenCL there? The questions would be: a) problems with the AMD driver? b) performance of the AMD driver vs. rustiCL? Then we might switch to rustiCL as the default.
@jenshannoschwalm of course I'm interested in helping, whatever it takes to make dt run smoothly on all common platforms, including AMD GPUs. Do we already have a pixls.us thread for collecting early feedback on experimental and pre-release candidates from people who have a broad variety of OSes and GPUs? Is it already possible to create release builds from within PRs using the GitHub workflows? I think @darix has been doing release builds and maybe can answer this question.
@da-phil: I have installed it from distro packages, specifically:

```
tamas@tamas ~ % dpkg -l '*rocm*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                Version                   Architecture Description
+++-===================-=========================-============-===================================================
ii  rocm-device-libs    5.2.3-3                   amd64        AMD specific device-side language runtime libraries
ii  rocm-device-libs-17 6.0+git20231212.5a852ed-2 amd64        AMD specific device-side language runtime libraries
ii  rocm-opencl-icd     5.7.1-4                   amd64        ROCm implementation of OpenCL API - ICD runtime
ii  rocminfo            5.7.1-3                   amd64        ROCm Application for Reporting System Info
```
@jenshannoschwalm: Is there a test suite I could run on my machine that would produce useful information? I agree that debugging from artifacts one sees in a GUI is difficult, but a test suite could compare actual outputs against expected outputs.
I've shared this in another post, but I'll repeat it here. The drivers for AMD integrated GPUs (APUs) have issues, and I don't think they are officially supported by ROCm. The issues start with incorrect memory reporting/management. I have an APU in my system that I use to test dt once in a while. I'm on Fedora, ROCm is at 6.1.2, and it still has issues. Karol recently pushed a PR to address the card reporting only 500 MB of memory. It still reports the memory as dedicated instead of shared/unified. I haven't tested this change yet.
I've seen very few reports/issues with dt when a dedicated AMD GPU is used. Until the driver issues are fixed upstream for the APUs and the fixes reach the distros (e.g. tpapp is on 5.7.1), I think we should blacklist the APUs on Linux systems. This might be hard to do, since we would need to uniquely identify an APU vs. a GPU.
By the way, I have a Windows 11 partition and have tested dt there with CPU only vs. using the APU. The APU drivers work on Windows, but the performance improvement is negligible.
Is there a test suite I could run on my machine ...
Yes, we have the integration suite in the dt code base that would do this. It also runs nightly and reports regressions.
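For reference, the suite can be invoked roughly like this (directory and script names assumed from a recent master checkout; they may differ in your version):

```shell
# From a darktable source checkout:
cd src/tests/integration

# Run all integration tests; each one processes a sample image with
# darktable-cli and compares the result against stored expected output.
./run.sh
```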
I'm on Fedora, ROCM is on 6.1.2 and it still has issues ...
On dt master we now at least check the available OpenCL memory and bumped the requirement to 800 MB. So those cards won't start at least :-) And it doesn't hurt.
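A minimal sketch of that kind of gate (function name and units are hypothetical; the real check lives in darktable's OpenCL device setup code):

```python
MIN_OPENCL_MEM_MB = 800  # threshold mentioned above

def device_usable(reported_mem_mb: int) -> bool:
    """Reject devices reporting less OpenCL memory than the minimum.

    A broken APU driver reporting only 512 MB would be filtered out here,
    while a correctly reported discrete card passes.
    """
    return reported_mem_mb >= MIN_OPENCL_MEM_MB

print(device_usable(512))   # misreported APU -> False
print(device_usable(8192))  # discrete 8 GB card -> True
```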
It still reports the memory as dedicated instead of shared/unified. I havent tested this change yet.
So unfortunately we can't check that (yet).
Ive seen very minimal reports/issues with dt when they use a dedicate GPU from AMD
We had lots of them, but I think we have found most of the dt OpenCL bugs by now.
This is the most recent discussion I found about glitches/stripes when using OpenCL with Darktable. I'm not sure if this is exactly the same, however, here are my findings:
- Setting: RX6600XT with 8 GB on Arch Linux with rocm-opencl-runtime 6.0.2-1 (I tried the AUR package with ROCm 6.2, but there was no difference)
- The number of stripes seems to vary depending on how much graphics memory is given to darktable (playing with "use all device memory" and "darktable resources"): the more memory, the fewer stripes there are
- Not all images are affected; it basically depends on the enabled modules and the input size of the image (in my case, a 45 MP raw image)
- The stripes only appear on exported images, and only if "high quality processing" is enabled
- They do not appear if OpenCL is deactivated or if I use RustiCL. RustiCL, however, seems noticeably slower than the AMD OpenCL implementation
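The memory dependence described above is consistent with tiled processing: with less usable device memory, the image is split into more tiles, and each tile boundary is a potential seam where artifacts can appear. A rough illustration follows; the formula and the overhead factor are made up for the sketch and are not darktable's actual tiling code:

```python
import math

def tile_count(image_mb: float, usable_device_mb: float, overhead: float = 3.0) -> int:
    """Return how many tiles an image must be split into if each tile,
    plus intermediate buffers (modelled by `overhead`), has to fit in memory."""
    needed = image_mb * overhead
    return max(1, math.ceil(needed / usable_device_mb))

# A 45 MP float RGBA image is roughly 45e6 * 4 channels * 4 bytes ≈ 720 MB.
print(tile_count(720, 4096))  # plenty of memory -> 1 tile, no seams
print(tile_count(720, 512))   # tight memory -> several tiles, more seams
```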
I played a lot with different parameters, and I found one that makes the glitches/stripes disappear when using AMD ROCm, independently of how much graphics memory is given to darktable: with "pinned memory" set to 1 (enforce pinned memory), I don't get the glitches/stripes! The documentation only discusses the performance impact, but pinned memory seems to affect stability here as well.
It would be interesting to see if this also helps in your case, @tpapp
Cannot reproduce on
- darktable 4.8.1 on Fedora 40 (6.10.10-200.fc40.x86_64)
- AMD Radeon 680M
- rocm 6.2.1
This issue has been marked as stale due to inactivity for the last 60 days. It will be automatically closed in 300 days if no update occurs. Please check if the master branch has fixed it and report again or close the issue.
As commented here (https://github.com/darktable-org/darktable/issues/16378#issuecomment-2263729085), I can confirm that @denis-martin's workaround of turning on "memory pinning" works; however, it comes with a slight performance penalty, as described in the comment.