Unable to successfully crunch GPU WUs on ArchLinux with default systemd settings
Describe the bug
I am unable to complete any GPU task on my ArchLinux-based system using an AMD Radeon 5700XT.
For some projects, like milkyway@home, the task errors out immediately; for others, like primegrid, tasks proceed very slowly, never complete (Computation Error), and restart.
This happens only when I start boinc using systemd. I noticed that running it from another folder and as another user worked, so I started removing systemd options until I found that ProtectSystem=strict was the source of the issue: with that option commented out, the problem goes away.
The error wasn't very clear at first sight as I got the following error message:
[13:01:23][57640][INFO ] Application startup - thank you for supporting Einstein@Home!
[13:01:23][57640][INFO ] Starting data processing...
[13:01:23][57640][INFO ] Using OpenCL platform provided by: Advanced Micro Devices, Inc.
[13:01:23][57640][INFO ] Using OpenCL device "gfx1031" by: Advanced Micro Devices, Inc.
[13:01:23][57640][ERROR] Couldn't create OpenCL command queue (error: -6)!
[13:01:23][57640][INFO ] OpenCL shutdown complete!
[13:01:23][57640][ERROR] Demodulation failed (error: 2013)!
[13:01:23][57640][WARN ] Sorry, at the moment your system doesn't have enough free CPU/GPU memory to run this task!
Possibly related to https://github.com/BOINC/boinc/issues/3355, although that issue seems more focused on virtual boxes.
Steps To Reproduce
- Install boinc from the official ArchLinux repo:
  pacman -S boinc
- Install OpenCL:
  yay -S opencl-amd opencl-headers
  (note that every OpenCL implementation I tried led to the same issue: mesa, rocm, ...)
- Add a project and start doing some GPU work
- Computation Error appears for every task
Expected behavior
GPU tasks should be able to complete.
System Information
- OS: ArchLinux 5.19.12
- BOINC Version: 7.20.2
Additional context
Simple scripts that create OpenCL command queues work as expected.
I confirm this. It is not specific to Arch, but it does seem specific to AMD GPUs that depend on ROCm for OpenCL, at least for the Einstein@Home project. That is, it doesn't appear to be an issue with GPUs running on the 'legacy' OpenCL.
https://boinc.berkeley.edu/forum_thread.php?id=14786
The puzzling thing is why this is only an issue after BOINC 7.16.17 (in my experience anyway), since the systemd hardening was introduced in commit 78035bc14ef85fd2a69127271cc201c00d6e9730, which seems to have been present since BOINC 7.16.1.
To summarize: ProtectSystem=strict causes OpenCL jobs to fail with error -6 (not enough memory). The actual reason is that AMD comgr (the compiler support library for ROCm LLVM) needs to compile some stuff under /tmp/comgr-XXXXXX (where XXXXXX is a random alphanumeric string), which it can't, because the whole of /tmp is read-only.
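As a rough way to check the diagnosis above, the following sketch tries to create a comgr-style scratch directory and reports whether that works. The function name `can_create_comgr_dir` is hypothetical, and the script only imitates the directory naming described in this thread; it does not call comgr itself.

```shell
#!/bin/sh
# Hypothetical helper: try to create a scratch directory named like the
# /tmp/comgr-XXXXXX directories mentioned in this thread.
# Returns 0 if the directory could be created, 1 otherwise.
can_create_comgr_dir() {
    base="${1:-/tmp}"
    # mktemp -d replaces the trailing XXXXXX with random characters.
    dir=$(mktemp -d "$base/comgr-XXXXXX" 2>/dev/null) || return 1
    # Clean up the probe directory again.
    rmdir "$dir"
    return 0
}

if can_create_comgr_dir /tmp; then
    echo "/tmp is writable: a comgr-style scratch directory can be created"
else
    echo "/tmp is NOT writable: this matches the failure described in this thread"
fi
```

The check is only meaningful when it runs inside the same sandbox as the BOINC client. One way to approximate that, assuming the script is saved as `check-tmp.sh` (a made-up name), might be `systemd-run --wait --pipe -p ProtectSystem=strict sh check-tmp.sh`; I have not verified this against the exact unit shipped by the distribution.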
A common workaround was to downgrade protection by using ProtectSystem=full (or to remove the option completely), but a closer fix would be to add PrivateTmp=true. This is actually suggested in boinc-client.service, although it is commented out (because "Since Atlas requires setuid root, they break Atlas"). PrivateTmp=true also carries the comment "#Block X11 idle detection".
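For anyone still on an affected version, the two workarounds mentioned in this thread could be applied with a drop-in rather than by editing the packaged unit. This is a minimal sketch: the drop-in path is an assumption (it is where `systemctl edit boinc-client.service` normally places overrides), and the caveats quoted from boinc-client.service (Atlas, X11 idle detection) still apply to PrivateTmp=true.

```ini
# Assumed location: /etc/systemd/system/boinc-client.service.d/override.conf
[Service]
# Option A: keep ProtectSystem=strict, but give the service its own
# private, writable /tmp so comgr can create /tmp/comgr-XXXXXX.
PrivateTmp=true

# Option B: relax the protection level instead, so /tmp stays writable.
# Uncomment this and drop Option A if PrivateTmp breaks something for you.
#ProtectSystem=full
```

After saving the file, `systemctl daemon-reload` followed by a restart of boinc-client.service should pick up the change.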
This seems to have been resolved by #4953 (since BOINC 7.22.0). At least, running 7.24.1 no longer requires me to fudge the systemd configuration (was only running 7.20.5 previously).