Unable to crunch GPU WUs successfully on ArchLinux with default systemd settings

Open MattBlack85 opened this issue 2 years ago • 3 comments

Describe the bug I cannot complete any GPU task on my ArchLinux-based system with an AMD Radeon 5700XT. For some projects, like milkyway@home, the task errors out immediately; for others, like primegrid, tasks proceed very slowly, never complete (Computation Error), and restart. This happens only when I start boinc through systemd. I noticed that running it from another folder as another user worked, so I removed systemd options one by one until I found that ProtectSystem=strict was the source of the issue: commenting it out makes the problem go away.
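To check which sandboxing options a unit actually sets, the unit text can be scanned for a given directive. The following is only a minimal sketch: `unit_directives` and `SAMPLE_UNIT` are hypothetical names, the sample is an invented excerpt and not the real boinc-client.service, and the parser ignores line continuations and drop-in overrides.

```python
def unit_directives(unit_text: str, key: str) -> list[str]:
    """Return every value assigned to `key` in a systemd unit file's text."""
    values = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments (# or ;) and section headers such as [Service].
        if not line or line[0] in "#;[":
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            values.append(value.strip())
    return values


# Hypothetical excerpt for illustration only, not the shipped unit file.
SAMPLE_UNIT = """\
[Service]
ProtectSystem=strict
#PrivateTmp=true
"""

if __name__ == "__main__":
    print(unit_directives(SAMPLE_UNIT, "ProtectSystem"))
    print(unit_directives(SAMPLE_UNIT, "PrivateTmp"))
```

On a real system the text could come from `systemctl cat boinc-client.service`, assuming that is the unit name your distribution uses.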

The cause wasn't obvious at first sight, as I got the following error message:

[13:01:23][57640][INFO ] Application startup - thank you for supporting Einstein@Home!
[13:01:23][57640][INFO ] Starting data processing...
[13:01:23][57640][INFO ] Using OpenCL platform provided by: Advanced Micro Devices, Inc.
[13:01:23][57640][INFO ] Using OpenCL device "gfx1031" by: Advanced Micro Devices, Inc.
[13:01:23][57640][ERROR] Couldn't create OpenCL command queue (error: -6)!
[13:01:23][57640][INFO ] OpenCL shutdown complete!
[13:01:23][57640][ERROR] Demodulation failed (error: 2013)!
[13:01:23][57640][WARN ] Sorry, at the moment your system doesn't have enough free CPU/GPU memory to run this task!
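The `-6` in the log above is a raw OpenCL status code. As a rough aid for reading such logs, here is a small lookup for the first few codes defined in the Khronos `cl.h` header; `OPENCL_ERRORS` and `opencl_error_name` are hypothetical helper names, and the table is deliberately incomplete.

```python
# Subset of the status codes defined in the Khronos OpenCL header (cl.h).
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}


def opencl_error_name(code: int) -> str:
    """Translate a numeric OpenCL status into its symbolic name, if known."""
    return OPENCL_ERRORS.get(code, f"unknown OpenCL error {code}")


if __name__ == "__main__":
    print(opencl_error_name(-6))
```

The symbolic name for -6 would explain why the application reports a lack of memory, even though the name alone does not say what actually went wrong on the host.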

Possibly related to https://github.com/BOINC/boinc/issues/3355, although that issue seems more focused on virtual boxes.

Steps To Reproduce

  1. Install boinc from the official ArchLinux repo: pacman -S boinc
  2. Install OpenCL: yay -S opencl-amd opencl-headers (note that every OpenCL implementation I tried led to the same issue: mesa, rocm, ...)
  3. Add a project and start doing some GPU work
  4. Computation Error appears for every task

Expected behavior GPU tasks should be able to complete

System Information

  • OS: ArchLinux 5.19.12
  • BOINC Version: 7.20.2

Additional context Simple scripts creating OpenCL queues work as expected

MattBlack85 avatar Oct 03 '22 09:10 MattBlack85

I confirm this. It is not specific to Arch, but it does seem specific to AMD GPUs that depend on ROCm for OpenCL, at least for the Einstein@Home project. That is, it doesn't appear to be an issue for GPUs running on the 'legacy' OpenCL.

https://boinc.berkeley.edu/forum_thread.php?id=14786

The puzzling thing is why this is only an issue after BOINC 7.16.17 (in my experience anyway), since the systemd hardening was in 78035bc14ef85fd2a69127271cc201c00d6e9730 and that seems to be present since BOINC 7.16.1.

Wedge009 avatar Oct 08 '22 00:10 Wedge009

To summarize: ProtectSystem=strict causes OpenCL jobs to fail with error -6 (CL_OUT_OF_HOST_MEMORY, reported as not enough memory). The actual reason is that AMD comgr (the compiler support library for ROCm LLVM) needs to compile some stuff under /tmp/comgr-XXXXXX (where XXXXXX is a random alphanumeric string), which it can't, because the whole of /tmp is read-only. The common workaround has been to downgrade protection to ProtectSystem=full (or to remove the option completely). An almost-fix would be to add PrivateTmp=true, which is actually suggested in boinc-client.service, although commented out (because "Since Atlas requires setuid root, they break Atlas"). PrivateTmp=true also carries the comment "#Block X11 idle detection".
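The failure mode described above can be probed without any GPU code by trying to create a scratch directory the way a compiler library might. This is only a sketch under assumptions: `can_create_scratch_dir` is a hypothetical helper, and it imitates the `/tmp/comgr-XXXXXX` naming from this thread rather than calling comgr itself.

```python
import os
import tempfile


def can_create_scratch_dir(parent: str = "/tmp", prefix: str = "comgr-") -> bool:
    """Try to create, then remove, a scratch directory such as /tmp/comgr-XXXXXX."""
    try:
        path = tempfile.mkdtemp(prefix=prefix, dir=parent)
    except OSError:
        # Raised when the parent is read-only, missing, or not writable.
        return False
    os.rmdir(path)
    return True


if __name__ == "__main__":
    print("scratch dir creatable in /tmp:", can_create_scratch_dir())
```

Run from a normal shell this should report True on most systems. Running the same script inside a transient unit, for example with something like `systemd-run -p ProtectSystem=strict --wait --pipe python3 probe.py` (the script name is hypothetical), would be one way to see whether the sandbox alone makes /tmp unwritable.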

madmanxxx avatar Dec 22 '22 23:12 madmanxxx

This seems to have been resolved by #4953 (since BOINC 7.22.0). At least, running 7.24.1 no longer requires me to fudge the systemd configuration (I was only running 7.20.5 previously).

Wedge009 avatar Feb 23 '24 23:02 Wedge009