icinga2 icon indicating copy to clipboard operation
icinga2 copied to clipboard

icinga2 may be removed from Debian armel, mips64el, ppc64el, and riscv64

Open noloader opened this issue 1 year ago • 8 comments

Describe the bug

icinga2 may be removed from Debian armel, mips64el, ppc64el, and riscv64 due to Failure to Build from Source (FTBFS).

To Reproduce

  • armel:

/<<PKGBUILDDIR>>/lib/base/application.hpp:148:33: error: static assertion failed

https://buildd.debian.org/status/fetch.php?pkg=icinga2&arch=armel&ver=2.14.1-1&stamp=1703225119&raw=0

  • mips64el:

152 - base-methods_pluginnotificationtask/truncate_long_output (Failed)

https://buildd.debian.org/status/fetch.php?pkg=icinga2&arch=mips64el&ver=2.14.1-1&stamp=1704107092&raw=0

  • ppc64el:

152 - base-methods_pluginnotificationtask/truncate_long_output (Failed)

https://buildd.debian.org/status/fetch.php?pkg=icinga2&arch=ppc64el&ver=2.14.1-1&stamp=1703225525&raw=0

  • riscv64:

/usr/bin/ld: /usr/lib/riscv64-linux-gnu/libboost_coroutine.so.1.83.0: undefined reference to jump_fcontext' /usr/bin/ld: /usr/lib/riscv64-linux-gnu/libboost_coroutine.so.1.83.0: undefined reference to make_fcontext'

https://buildd.debian.org/status/fetch.php?pkg=icinga2&arch=riscv64&ver=2.14.1-1&stamp=1703250691&raw=0

Expected behavior

A clear and concise description of what you expected to happen.

Screenshots

If applicable, add screenshots to help explain your problem.

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Version used (icinga2 --version):
  • Operating System and version:
  • Enabled features (icinga2 feature list):
  • Icinga Web 2 version and modules (System - About):
  • Config validation (icinga2 daemon -C):
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.

Additional context

I just lurk on some of the Debian mailing lists. I wanted to give the project a heads up. Also see https://lists.debian.org/debian-powerpc/2024/01/msg00001.html.

noloader avatar Jan 03 '24 05:01 noloader

  • armel:

/<>/lib/base/application.hpp:148:33: error: static assertion failed

buildd.debian.org/status/fetch.php?pkg=icinga2&arch=armel&ver=2.14.1-1&stamp=1703225119&raw=0

So looks like no lock-free implementation of std::atomic<uint32_t> available on that platform:

https://github.com/Icinga/icinga2/blob/2c9117b4f71e00b2072e7dbe6c4ea4e48c882a87/lib/base/application.hpp#L147-L148

If someone wants to make it work, go ahead, but I'm not sure if it's worth the time for some old ARM systems.

  • mips64el:

152 - base-methods_pluginnotificationtask/truncate_long_output (Failed)

buildd.debian.org/status/fetch.php?pkg=icinga2&arch=mips64el&ver=2.14.1-1&stamp=1704107092&raw=0

  • ppc64el:

152 - base-methods_pluginnotificationtask/truncate_long_output (Failed)

buildd.debian.org/status/fetch.php?pkg=icinga2&arch=ppc64el&ver=2.14.1-1&stamp=1703225525&raw=0

That was recently added in #9887. Not sure what's going wrong on these architectures. I don't think we have easy access to such machines so help for debugging this is greatly appreciated.

  • riscv64:

/usr/bin/ld: /usr/lib/riscv64-linux-gnu/libboost_coroutine.so.1.83.0: undefined reference to jump_fcontext' /usr/bin/ld: /usr/lib/riscv64-linux-gnu/libboost_coroutine.so.1.83.0: undefined reference to make_fcontext'

buildd.debian.org/status/fetch.php?pkg=icinga2&arch=riscv64&ver=2.14.1-1&stamp=1703250691&raw=0

That actually seems to be an issue in Boost which, according to Debian bug #1059046, should already be fixed there.

julianbrost avatar Jan 03 '24 09:01 julianbrost

Thanks @julianbrost,

Just one comment...

base-methods_pluginnotificationtask/truncate_long_output (Failed)

That was recently added in https://github.com/Icinga/icinga2/pull/9887. Not sure what's going wrong on these architectures. I don't think we have easy access to such machines so help for debugging this is greatly appreciated.

It will probably be easiest to use a KVM/QEMU virtual machine, though emulation will be a little slow. I endure it for a couple of projects I help with, like when I need a s390x machine.

The Gnulib folks recently added instructions for QEMU at https://git.savannah.gnu.org/gitweb/?p=gnulib/maint-tools.git;a=tree;f=platforms/environments/qemu. Once you setup a Debian based machine, switch from Bookworm/Stable to Unstable, then apt-get update && apt-get upgrade. That should get you to a similar machine that exhibits the issue.

And if you don't have time, then I understand that, too. I understand managing a project takes a lot of time.

noloader avatar Jan 03 '24 09:01 noloader

Thanks for pointing at the commit that introduced the issue. Looking at it and the affected architectures, it could be linked to the fact that the page size on those architectures is not 4K.

aurel32 avatar Jan 03 '24 10:01 aurel32

The issue seems to be an confusion between MAX_ARG_STRLEN, which represents the maximum length of a single argument, and syconf(_SC_ARG_MAX), which is the maximum length of the command. Note that the maximum length defaults ot 1/4 of the stack size, and can therefore be smaller than MAX_ARG_STRLEN. This is the case on ppc64el which has a page size of 64kB. Therefore the fix for the plugin would be something like:

--- icinga2-2.14.1.orig/lib/methods/pluginnotificationtask.cpp
+++ icinga2-2.14.1/lib/methods/pluginnotificationtask.cpp
@@ -13,17 +13,12 @@
 #include "base/convert.hpp"

 #ifdef __linux__
-#      include <linux/binfmts.h>
 #      include <unistd.h>
-
-#      ifndef PAGE_SIZE
-//             MAX_ARG_STRLEN is a multiple of PAGE_SIZE which is missing
-#              define PAGE_SIZE getpagesize()
-#      endif /* PAGE_SIZE */
+#      define ARG_MAX sysconf(_SC_ARG_MAX)

 // Make e.g. the $host.output$ itself even 10% shorter to leave enough room
 // for e.g. --host-output= as in --host-output=$host.output$, but without int overflow
-const static auto l_MaxOutLen = MAX_ARG_STRLEN - MAX_ARG_STRLEN / 10u;
+const static auto l_MaxOutLen = ARG_MAX - ARG_MAX / 10u;
 #endif /* __linux__ */

 using namespace icinga;

Then the second issue, is that the test hard codes the way to trigger the issue by using a 1024*1024 size.

aurel32 avatar Jan 03 '24 11:01 aurel32

@aurel32,

That looks like a good patch for Debian to carry until the Icinga folks determine their path forward. It will clear FTBFS on two platforms. And coupled with the Boost fix that @julianbrost provided, that takes the total to three platforms.

I hate to see armel and armv7 left behind, but I think its good progress none the less.

noloader avatar Jan 03 '24 12:01 noloader

The patch is only part of the fix, just the the plugin, not the testsuite. Also playing more with this, I noticed the problem is more complicated.

E2BIG happens when:

  • One argument is bigger than MAX_ARG_STRLEN
  • All arguments + environment is bigger than sysconf(_SC_ARG_MAX)

The default value of sysconf(_SC_ARG_MAX) on all modern Linux systems seems to be 2MiB. The value of MAX_ARG_STRLEN, changes with the page size, so it's 128KiB on systems with 4KiB pages (e.g. x86), 512KiB on systems with 16KiB pages (e.g. some mips systems), and 2MiB on systems with 64KiB pages (e.g. ppc64el).

So on an x86 system you can pass up to 16 arguments of length MAX_ARG_STRLEN, while on ppc64el, you can pass only one.

The current code and testsuite does not look at the number of arguments and assumes that MAX_ARG_STRLEN is the only limit.

aurel32 avatar Jan 03 '24 17:01 aurel32

So looks like no lock-free implementation of std::atomic<uint32_t> available on that platform:

https://github.com/Icinga/icinga2/blob/2c9117b4f71e00b2072e7dbe6c4ea4e48c882a87/lib/base/application.hpp#L147-L148

If someone wants to make it work, go ahead, but I'm not sure if it's worth the time for some old ARM systems.

What about locking instead of requiring a lock-free number? https://github.com/Icinga/icinga2/pull/8430/files#diff-f60819a3bc65b59dde8810c55c02f627b457f605fd27445e523fbd44618c1f10R157

Al2Klimov avatar Jan 04 '24 12:01 Al2Klimov

What about locking instead of requiring a lock-free number? #8430 (files)

The nice thing about atomic operations is that they can't block indefinitely. Using shared memory locking instead just adds a lot more room for error, especially given that we do so nice things like the following:

https://github.com/Icinga/icinga2/blob/2c9117b4f71e00b2072e7dbe6c4ea4e48c882a87/lib/base/application.cpp#L120

So if that would happen while that process held the lock, it would remain locked indefinitely.

Is there really a lack of 32 bit atomic operations on these architectures or is it just that those are not implemented in std::atomic for these platforms or that they have additional memory alignment requirements so that the always part of the check is the problem?

julianbrost avatar Jan 16 '24 09:01 julianbrost