
netatalk2: Intermittent unit test failures

Open VorpalBlade opened this issue 1 year ago • 12 comments

Describe the bug

make[4]: Entering directory '/home/arvid/src/aur/netatalk2/src/build/test/afpd'
PASS: test.sh
FAIL: test
============================================================================
Testsuite summary for netatalk 2.4.4
============================================================================
# TOTAL: 2
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See test/afpd/test-suite.log for debugging.
============================================================================

The log mentioned is quite cryptic to me:

==============================================
   netatalk 2.4.4: test/afpd/test-suite.log
==============================================

# TOTAL: 2
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

System information (uname -a): Linux 6.9.10-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Thu, 18 Jul 2024 18:05:52 +0000 x86_64
Distribution information (/etc/os-release):
NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="38;2;23;147;209"
HOME_URL="https://archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"

.. contents:: :depth: 2

FAIL: test
==========

fopen: No such file or directory
Jul 22 21:44:42.330220 [727438] {afp_config.c:247} (E:AFPDaemon): main: atp_open: Cannot assign requested address
Jul 22 21:44:42.330900 [727438] {dsi_tcp.c:349} (E:DSI): dsi_tcp_init: no suitable network config for TCP socket
Jul 22 21:44:42.330938 [727438] {afp_config.c:351} (E:AFPDaemon): main: dsi_init: Permission denied
Initializing
============
Testing: setuplog("default log_note /dev/tty") ...                                            [ok]
Testing: afp_options_init(&default_options) ...                                               [ok]
Testing: afp_options_parse( ARGNUM, args, &default_options) ...                               [ok]
Testing: configs = configinit(&default_options) ...                                           [error]
FAIL test (exit status: 1)

The system I'm building on does not have netatalk2 running (or even installed), so I don't know what this could be about. I am building with -j30 and distributing the build with distcc, so maybe there is a race condition between tests?
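To make the afpd lines in that log easier to read, here is a minimal sketch of a parser that splits each line into timestamp, PID, source location, level, component, and message. The field layout is inferred only from the three afpd lines quoted above, so other netatalk log lines may not match; `LogRecord` and `parse_log_line` are hypothetical names, not part of netatalk.

```python
import re
from typing import NamedTuple, Optional

# Layout inferred from the three afpd lines in this report (an assumption):
#   <timestamp> [<pid>] {<file>:<line>} (<level>:<component>): <message>
LOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[(?P<pid>\d+)\] "
    r"\{(?P<source>[^:}]+):(?P<line>\d+)\} "
    r"\((?P<level>\w):(?P<component>\w+)\): "
    r"(?P<message>.*)$"
)


class LogRecord(NamedTuple):
    timestamp: str
    pid: int
    source: str      # C source file that emitted the message
    line: int        # line number within that file
    level: str       # single-letter severity, e.g. "E"
    component: str   # e.g. "AFPDaemon" or "DSI"
    message: str


def parse_log_line(text: str) -> Optional[LogRecord]:
    """Return a LogRecord, or None if the line does not have the assumed layout."""
    match = LOG_LINE.match(text.strip())
    if match is None:
        return None
    return LogRecord(
        timestamp=match["timestamp"],
        pid=int(match["pid"]),
        source=match["source"],
        line=int(match["line"]),
        level=match["level"],
        component=match["component"],
        message=match["message"],
    )
```

Applied to the log above, this points at `afp_config.c:247`, `dsi_tcp.c:349`, and `afp_config.c:351` as the places to start reading, while the bare `fopen: No such file or directory` line is returned as `None` because it carries no source location.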

To Reproduce Steps to reproduce the behavior.

Expected behavior A clear and concise description of what you expected to happen.

Environment

  • Server OS: Arch Linux
  • Client OS N/A
  • Netatalk Version 2.4.4

Logs Um, not sure which ones would be relevant here.

Additional context If it is a crash, please attach a stacktrace.

VorpalBlade, Jul 22 '24 19:07

Hm:

  • If it is a race condition, it is awfully reproducible (100%).
  • This only happens when building with makepkg to build a package though. That in itself doesn't add sandboxing, it is just the executor for the package build instructions that Arch Linux uses.
  • It happens even if I don't sandbox the build (which is done with extra layers on top of makepkg).
  • Outside makepkg I can't reproduce it even with high -j.
  • 2.4.0 didn't have this issue when I last built it (I haven't tried the versions in between; I have been too busy to keep up with this). Going back and rebuilding 2.4.0, it now also fails. So something external to netatalk2 changed and caused this. Fun.
  • This is still autotools, I haven't had time to convert to meson yet (oops should have mentioned that in the original post).

I don't know the code of this project at all, so I need some help with what suggestions to try next.
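One thing that could be checked from the build environment: "Cannot assign requested address" is the Linux error text for EADDRNOTAVAIL, which a socket bind returns when the address is not configured on any local interface. The sketch below assumes, without this having been confirmed in the netatalk source, that `configinit()` ends up binding to an address derived from the machine's hostname. It lists which addresses a name resolves to that a TCP socket can actually be bound to. `bindable_addresses` is a hypothetical helper, not a netatalk function.

```python
import socket
from typing import List


def bindable_addresses(host: str, port: int = 0) -> List[str]:
    """Return the addresses `host` resolves to on which a TCP socket can be bound."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # the name does not resolve at all

    usable = []
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.bind(sockaddr)  # port 0 lets the kernel pick a free port
        except OSError:
            # e.g. EADDRNOTAVAIL: the address is not on any local interface
            continue
        usable.append(sockaddr[0])
    return usable
```

Running `bindable_addresses(socket.gethostname())` both inside and outside makepkg would show whether the hostname resolves differently in the two environments. An empty list would be consistent with the errors in the log, though it would not prove that this is the cause.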

VorpalBlade, Jul 22 '24 19:07

The tests on both 2.x and 3.x have gotten unreliable on Linux specifically over the last few weeks, and I've not been able to figure out the root cause yet since the error messages and logs are not helpful. Like you say, it's almost as if something external to netatalk has changed and is interfering with the tests.

Two examples:

  • https://github.com/Netatalk/netatalk/issues/1196 (Debian)
  • https://github.com/Netatalk/netatalk/issues/1273 (Arch)

rdmark, Jul 22 '24 23:07

On my desktop (AMD Zen 3) it managed to build in makepkg, on my laptop (Intel Skylake) it fails.

Might be random though, haven't run it enough times to know.
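To put a number on "random", the failing test can be run repeatedly and the exit statuses counted. This is a minimal sketch; `tally_exit_codes` is a hypothetical helper, and the command to run (for example the `test` binary under `test/afpd`) has to be supplied by the caller.

```python
import subprocess
from collections import Counter
from typing import List


def tally_exit_codes(command: List[str], runs: int) -> Counter:
    """Run `command` `runs` times and count how often each exit status occurs."""
    counts: Counter = Counter()
    for _ in range(runs):
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,  # only the exit status matters here
            stderr=subprocess.DEVNULL,
        )
        counts[completed.returncode] += 1
    return counts
```

A result such as `Counter({0: 17, 1: 3})` would suggest an intermittent failure, whereas a consistent all-pass or all-fail count on a given machine would point to an environmental difference between machines instead.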

VorpalBlade, Jul 22 '24 23:07

Maybe time to run valgrind / ASAN / UBSAN / TSAN if you haven't already done so. A newer compiler or similar (especially likely if it affects only rolling release distros) could easily expose latent issues.

VorpalBlade, Jul 22 '24 23:07

That's a good idea. TBH I've never used C at this level but it's a good learning experience.

The odd thing, though, is that my main dev machine, a very stable Debian Bookworm system, is affected, so it seems unlikely that a new compiler version would have been pushed out... while a VM on my MacBook running the exact same Debian Bookworm version is not affected. Both are x86_64. There is some minute environmental difference here that I haven't figured out yet.
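One cheap way to look for such a difference is to capture the output of `env` on the affected and the unaffected system (or inside and outside makepkg) and compare the two. A minimal sketch, assuming one `NAME=value` pair per line; variables whose values span several lines are not handled, and both function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def parse_env(dump: str) -> Dict[str, str]:
    """Parse `env`-style output (one NAME=value per line) into a dict."""
    variables = {}
    for line in dump.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            variables[name] = value
    return variables


def diff_env(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing variable to (value in a, value in b); None means unset."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }
```

This only covers environment variables. Differences in network configuration, hostname resolution, or installed library versions would need their own comparison.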

rdmark, Jul 22 '24 23:07

It's worth noting that the tests are passing with 2.4.x code in the Arch job in our GitHub CI workflow:

https://github.com/Netatalk/netatalk/actions/runs/10053455499/job/27786261358

rdmark, Jul 23 '24 08:07

> It's worth noting that the tests are passing with 2.4.x code in the Arch job in our GitHub CI workflow

It seems to be very flaky for sure.

VorpalBlade, Jul 23 '24 09:07

I'd be very curious to see if the same issue happens if you get around to setting up the Meson build system in the same environment.

rdmark, Jul 23 '24 12:07

> I'd be very curious to see if the same issue happens if you get around to setting up the Meson build system in the same environment.

I want to, I just have limited time and energy at the moment.

VorpalBlade, Jul 23 '24 12:07

@VorpalBlade Random idea: What happens if you build the entire package with -j1 (apart from taking a long long time)? With Meson, I was able to work around this by forcing sequential and single-threaded execution of the tests.

And yes, the tests should be rewritten and modernized. Something for a rainy day. :)

rdmark, Jul 26 '24 12:07

@VorpalBlade Would you have the opportunity to look at this again any time soon?

Just a heads-up that we're working towards a netatalk4 release now, which will obsolete both netatalk2 and netatalk3. So there's a chance that this bug is moot. :)

rdmark, Sep 18 '24 09:09

I have not seen it with meson. That said, it wasn't 100% reproducible with autoconf either, but I seem to remember that it happened more often than not there. It is probably fixed when using meson.

VorpalBlade, Sep 18 '24 16:09

Closing this as won't-fix for now. Please reopen if you encounter the same issue with netatalk 4.0 or later!

rdmark, Oct 13 '24 09:10