packages atlas-sw-probe: remains hung due to too early start since 23.05?

Maintainer: @ja-pa @BKPepe
Environment: x86_64 / PC Engines APU2 / OpenWrt 23.05.3

Hello,

my atlas (atlas-sw-probe) probe currently remains "hung" after rebooting:

Atlas starts before the WAN link is up and then stays in a hung state. The probe appears to be online/connected at RIPE, but it does not fulfil any measurements at all, which makes this particularly hard to notice.

Restarting Atlas once the WAN link is up works around the issue, but it reappears after the next reboot.

This may have started with the upgrade from 22.03 to 23.05, but I am not certain of that.

Example:

Mon Apr  8 12:00:47 2024 daemon.err ATLAS[2221]: evping: bad address 'yyy.xxxxxxxxxxxxxx.sos.atlas.ripe.net'
Mon Apr  8 12:00:47 2024 daemon.err ATLAS[2221]: evping_ops.init failed
[...]
Mon Apr  8 12:00:59 2024 daemon.notice netifd: Interface 'wan' is now up
Mon Apr  8 12:00:59 2024 user.notice firewall: Reloading firewall due to ifup of wan (pppoe-wan)
[...]
Mon Apr  8 12:03:48 2024 daemon.err ATLAS[2221]: condmv: not moving, destination '/usr/libexec/atlas-probe-scripts/data/out/v6addr.txt' exists
Mon Apr  8 12:03:48 2024 daemon.err ATLAS[2221]: condmv: not moving, destination '/usr/libexec/atlas-probe-scripts/data/out/simpleping' exists
[...]
Mon Apr  8 12:06:51 2024 daemon.err ATLAS[2221]: condmv: not moving, destination '/usr/libexec/atlas-probe-scripts/data/out/v6addr.txt' exists
Mon Apr  8 12:06:51 2024 daemon.err ATLAS[2221]: condmv: not moving, destination '/usr/libexec/atlas-probe-scripts/data/out/simpleping' exists

The condmv errors reappear every 3 minutes as above.

After 1 hour of uptime I can see an eperd error as below:

Mon Apr  8 13:00:42 2024 local6.err eperd[2425]: root: No such file or directory

The condmv errors keep reappearing every 3 minutes.
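Since the probe still shows as connected at RIPE, the repeating condmv lines are the only visible symptom. A minimal sketch of a check that counts them in log output; the function name `count_condmv_errors` is made up for illustration, and the match string is taken from the log lines above:

```shell
#!/bin/sh
# Hypothetical helper: count "condmv: not moving" lines read from stdin.
count_condmv_errors() {
    # grep -c prints the number of matching lines (0 if none);
    # "|| true" hides grep's non-zero exit status when nothing matches.
    grep -c 'condmv: not moving, destination' || true
}

# On the router this could be fed from the system log, e.g.:
#   logread | count_condmv_errors
```

A count that keeps growing across several 3-minute intervals would suggest the probe is in the hung state described here.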

Is atlas supposed to automatically be restarted on a WAN up event?

I noticed that the busybox binary dates back to Jan 15th despite this being 23.05.3, but I assume this is a non-issue?

root@up-rt1:/usr/libexec/atlas-probe/bin# ls -lha
drwxr-xr-x    2 root     root        1.0K Mar 26 18:28 .
drwxr-xr-x    5 root     root        1.0K Mar 26 18:28 ..
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 atlasinit -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 buddyinfo -> busybox
-rwxr-xr-x    1 root     root      417.8K Jan 15  2023 busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 condmv -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 date -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 dfrm -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 eooqd -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 eperd -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 evhttpget -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 evntp -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 evping -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 evsslgetcert -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 evtdig -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 evtraceroute -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 httppost -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 onlyuptime -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 perd -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 rchoose -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 rptaddrs -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 rptra6 -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 rptuptime -> busybox
lrwxr-xr-x    1 root     root           7 Mar 26 18:28 rxtxrpt -> busybox
root@up-rt1:/usr/libexec/atlas-probe/bin#

Thank you,

Lukas

Apr 08 '24 12:04 lukastribus

Try START=99 in the init script, then re-enable the service; it is meant to be run on a configured network.

Apr 08 '24 15:04 brada4

network has START=20, atlas has START=30. Bumping atlas from 30 to 99 did not solve this issue.

Network will be configured before starting atlas, but that doesn't mean that the WAN link will be fully initialized, up and running.
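To illustrate the gap between "network configured" and "WAN usable", here is a rough sketch of a polling helper that waits until a given check succeeds or a timeout expires. `wait_for`, `wan_is_up` and the 60-second value are illustrative and not part of the atlas-sw-probe package; the commented check assumes the logical interface is named `wan` and that `ubus` and `jsonfilter` are available, as on a stock OpenWrt install:

```shell
#!/bin/sh
# wait_for TIMEOUT CMD [ARGS...]
# Hypothetical helper: run CMD about once per second until it succeeds
# or TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
wait_for() {
    wf_timeout="$1"; shift
    wf_elapsed=0
    until "$@"; do
        [ "$wf_elapsed" -ge "$wf_timeout" ] && return 1
        sleep 1
        wf_elapsed=$((wf_elapsed + 1))
    done
    return 0
}

# Example check (assumption: logical interface is named "wan"):
# wan_is_up() {
#     [ "$(ubus call network.interface.wan status | jsonfilter -e '@.up')" = "true" ]
# }
# wait_for 60 wan_is_up && /etc/init.d/atlas start
```

Whether an "up" interface is enough is an open question here: the evping error above is a name resolution failure, so a DNS lookup might be the more meaningful check.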

I tried reproducing the problem on a full VM (Oracle EL8, similar to CentOS 8), simulating the missing uplink by dropping all outgoing packets with iptables (except my ssh session), and the problem does not happen there.

I retract the statement that it may have worked in 22.03; I think I never actually ran atlas on 22.03.

I'm working around this for now by triggering the init.d script on ifup instead of startup:

/etc/init.d/atlas disable

cat << "EOF" > /etc/hotplug.d/iface/99-atlas-on-demand
[ "${ACTION}" = "ifup" ] && [ "${DEVICE}" = "pppoe-wan" ] && {
    logger -t hotplug "Device: ${DEVICE} / Action: ${ACTION} starting atlas from 99-atlas-on-demand"
    /etc/init.d/atlas start
}
[ "${ACTION}" = "ifdown" ] && [ "${DEVICE}" = "pppoe-wan" ] && {
    logger -t hotplug "Device: ${DEVICE} / Action: ${ACTION} stopping atlas from 99-atlas-on-demand"
    /etc/init.d/atlas stop
} 
EOF
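The script above matches on `${DEVICE}`, which is specific to a PPPoE uplink. OpenWrt's iface hotplug events also carry `${INTERFACE}` with the logical interface name, so a variant keyed on `wan` might be more portable. A minimal sketch, with the decision split into a hypothetical function `atlas_action` so it can be checked outside a router:

```shell
#!/bin/sh
# Hypothetical helper: print the init.d action ("start", "stop" or nothing)
# for a given hotplug ACTION ($1) and logical INTERFACE name ($2).
atlas_action() {
    [ "$2" = "wan" ] || return 0   # assumption: the uplink is named "wan"
    case "$1" in
        ifup)   echo start ;;
        ifdown) echo stop ;;
    esac
}

# In /etc/hotplug.d/iface/99-atlas-on-demand this could be used as:
#   act="$(atlas_action "$ACTION" "$INTERFACE")"
#   [ -n "$act" ] && /etc/init.d/atlas "$act"
```

Other events and other interfaces produce no output, so the init script is left alone in those cases.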

Apr 08 '24 19:04 lukastribus

@lukastribus we should really investigate this. AFAIK our scripts lack support for the reboot_probe function. Does your VM use the same scripts as OpenWrt, or the VM ones (which have a different implementation)?

Apr 28 '24 10:04 Ansuel

On the VM it's handled by systemd, so it's quite different.

Apr 29 '24 18:04 lukastribus