machinekit-hal icon indicating copy to clipboard operation
machinekit-hal copied to clipboard

hm2_idrom test fails nondeterministically

Open zultron opened this issue 5 years ago • 1 comments

This happens often on Drone CI, as reported by @cerna. Here is an example, but it doesn't show any information to debug.

zultron avatar Aug 12 '20 21:08 zultron

I can reproduce this usually with about 100-200 repeated runs of the test on the RPi4 I've been using.

Each time, a halcmd ping leaves a core dump in the test directory; halcmd ping is run by realtime as part of starting up. gdb shows it's segfaulting while trying to read something I think is from a shm segment.

I added commands to test.sh to show the processes running in the container, and in the latest failure, runtime.log shows three msgd:0 and one rtapi:0. This leads me to believe there's a race condition between the realtime script starting up, and a previous RTAPI session shutting down: perhaps the halcmd ping attaches to the old shm segment just before it is removed by the previous session, resulting in a segfault.

While that seems like a problem whose underlying cause needs closer attention (why should halcmd be allowed to attach to a segment that's being shut down?), a simpler fix would be to improve realtime stop to ensure that all rtapi:0 and msgd:0 processes are completely shut down and all shm segments are cleaned up.

zultron avatar Aug 31 '20 02:08 zultron