
threads.0 test fails nondeterministically

Open zultron opened this issue 7 years ago • 14 comments

The threads.0 test fails randomly & nondeterministically in Jenkins; here's an example. The error message:

Running test: tests/threads.0
line 924: got 0, expected 10 or 1
*** tests/threads.0: FAIL: checkresult exited with 1

Looking at the test and the related comp, I'm trying to explain these results. The sampler.0 function runs after the threadtest.0.increment function in the fast thread, and the threadtest.0.increment function always increments count, so how could count ever be zero?

Serious bugs, such as memory barrier leaks, compiler optimizations or RT priority problems, might be possible, but hopefully this is actually expected behavior.

I still need a 101 primer on multicore, but wonder if the fast and slow threads are running on different CPUs in Jenkins (this theory is moot if @ArcEye says the Jenkins host is single CPU). If the reset function in the slow thread happened to clear count right in between the increment and sampler functions in the fast thread, sampler could see a zero value. The chance that this would happen on any one cycle is tiny, but likely increased by the additional jitter in the non-RT Jenkins test environment, and with 350 chances (in 3500 samples) per run, this could explain the low frequency, nondeterministic pattern of failures.

If this is the case, the zero would be followed by normally incrementing integers. The checkresult script could be updated to allow this to stop the test failures. Alternatively, a more sophisticated test could detect the condition by running an additional function after sampler, and making the reset function spin on a mutex-like pin that stays set while sampler runs; this would have the advantage of distinguishing between this normal behavior and bad behavior caused by other bugs.
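
The suspected interleaving can be sketched in plain Python (illustrative only, not Machinekit code; the function names merely mirror the comp's HAL functions):

```python
# Sketch of the suspected race: the slow thread's reset() fires between
# the fast thread's increment() and sampler(), so sampler sees a zero.
count = 0
samples = []

def increment():      # fast thread, runs first each cycle
    global count
    count += 1

def sampler():        # fast thread, runs after increment
    samples.append(count)

def reset():          # slow thread: clears count
    global count
    count = 0

# Normal cycles: reset lands between whole fast-thread cycles.
for cycle in range(5):
    increment(); sampler()
reset()
# Preempting cycle: reset fires *between* increment and sampler.
increment(); reset(); sampler()

print(samples)  # [1, 2, 3, 4, 5, 0] -- the last sample is 0 despite increment running first
```

After the zero, incrementing resumes from 1, which is exactly the "followed by normally incrementing integers" signature described above.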

zultron avatar Jan 09 '19 04:01 zultron

AFAIK it is a multicore amd64 server. Not sure however if core allocation is different at times of load from multiple package builds etc.

Didn't we determine that this only seems to happen in Jessie builds however?

ArcEye avatar Jan 09 '19 11:01 ArcEye

@ArcEye, I don't know the history of this error like you would. Is that true? Do you think we ought to enable tests for both Jessie and Stretch to be sure? That would certainly leave a hole in my theory.

zultron avatar Jan 14 '19 08:01 zultron

@ArcEye, over at machinekit/machinekit#1442 you said

@zultron at some point I am sure you said that the threads.0 test error only occurred in Jessie.

I may have said that, but I sure don't remember. Is that additional reason to enable Stretch tests, too?

zultron avatar Jan 14 '19 09:01 zultron

Jessie is EOL next year anyway, so no harm switching to Stretch and seeing what happens.

When I tested some while back, I didn't get this error on Stretch, Buster or sid and had to revert to Jessie to experience it.

It is just possible that it is related to compiler version and full atomic-operations compliance with regard to threads?

Wheezy certainly did not have much atomic operation support, which was part of the reason why we dropped it as soon as possible.
I thought Jessie was OK, but it has to be something like this, and the compiler is an obvious place to look.

ArcEye avatar Jan 14 '19 09:01 ArcEye

Over in machinekit/machinekit#1442, @ArcEye said tests were switched back to Stretch already. This means my theory is still valid, and it's still worthwhile to implement a simple test inside the threadtest comp that will prove or disprove it.

If disproven, then yes, I'd want next to start looking at atomic operations, memory barrier problems and such, but I'm just hoping those more complex, more serious issues aren't the problem.

zultron avatar Jan 15 '19 01:01 zultron

This error started happening more often after implementation of support for multiple CI providers, most often in Travis CI on arm64 runners building the Debian Buster version.

Much less often on amd64 Debian Buster test build.

Given the weaker memory model of the ARM architecture, this is something that should be taken into account and monitored in the future.

I will try to keep statistics on this.


  • 19-08-2020, pull request #298, Travis CI, ARM64 Debian Buster runtests job failed on: tests/threads.0 (107.2)
  • 19-08-2020, pull request #298, Travis CI, ARM64 Debian Buster runtests job failed on: tests/threads.0 (107.2) (second time)
  • 28-08-2020, pull request #303, Travis CI, ARM64 Debian Buster runtests job failed on: tests/threads.0 (118.2)
  • 28-08-2020, pull request #303, Travis CI, ARM64 Debian Stretch runtests job failed on: tests/threads.0 (118.4)
  • 28-08-2020, pull request #303, Travis CI, ARM64 Debian Buster runtests job failed on: tests/threads.0 (126.2)
  • 29-08-2020, merge of PR #305, Travis CI, ARM64 Debian Buster runtests job failed on: tests/threads.0 (130.2)
  • 31-08-2020, merge of PR #307, Travis CI, ARM64 Ubuntu Focal runtests job failed on: tests/threads.0 (137.2)
  • 31-08-2020, pull request #309, Travis CI, ARM64 Debian Stretch runtests job failed on: tests/threads.0 (138.8)
  • 01-09-2020, merge of PR #309, Travis CI, ARM64 Debian Stretch runtests job failed on: tests/threads.0 (139.8)
  • 04-09-2020, pull request #312, Travis CI, ARM64 Ubuntu Bionic runtests job failed on: tests/threads.0 (142.6)
  • 05-09-2020, merge of PR #312, Travis CI, ARM64 Debian Stretch runtests job failed on: tests/threads.0 (143.8)
  • 05-09-2020, merge of PR #312, Travis CI, ARM64 Debian Stretch runtests job failed on: tests/threads.0 (143.8) (second time)
  • 11-09-2020, merge of PR #316, Travis CI, ARM64 Ubuntu Focal runtests job failed on: tests/threads.0 (153.2)
  • 29-09-2020, merge of PR #319, Travis CI, ARM64 Debian Stretch runtests job failed on: tests/threads.0 (160.8)
  • 30-09-2020, merge of PR #320, Travis CI, ARM64 Ubuntu Bionic runtests job failed on: tests/threads.0 (162.6)
  • 09-10-2020, pull request #313, Travis CI, ARM64 Ubuntu Focal runtests job failed on: tests/threads.0 (164.2)
  • 09-10-2020, merge of PR #313, Travis CI, ARM64 Debian Buster runtests job failed on: tests/threads.0 (165.4)
  • 09-10-2020, pull request #323, Travis CI, ARM64 Debian Stretch runtests job failed on: tests/threads.0 (166.8)
  • 18-10-2020, pull request #325, Travis CI, ARM64 Ubuntu Focal runtests job failed on: tests/threads.0 (169.2)
  • 18-10-2020, pull request #325, Travis CI, ARM64 Ubuntu Bionic runtests job failed on: tests/threads.0 (169.6)
  • 18-10-2020, pull request #325, Drone Cloud, AMD64 Debian Stretch runtests job failed on: tests/threads.0 (99.10)
  • 19-10-2020, merge of PR #325, Travis CI, ARM64 Debian Buster runtests job failed on: tests/threads.0 (170.4)
  • 20-10-2020, merge of PR #327, Travis CI, ARM64 Ubuntu Focal runtests job failed on: tests/threads.0 (172.2)
  • 20-10-2020, merge of PR #327, Travis CI, ARM64 Debian Buster runtests job failed on: tests/threads.0 (172.4)
  • 20-10-2020, merge of PR #327, Travis CI, ARM64 Ubuntu Bionic runtests job failed on: tests/threads.0 (172.6)
  • 20-10-2020, merge of PR #327, Travis CI, ARM64 Debian Stretch runtests job failed on: tests/threads.0 (172.8)
  • 18-11-2020, merge of PR #332, Travis CI, ARM64 Ubuntu Focal runtests job failed on: tests/threads.0 (178.2)
  • 18-11-2020, merge of PR #332, Travis CI, ARM64 Debian Buster runtests job failed on: tests/threads.0 (178.4)
  • 18-11-2020, merge of PR #332, Travis CI, ARM64 Debian Stretch runtests job failed on: tests/threads.0 (178.8)

cerna avatar Aug 14 '20 21:08 cerna

The original purpose of this test in the LCNC project was to demonstrate that HAL threads (which run on a single CPU; LCNC-HAL doesn't have multi-core) have rate-monotonic scheduling, i.e. the slow thread will never preempt the fast thread. I've done some work to show that these test failures are due either to lack of rate-monotonic scheduling, when threads fail to set SCHED_FIFO, or to the threads running in parallel on multiple CPUs, in which case scheduling is not to blame.

The short of it is, to make the test reliable, we need at least three things:

  • The rtapi_app needs to fail when SCHED_FIFO can't be set so that we know to fix CAP_SYS_NICE privileges
  • Docker containers must be run with --cap-add=sys_nice for SCHED_FIFO privileges to be available
  • The test must be updated to run both threads on the same CPU to prevent the appearance of preemption
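
As a quick check of the first two points, one can probe from inside the container whether SCHED_FIFO is actually settable. This is a hedged sketch; `can_set_sched_fifo` is my own name, not a Machinekit API:

```python
import os

def can_set_sched_fifo():
    """Return True if the caller may set SCHED_FIFO (needs CAP_SYS_NICE or root)."""
    try:
        # Attempt to give the current process SCHED_FIFO at a low priority
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(1))
    except PermissionError:
        return False
    # Restore the default policy so the probe has no lasting effect
    os.sched_setscheduler(0, os.SCHED_OTHER, os.sched_param(0))
    return True

print(can_set_sched_fifo())
```

In a container started without `--cap-add=sys_nice` (or `--privileged`) this should print `False`, which is exactly the condition rtapi_app ought to fail loudly on.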

Instrumenting the test to further prove the slow thread preempts the fast

I added two types of instrumentation along the lines of https://github.com/machinekit/machinekit-hal/issues/173#issue-397211641 and ran it on a single-core machine.

Mutex: The original fast thread function sets a mutex, the sampler function executes, and then a (new) second function on the fast thread clears the mutex. The original reset function on the slow thread checks whether the mutex is set; the intention is to further prove that the count pin is reset to zero because the slow thread is preempting the fast thread.

Timestamps: Each of the three functions now writes its start and stop time on new pins. Each of these three functions is followed by its own dedicated sampler instance that records the function's start and stop time, the mutex state, and of course the count. The checkresult merges all the timestamps from these three sources, in order, where it again becomes plain from the respective start and stop timestamps when the slow thread has preempted the fast thread.
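
A rough Python rendering of the mutex-pin idea (illustrative only; the real instrumentation lives in the comp and uses HAL pins, and all names here are mine):

```python
# Sketch of the mutex-pin instrumentation: the slow thread's reset()
# records whether it fired while the fast thread was mid-cycle.
state = {"count": 0, "in_fast_cycle": False, "preempted": False}

def fast_increment():
    state["in_fast_cycle"] = True    # mutex-like pin set
    state["count"] += 1

def fast_sample():
    return state["count"]            # what sampler would record

def fast_release():
    state["in_fast_cycle"] = False   # pin cleared by the new second function

def slow_reset():
    if state["in_fast_cycle"]:       # slow thread caught the fast thread mid-cycle
        state["preempted"] = True
    state["count"] = 0

# Replay the failing interleaving: reset lands between increment and sample.
fast_increment()
slow_reset()
zero = fast_sample()
fast_release()
print(zero, state["preempted"])  # 0 True
```

The point of the pin is that a zero sample alone is ambiguous, but a zero sample plus a set mutex pin pins the blame on preemption rather than, say, a memory-ordering bug.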

The reason I went through all this apparently redundant trouble was to convince myself that the failures weren't due to something else like memory barrier leakage, and now I am quite convinced.


Failures setting SCHED_FIFO

HAL implements rate-monotonic scheduling in POSIX threads by setting the SCHED_FIFO policy with higher priority for threads with a shorter period and lower priority for threads with a longer period. These test failures show that, when execution is limited to a single CPU, the failures are due to broken scheduling.
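
The priority assignment itself is simple; a sketch under the assumption of an arbitrary priority ceiling (`prio_max=80` and the function name are illustrative, not HAL's actual values):

```python
# Rate-monotonic assignment: the shorter the period, the higher the priority.
def rate_monotonic_priorities(periods_ns, prio_max=80):
    # Sort unique thread periods ascending; shortest period gets highest prio
    ordered = sorted(set(periods_ns))
    return {p: prio_max - i for i, p in enumerate(ordered)}

# e.g. a 1 ms fast thread and a 10 ms slow thread
print(rate_monotonic_priorities([1_000_000, 10_000_000]))
# {1000000: 80, 10000000: 79}
```

Under SCHED_FIFO this mapping is what guarantees the slow thread can never preempt the fast one on a single CPU.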

The main reason for failures is a missing CAP_SYS_NICE capability, which is required to set the SCHED_FIFO policy. The rtapi_app binary is setuid root partly to obtain this privilege (and also for PCI access). I happened to notice a log message pointing this out; unfortunately, whether it is printed seems to depend on how realtime is run, so it was obscured for quite a while! This missing message is a problem with logging that needs to be fixed. That rtapi_app doesn't simply exit when setting SCHED_FIFO fails is another problem that needs to be fixed.

It turned out that I had forgotten to run my Docker container with --cap-add=sys_nice (or --privileged). Once this was corrected, the test began passing reliably. In the GitHub Actions workflows and Travis-CI the Docker containers aren't started with this capability, which will cause failures. I'm suspicious whether Drone CI even offers the possibility.


Running on multiple CPUs will cause failures

The other reason the test can fail is if both threads aren't running on the same CPU. HAL threads' CPU affinity feature should do the trick, although I have had trouble with it in the past (likely operator error).
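
For reference, the underlying mechanism is the `sched_setaffinity(2)` call; a minimal Linux-only sketch (`pin_to_cpu` is my own name, not the HAL API):

```python
import os

# Pin the calling process/thread to one CPU so the fast and slow threads
# cannot run in parallel (Linux-only; pid 0 means "the caller").
def pin_to_cpu(cpu=0):
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)
```

HAL exposes thread CPU affinity through its own thread-creation interface; this only illustrates the system call that interface ultimately relies on.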

To help diagnose this mode of failure, yet more instrumentation might be added to the test to show which CPUs the threads are running on.

zultron avatar Sep 01 '20 03:09 zultron

I will try to comment on the points where I actually know the answer:

In the GitHub Actions workflows and Travis-CI the Docker containers aren't started with this capability, which will cause failures. I'm suspicious whether Drone CI even offers the possibility.

GitHub Actions workflows run in a newly created VM, so there should be no problem with setting privileged mode or any capability on the container. Travis CI is a bit more interesting in this respect: the amd64 runners also use a full VM, but the arm64 runners are actually LXC/LXD containers. And from their blog post, Travis is preparing to switch the amd64 runners to LXC/LXD as well. That has the downside that some operations are disallowed. (Manual: I have so far no idea if it is going to prevent setting the nice capability; probably best to test it.)

Drone Cloud is a no-go. In Drone, you can run privileged containers, but not in the public Drone Cloud; it's too close to the real hardware. (This is also one of the problems [well, the main one, actually] which prevents creating a container in one run and using it in another without uploading it to and downloading it from a remote registry.)

Drone is nice in that you get pretty much the whole server (shared with all the actual load), so you are building on a machine with more than 60 processors and it is fast. But alas, the limits on what you can do are palpable.

This missing message is a problem with logging that needs to be fixed. That rtapi_app doesn't simply exit when setting SCHED_FIFO fails is another problem that needs to be fixed.

You mean this error message? (BTW, this code does not look good, given the fact that tests run in flavour POSIX and that one has no FLAVOR_IS_RT set.) I don't know; I think it is the right place to report the error, and I don't think that the rtapi_app should prematurely sound an alarm. If the logging or error notification is lost in transit, then that should be repaired. (This is probably also connected to #126.)


On a connected note, I have in my jotter the point that the requirement that the first thread have the smallest period and every subsequent thread a longer period than the previous one should be removed, as it has no place in a multi-core system where threads can be affixed to different cores. But this goes against an assumption in this thread: the higher-frequency thread has the higher priority.

One way to solve this and still keep this test and its assumptions (because threads.0 tests a different thing in Machinekit-HAL than in LinuxCNC) is to add a thread priority abstraction to Machinekit's Runtime API, which will allow the user to specify this value. The flavouring abstraction will then translate it for the system it runs under (basically a POSIX sched call; both the vanilla kernel and all RT kernels implement it as POSIX, but the possibility remains that some future system will not).

cerna avatar Sep 01 '20 15:09 cerna

Hmm, when I set the function which adds new tasks to fail when realtime_set_priority() fails, and also add --cap-add=sys_nice to the Docker run command which executes debian/ripruntests.py, then in Travis CI everything passes OK on the amd64 runners, but on arm64 I get 48 failed tests. Which is what I was afraid of.

On the other hand, I consider it a bug that tests are run for a property which is not available on a given flavour. In other words, if some tests require SCHED_FIFO but a given flavour does not necessarily have the ability to set SCHED_FIFO, then these tests should not run, or the failure should be expected.

Maybe flavour should export this in metadata. (Another note to my long-running flavour rework.)


Other than that, I will check whether other CI services with amd32/arm64 runners offer VM-based systems.

cerna avatar Sep 01 '20 17:09 cerna

This missing message is a problem with logging that needs to be fixed. That rtapi_app doesn't simply exit when setting SCHED_FIFO fails is another problem that needs to be fixed.

You mean this error message? (BTW, this code does not look good

Oops, I said that wrong. The message that was intermittently appearing was your second link, "to get non-preemptive scheduling with POSIX threads," etc. I'm seeing that again right now, and just tracked it down to syslog_async, another known problem that has plagued us for years now.

tests run in flavour POSIX and that one has no FLAVOR_IS_RT set

Yes, you've pointed out yet another problem! It's not only possible but necessary to set SCHED_FIFO and the sched_priority in non-RT POSIX threads. Since rate-monotonic scheduling is a requirement for now, then those MUST be set or rtapi_app should exit, complaining loudly. That logic needs to be fixed.

The main difference in how MK should handle non-RT threads is rather to ignore the RTP_DEADLINE_MISSED exception. Currently it reports it as "unspecified" but doesn't actually ignore it as it should. I also think that to avoid accidents, non-RT threads should never be automatically selected, and if the user doesn't explicitly ask for them, then rtapi_app should fail if no RT thread flavor is available.

the requirement that the first thread have the smallest period and every subsequent thread a longer period than the previous one should be removed, as it has no place in a multi-core system where threads can be affixed to different cores

This is the rate-monotonic scheduling implementation that MK inherited from LCNC, surely chosen because of the historically widespread use of bit-banging steppers over the LPT port, with a high-frequency, jitter-sensitive stepper thread and a servo thread on the same CPU. And that use case shows why rate-monotonic is a reasonable default strategy, even if someone does eventually introduce other scheduling policies.

So, the threads.0 test is still a valid and useful test, but only if the threads can be forced onto the same CPU with SCHED_FIFO priorities.

zultron avatar Sep 02 '20 02:09 zultron

[...] I'm seeing that again right now, and just tracked it down to syslog_async, another known problem that has plagued us for years now [...]

Great! Another long-standing problem starting to bite more and more.

Since rate-monotonic scheduling is a requirement for now, then those MUST be set or rtapi_app should exit, complaining loudly. That logic needs to be fixed.

It is probably just semantics, but I think the thread creation function should complain loudly, not the rtapi_app per se.

[...] The main difference in how MK should handle non-RT threads is rather to ignore the RTP_DEADLINE_MISSED exception. Currently it reports it as "unspecified" but doesn't actually ignore it as it should [...]

Didn't know that. That could be the error seen in QEMU-based runtests runs.

[...] I also think that to avoid accidents, non-RT threads should never be automatically selected [...]

Yes, we talked about that when I was presenting my flavour-as-a-module (separate .deb package) idea.

[...] And that use case shows why rate-monotonic is a reasonable default strategy, even if someone does eventually introduce other scheduling policies [...]

Sure, but you would probably be better off locking the step-gen to a separate core now.


I have created a few questions about setting the SCHED_FIFO policy on arm64-based runners in the currently used CIs, and it does not look good. The Drone says that it is not possible outright, the Travis-CI people haven't answered yet, but based on other answers it doesn't look good.

I am trying to investigate Shippable, so we will see. I wonder if there is a chance of getting an Arm64/Arm32 server from WorksOnArm...

cerna avatar Sep 02 '20 19:09 cerna

I started working on removing syslog_async, which was easy enough, but then found that none of rtapi_app's messages seem to be actually going through the message ring, and then got stuck on that for now. I'll have to deal with it later. (I probably broke that earlier by myself. :P)

you would probably be better off locking the step-gen to a separate core now

If you have more than one CPU, yes. But some hardware used in the community only has a single CPU, like low-end SoCs. This test is focused on the behavior of multiple threads with different priorities on a single CPU, which IMO should be supported.

The Drone says that it is not possible outright, the Travis-CI people haven't answered yet, but based on other answers it doesn't look good.

I'm surprised about Travis, since we get to run our own Docker container, but confirmed. I think for now we can just disable the test on Travis, since it's valuable to run the others. Later, we can decide if it's OK to relax the SCHED_FIFO requirement, perhaps as an option, so that Machinekit can run in unprivileged environments.

I've hit my time limit for this for now, but I'll get back to it soon.

zultron avatar Sep 03 '20 00:09 zultron

I'm surprised about Travis, since we get to run our own Docker container, but confirmed. I think for now we can just disable the test on Travis, since it's valuable to run the others. Later, we can decide if it's OK to relax the SCHED_FIFO requirement, perhaps as an option, so that Machinekit can run in unprivileged environments.

Well, I think that it is caused by the AppArmor profile the Travis people are using on the LXC/LXD platform. Modern CI services for Arm are just not used to dealing with projects requiring low-level Linux functionality, much less actual kernel patching (not just module loading). And one probably cannot blame them: the majority of what people are testing are high-level user-space applications, JavaScript libraries or Electron apps. As far as I know, nobody allows specifying your own VM image, which would be a blessing to projects like Machinekit's.

[...] I started working on removing syslog_async, which was easy enough [...]

I have had unclean thoughts in the past about replacing the syslog_async external code with something more modern and actively developed. The last few years gave rise to a few very interesting libraries like mini-async-log-c, NanoLog and others. I am not saying that the fact that these are newer means they are better, but maybe their authors would actually be available to answer the questions originating from trying to push one onto Machinekit.

[...] which IMO should be supported [...]

Not trying to argue otherwise. But it would be nice if the multi-core functionality was a bit more in the spotlight.

cerna avatar Sep 03 '20 00:09 cerna

Travis-CI should now support arm64-based virtual machines as test runners, not just LXD containers as it was up till now.

More in the blog post.

The best course of action will be to run the tests in a VM and the builds in an LXD container.

cerna avatar Nov 18 '20 01:11 cerna