seL4 icon indicating copy to clipboard operation
seL4 copied to clipboard

investigate SCHED0011 failure on odroid_xu4

Open lsf37 opened this issue 1 year ago • 8 comments

This could be just a fluke, but I have not seen this failure before, so we should investigate. The scheduler accuracy test failed for config ODROID_XU4_debug_MCS_clang_32:

  	<testcase classname="sel4test" name="SCHED0011">
  Running test SCHED0011 (Test scheduler accuracy)
  		<error>Check diff(103054000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103034000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103056000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103054000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103050000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103050000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103058000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103044000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103041000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  		<error>Check diff(103039000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
  Test SCHED0011 failed
  		<failure type="failure">result == SUCCESS at line 296 of file /github/workspace/projects/sel4test/apps/sel4test-driver/src/testtypes.c</failure>
  		<error>result == SUCCESS at line 217 of file /github/workspace/projects/sel4test/apps/sel4test-driver/src/main.c</error>
  	</testcase>

The corresponding chunk of code in the test is here:

        uint64_t start = sel4test_timestamp(env);
        seL4_Yield();
        uint64_t end = sel4test_timestamp(env);
        /* calculate diff in ns */
        uint64_t diff = (end - start);
        uint64_t period_ns = period * NS_IN_US;
        if (i > 0) {
            test_geq(diff, period_ns - 2 * NS_IN_MS);
            test_leq(diff, period_ns + 2 * NS_IN_MS);
            if (diff > period_ns) {
                ZF_LOGD("Too late: by %llu us", diff - period_ns);
            } else if (diff < period_ns) {
                ZF_LOGD("Too soon: by %llu us", period_ns - diff);
            }
        }

The bit that fails is test_leq(diff, period_ns + 2 * NS_IN_MS);

lsf37 avatar Nov 09 '23 21:11 lsf37

The seL4_Yield() from line 1016 could move into the loop (see https://github.com/seL4/sel4test/blob/master/apps/sel4test-tests/src/tests/scheduler.c#L1016-L1023) to guarantee we have a full time slice to get the time? This might avoid a race condition, but that should rarely hit. What seem odd is that it's constantly slightly more than 3 ms. Give the board is fast enough, I'd expect to see not 103.044.000us but something like 100.044.000us there.

axel-h avatar Nov 09 '23 22:11 axel-h

Moving the yield won't help this particular issue though, as the time is too long, not too short.

Do we even now for sure that the clock used for scheduling and the clock used for timing are correct and synchronous enough to make valid measurements?

This is again on the somewhat dodgy odroidxu4_1, can't we use odroidxu4_2 for CI instead?

Looking at the raw log I'm not convinced that something weird isn't going on:

2023-11-09T08:15:27.7834661Z Running test SCHED0011 (Test scheduler accuracy)
2023-11-09T08:15:28.4092443Z 		<error>Check diff(103054000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:28.8247532Z 		<error>Check diff(103034000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:29.2414247Z 		<error>Check diff(103056000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:29.6576481Z 		<error>Check diff(103054000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:30.0737663Z 		<error>Check diff(103050000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:30.4905678Z 		<error>Check diff(103050000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:30.9075889Z 		<error>Check diff(103058000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:31.3243316Z 		<error>Check diff(103044000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:31.7411949Z 		<error>Check diff(103041000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:32.1575577Z 		<error>Check diff(103039000) <= period_ns + 2 * (1000ULL * 1000ULL)(102000000) failed. at line 1026 of file /github/workspace/projects/sel4test/apps/sel4test-tests/src/tests/scheduler.c</error>
2023-11-09T08:15:32.1577427Z Test SCHED0011 failed
2023-11-09T08:15:32.1579569Z 		<failure type="failure">result == SUCCESS at line 296 of file /github/workspace/projects/sel4test/apps/sel4test-driver/src/testtypes.c</failure>
2023-11-09T08:15:32.1581885Z 		<error>result == SUCCESS at line 217 of file /github/workspace/projects/sel4test/apps/sel4test-driver/src/main.c</error>
2023-11-09T08:15:32.1585043Z 	</testcase>
2023-11-09T08:15:32.1586060Z 	<testcase classname="sel4test" name="SCHED0012">

The period is 100 ms, so the whole test should take about 1 second, it takes more than 4 seconds instead.

A successful run looks like:

2023-11-08T16:15:30.2571784Z Running test SCHED0011 (Test scheduler accuracy)
2023-11-08T16:15:31.2987898Z Test SCHED0011 passed

Indanz avatar Nov 11 '23 10:11 Indanz

FYI, I've seen this test fail when LibUtilsDefaultZfLogLevel is increased. I assume the extra prints take long enough to exceed the margin.

canarysnort01 avatar Nov 11 '23 12:11 canarysnort01

The printing happens after the measurement in each loop, so this wont affect it.

axel-h avatar Nov 11 '23 12:11 axel-h

I've definitely seen that flag reliably trigger this failure on another platform. If that's not supposed to happen I'll repro it again and file an issue about it.

canarysnort01 avatar Nov 11 '23 12:11 canarysnort01

I also think some of the timer drivers in libplatsupport might introduce some rounding error when converting from/to timer ticks and nanoseconds because they use a pre-calculated number of nanoseconds per timer tick, which due to using integer arithmetic is truncated to a whole integer.

I don't know if this is a problem on this platform or if it would be significant enough to cause any issues, but it seems to me this small error at the nanosecond scale might get large for longer time periods.

canarysnort01 avatar Nov 11 '23 13:11 canarysnort01

The printing happens after the measurement in each loop, so this wont affect it.

Ah, the failure I saw was in a different test, SCHED0021.

canarysnort01 avatar Nov 11 '23 18:11 canarysnort01

The timestamp is added by the logging system and includes the loop time, so it just shows a ridiculously slow (buggy?) UART driver (2 ms per character).

The quality of the timer drivers in libplatsupport is low, but mostly bad overflow handling (I have a local patch to fix it, but didn't get to it yet). The meson timer assumes an 1 MHz clock it seems, if a rounding error is unlikely.

According to the schematics, there's only a 24MHz clock and a 32kHz clock input, so it's unlikely that a different clock source is being used either.

3% overhead for "something" is ridiculous.

Indanz avatar Nov 11 '23 19:11 Indanz