gramine icon indicating copy to clipboard operation
gramine copied to clipboard

LTP test gettimeofday02 failing intermittently with GSGX

Open aniket-intelx opened this issue 4 years ago • 12 comments

Description of the problem

The testcase gettimeofday02 is failing intermittently on OOT (kernel : 5.0.0-23-generic) and DCAP (kernel : 5.9.0-intel-next+) with Ubuntu 18.04 machines with SGX. Log files for both the instances, pass and fail are attached. For convenience purpose, removed duplicate lines shim_gettimeofday(0x7adac80, 0x0) = 0x0 gettimeofday02_TFAIL.txt gettimeofday02_TPASS.txt

Steps to reproduce

Run below mentioned command graphene-sgx gettimeofday02

Expected results

TPASS: gettimeofday() monotonous in 10 seconds

Actual results

TFAIL: Time is going backwards: old 1632371655.582390 vs new 1632371655.582387!

aniket-intelx avatar Sep 23 '21 05:09 aniket-intelx

Why do you say "on OOT and DCAP"? Does it matter? Have you tested on other systems, and there this test never fails at all?

dimakuv avatar Sep 23 '21 09:09 dimakuv

Dmitrii, these machines were enabled recently in the CI and we see the issues only on them. On all other 5.12, 5.13 configurations we don't see any issues with this test. Also, to add the test consistently passes with Graphene native and Linux native and the intermittent failures are only seen with GSGX.

jinengandhi-intel avatar Sep 23 '21 09:09 jinengandhi-intel

Let us know if you need access to the machine for a quick check. We can walk through the issue and the workspace.

jinengandhi-intel avatar Sep 23 '21 09:09 jinengandhi-intel

@aneessahib You changed the "recalibrate" time PR recently, right?

I think what happens is that your PR triggers the "recalibration" (== don't calculate the time from RDTSC but instead read from the Linux host) more often now, every 50ms (or smth like this). And we don't have the logic to verify the read-from-Linux-host timestamp against the previous timestamp. So it is possible periodically that the previous timestamp (calculated via RDTSC) is greater than the current timestamp (read from Linux host), and this LTP test fails.

I suggest we add a code snippet like this:

current_timestamp = <from RDTSC or from Linux-host>;
if (current_timestamp < previous_timestamp);
    current_timestamp = previous_timestamp; /* can't go back in time! */
previous_timestamp = current_timestamp;

dimakuv avatar Sep 23 '21 09:09 dimakuv

We can run this test with the commit and one before that to confirm, if this is indeed the commit causing the issue.

jinengandhi-intel avatar Sep 23 '21 09:09 jinengandhi-intel

The test executes successfully when tried with one commit before, i.e. 39e117b719a052408b94afc0e9152793765eccd5 while it fails intermittently with commit 487969c1dc1b0f27096f9d56312552e3ab6b754d which is "Correct RDTSC instruction drift with system clock more often". Please find the attached log. Verify_gettimeofday02_correct_RDTSC_commit.txt

aniket-intelx avatar Sep 23 '21 11:09 aniket-intelx

For today's nightly analysis, we saw this issue for Ubuntu 20.04 with 5.12 kernel and GSGX

aniket-intelx avatar Sep 24 '21 09:09 aniket-intelx

@dimakuv - i think what you are saying is already being checked. see below.

            if (diff_usec < TSC_REFINE_INIT_TIMEOUT_USECS) {
                /* less than TSC_REFINE_INIT_TIMEOUT_USECS passed from the previous update of
                 * TSC/usec pair (time drift is contained), use the RDTSC-calculated time */
                usec = start_usec + diff_usec;
                if (usec < start_usec)
                    return -PAL_ERROR_OVERFLOW;
            }

we could return usec = start_usec; instead of overflow. But the logs doesnt seem to show it getting into this path. But let me know if this is what you are talking about.

aneessahib avatar Sep 24 '21 09:09 aneessahib

@aneessahib No, I am talking about this part (when we fallback to the OCALL): https://github.com/gramineproject/gramine/blob/487969c1dc1b0f27096f9d56312552e3ab6b754d/Pal/src/host/Linux-SGX/db_misc.c#L114

Here we just return the timestamp that was returned by ocall_gettime() (and slightly re-calculated afterwards).

What I would do is a sanity adjustment like this:

...
    uint64_t tsc_cyc = tsc_cyc1 + (tsc_cyc2 - tsc_cyc1) / 2;
    if (tsc_cyc < tsc_cyc1)
        return -PAL_ERROR_OVERFLOW;

    /* --- my proposed code starts --- */
    static uint64_t previous_usec = 0;
    if (usec < previous_usec) {
        /* new timestamp is "back in time" than the previous timestamp; this can happen
         * if RDTSC time drifted backwards compared to actual host time */
         usec = previous_usec;
    }
    previous_usec = usec;
    /* --- my proposed code ends --- */

    /* refresh the baseline data if no other thread updated g_start_tsc */
    write_seqbegin(&g_tsc_lock);

dimakuv avatar Sep 24 '21 10:09 dimakuv

@jinengandhi-intel @aniket-intelx @aneessahib Is this still relevant?

dimakuv avatar Nov 25 '21 07:11 dimakuv

I actually could not work on this any further after the linked issue was sorted out. @jinengandhi-intel - are we continuing to see this problem?

aneessahib avatar Nov 25 '21 08:11 aneessahib

This testcase is still failing intermittently across various configurations. Some of them are listed below.

  1. DCAP with kernel 5.9.0-intel-next+
  2. RHEL with kernel 5.13.0
  3. CentOS 8 with kernel 5.14.17
  4. Ubuntu 20.04 with kernel 5.12.0-051200-generic

aniket-intelx avatar Nov 25 '21 08:11 aniket-intelx

I cannot reproduce this issue of intermittent failures using Gramine latest master (https://github.com/gramineproject/gramine/commit/cd6a9cca9585110a9bcd5c63dcc75b5c4d49466b) on Ubuntu 20.04 with kernel 6.2.0-rc2 or 5.15.0-58-generic.

@jinengandhi-intel @aniket-intelx @aneessahib Is this still relevant? Thanks!

@dimakuv The proposed sanity check still looks reasonable to me even w/o the issue reproduction, do we want to move it forward? Thanks!

kailun-qin avatar Apr 10 '23 02:04 kailun-qin

@dimakuv The proposed sanity check still looks reasonable to me even w/o the issue reproduction, do we want to move it forward? Thanks!

I'd say this could be even security-relevant? The app running inside the enclave may go crazy if time teleports back (well written apps shouldn't, but I don't think that people know that most OSes have time APIs which don't guarantee monotonicity).

mkow avatar Apr 10 '23 13:04 mkow

@jinengandhi-intel Do we know if this repros on master?

aneessahib avatar Apr 10 '23 14:04 aneessahib

Intermittently we see it at least a few times every week. I checked the data for last week and we saw it once for Debian11 and most other times its for Ubuntu18.04. One more thing, the issue was first created and found on older kernels like 5.12 and 5.13 but since then we have moved to newer kernels like 5.19 and 6.0 and we still see the issue.

jinengandhi-intel avatar Apr 10 '23 15:04 jinengandhi-intel