LTP test gettimeofday02 failing intermittently with GSGX
Description of the problem
The test case gettimeofday02 fails intermittently on OOT (kernel: 5.0.0-23-generic) and DCAP (kernel: 5.9.0-intel-next+) Ubuntu 18.04 machines with SGX.
Log files for both instances, pass and fail, are attached. For convenience, duplicate shim_gettimeofday(0x7adac80, 0x0) = 0x0 lines were removed.
gettimeofday02_TFAIL.txt
gettimeofday02_TPASS.txt
Steps to reproduce
Run the command below:
graphene-sgx gettimeofday02
Expected results
TPASS: gettimeofday() monotonous in 10 seconds
Actual results
TFAIL: Time is going backwards: old 1632371655.582390 vs new 1632371655.582387!
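For reference, the check the test performs boils down to verifying that consecutive gettimeofday() readings never decrease over ~10 seconds. A minimal standalone sketch of that kind of check (not the actual LTP source; the loop structure and messages are only illustrative):

#include <stdio.h>
#include <sys/time.h>

/* Standalone illustration: read the wall clock in a tight loop for ~10 seconds
 * and fail if any reading is earlier than the previous one. */
int main(void) {
    struct timeval start, prev, cur;

    if (gettimeofday(&prev, NULL))
        return 1;
    start = prev;

    do {
        if (gettimeofday(&cur, NULL))
            return 1;
        if (cur.tv_sec < prev.tv_sec ||
            (cur.tv_sec == prev.tv_sec && cur.tv_usec < prev.tv_usec)) {
            printf("Time is going backwards: old %ld.%06ld vs new %ld.%06ld!\n",
                   (long)prev.tv_sec, (long)prev.tv_usec,
                   (long)cur.tv_sec, (long)cur.tv_usec);
            return 1;
        }
        prev = cur;
    } while (cur.tv_sec - start.tv_sec < 10);

    printf("gettimeofday() monotonous in 10 seconds\n");
    return 0;
}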
Why do you say "on OOT and DCAP"? Does it matter? Have you tested on other systems, and does this test never fail there at all?
Dmitrii, these machines were enabled recently in the CI, and we see the issues only on them. On all other 5.12 and 5.13 configurations we don't see any issues with this test. Also, to add: the test consistently passes with Graphene native and Linux native; the intermittent failures are only seen with GSGX.
Let us know if you need access to the machine for a quick check. We can walk through the issue and the workspace.
@aneessahib You changed the "recalibrate" time PR recently, right?
I think what happens is that your PR triggers the "recalibration" (== don't calculate the time from RDTSC but instead read from the Linux host) more often now, every 50ms (or something like this). And we don't have the logic to verify the read-from-Linux-host timestamp against the previous timestamp. So it is possible periodically that the previous timestamp (calculated via RDTSC) is greater than the current timestamp (read from Linux host), and this LTP test fails.
I suggest we add a code snippet like this:
current_timestamp = <from RDTSC or from Linux-host>;
if (current_timestamp < previous_timestamp)
    current_timestamp = previous_timestamp; /* can't go back in time! */
previous_timestamp = current_timestamp;
We can run this test with the commit and the one before it to confirm whether this is indeed the commit causing the issue.
The test executes successfully with the preceding commit, i.e. 39e117b719a052408b94afc0e9152793765eccd5, while it fails intermittently with commit 487969c1dc1b0f27096f9d56312552e3ab6b754d ("Correct RDTSC instruction drift with system clock more often").
Please find the attached log.
Verify_gettimeofday02_correct_RDTSC_commit.txt
In today's nightly analysis, we saw this issue on Ubuntu 20.04 with a 5.12 kernel and GSGX.
@dimakuv - I think what you are saying is already being checked, see below.
if (diff_usec < TSC_REFINE_INIT_TIMEOUT_USECS) {
    /* less than TSC_REFINE_INIT_TIMEOUT_USECS passed from the previous update of
     * TSC/usec pair (time drift is contained), use the RDTSC-calculated time */
    usec = start_usec + diff_usec;
    if (usec < start_usec)
        return -PAL_ERROR_OVERFLOW;
}
We could return usec = start_usec; instead of the overflow error. But the logs don't seem to show execution getting into this path. Let me know if this is what you are talking about.
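For clarity, the alternative I mean would look roughly like this (a sketch of the same snippet with a clamp instead of the error return; it uses the same variables as above):

if (diff_usec < TSC_REFINE_INIT_TIMEOUT_USECS) {
    /* time drift is contained, use the RDTSC-calculated time; on wrap-around,
     * clamp to the baseline instead of failing (sketch of the alternative) */
    usec = start_usec + diff_usec;
    if (usec < start_usec)
        usec = start_usec;
}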
@aneessahib No, I am talking about this part (when we fallback to the OCALL): https://github.com/gramineproject/gramine/blob/487969c1dc1b0f27096f9d56312552e3ab6b754d/Pal/src/host/Linux-SGX/db_misc.c#L114
Here we just return the timestamp that was returned by ocall_gettime() (and slightly re-calculated afterwards).
What I would do is a sanity adjustment like this:
...
uint64_t tsc_cyc = tsc_cyc1 + (tsc_cyc2 - tsc_cyc1) / 2;
if (tsc_cyc < tsc_cyc1)
    return -PAL_ERROR_OVERFLOW;

/* --- my proposed code starts --- */
static uint64_t previous_usec = 0;
if (usec < previous_usec) {
    /* the new timestamp is earlier than the previous one; this can happen if the
     * RDTSC-calculated time drifted relative to the actual host time */
    usec = previous_usec;
}
previous_usec = usec;
/* --- my proposed code ends --- */

/* refresh the baseline data if no other thread updated g_start_tsc */
write_seqbegin(&g_tsc_lock);
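One caveat with the snippet above: previous_usec would be shared across threads, so the update likely has to be atomic (or done under g_tsc_lock). A minimal self-contained sketch, assuming GCC/Clang __atomic builtins are acceptable here (the helper name is illustrative, not existing Gramine code):

#include <stdbool.h>
#include <stdint.h>

/* Clamp `usec` so that the reported time never goes backwards, even with
 * concurrent callers: a CAS loop keeps a running maximum of reported time. */
static uint64_t clamp_monotonic_usec(uint64_t usec) {
    static uint64_t last_reported_usec = 0;
    uint64_t last = __atomic_load_n(&last_reported_usec, __ATOMIC_ACQUIRE);
    do {
        if (usec <= last)
            return last; /* never report a timestamp earlier than the last one */
    } while (!__atomic_compare_exchange_n(&last_reported_usec, &last, usec,
                                          /*weak=*/true, __ATOMIC_RELEASE,
                                          __ATOMIC_ACQUIRE));
    return usec;
}

The call site would then simply be usec = clamp_monotonic_usec(usec); right before refreshing the baseline data.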
@jinengandhi-intel @aniket-intelx @aneessahib Is this still relevant?
I actually could not work on this any further after the linked issue was sorted out. @jinengandhi-intel - are we continuing to see this problem?
This testcase is still failing intermittently across various configurations. Some of them are listed below.
- DCAP with kernel 5.9.0-intel-next+
- RHEL with kernel 5.13.0
- CentOS 8 with kernel 5.14.17
- Ubuntu 20.04 with kernel 5.12.0-051200-generic
I cannot reproduce this issue of intermittent failures using Gramine latest master (https://github.com/gramineproject/gramine/commit/cd6a9cca9585110a9bcd5c63dcc75b5c4d49466b) on Ubuntu 20.04 with kernel 6.2.0-rc2 or 5.15.0-58-generic.
@jinengandhi-intel @aniket-intelx @aneessahib Is this still relevant? Thanks!
@dimakuv The proposed sanity check still looks reasonable to me even w/o the issue reproduction, do we want to move it forward? Thanks!
I'd say this could even be security-relevant? The app running inside the enclave may go crazy if time teleports backwards (well-written apps shouldn't, but I don't think people are aware that most OSes have time APIs which don't guarantee monotonicity).
@jinengandhi-intel Do we know if this repros on master?
We see it intermittently, at least a few times every week. I checked the data for the last week: we saw it once on Debian 11, and most other times on Ubuntu 18.04. One more thing: the issue was first found and filed on older kernels like 5.12 and 5.13, but since then we have moved to newer kernels like 5.19 and 6.0 and we still see the issue.