[Test] Fix and enable 10 tests disabled with PR 789

Open hukoyu opened this issue 4 years ago • 14 comments

The tests below were disabled by PR: https://github.com/lsds/sgx-lkl/pull/789/files

  • [ ] gettimeofday02
  • [ ] mmap11
  • [ ] futex_cmp_requeue01
  • [ ] getcwd04
  • [ ] send01
  • [ ] setresuid04
  • [ ] setreuid07
  • [ ] symlink01
  • [ ] fstat03 (Disabled in https://github.com/lsds/sgx-lkl/pull/812)
  • [ ] chroot03 (Disabled in https://github.com/lsds/sgx-lkl/pull/812)

Fix the cause of each failure and re-enable the tests. cc @KenGordon @SeanTAllen @davidchisnall @vtikoo @paulcallen

hukoyu avatar Aug 15 '20 19:08 hukoyu

gettimeofday02 passing previously would appear to be happenstance related to async signal bugs. Any test that depends on the delivery of an async signal is likely to fail. gettimeofday02 should either be heavily patched or left disabled with a note to enable once async signal handling is fixed (which is a p0 issue).

The test in its current state hangs because the alarm that should stop it "isn't being delivered", which is a known open issue.

https://github.com/lsds/sgx-lkl/issues/209
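To illustrate the dependency described above, here is a minimal sketch of an alarm-bounded loop. It is not the LTP source: `count_backward_steps` and `on_alarm` are hypothetical names, and the loop body is only an assumption about what such a test does. The point is that the loop's sole exit condition is a flag set by the SIGALRM handler, so if the signal is never delivered the function never returns.

```c
#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

/* Set by the SIGALRM handler; the loop below ends only when this flips. */
static volatile sig_atomic_t done;

static void on_alarm(int sig)
{
    (void)sig;
    done = 1;
}

/*
 * Hypothetical helper: call gettimeofday() in a loop for `seconds` seconds
 * and count how often the clock appears to step backwards.
 * Returns the count, or -1 if setup or gettimeofday() fails.
 * If SIGALRM is never delivered, this function never returns.
 */
long count_backward_steps(unsigned int seconds)
{
    struct timeval prev, cur;
    long backwards = 0;

    /* signal() keeps the sketch short; sigaction() is the sturdier choice. */
    if (signal(SIGALRM, on_alarm) == SIG_ERR)
        return -1;

    done = 0;
    if (gettimeofday(&prev, NULL) != 0)
        return -1;

    alarm(seconds); /* asynchronous signal: the only way out of the loop */

    while (!done) {
        if (gettimeofday(&cur, NULL) != 0)
            return -1;
        if (cur.tv_sec < prev.tv_sec ||
            (cur.tv_sec == prev.tv_sec && cur.tv_usec < prev.tv_usec))
            backwards++;
        prev = cur;
    }
    return backwards;
}
```

On a system with working signal delivery, `count_backward_steps(1)` returns after roughly one second.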

Is there somewhere we want to record that gettimeofday02 should be re-enabled when #209 is closed?

SeanTAllen avatar Aug 17 '20 12:08 SeanTAllen

mmap11 "failure" is unrelated to the functionality under test. It appears to be a deterministic shutdown hang. For me, it happens with both hw and sw modes.

Given that https://github.com/lsds/sgx-lkl/pull/788 exists and will change the shutdown sequence, I propose waiting to address the mmap11 failure until 788 is merged.

from @vtikoo:

mmap11 creates a detached pthread - https://github.com/lsds/ltp/blob/sgx-lkl/testcases/kernel/syscalls/mmap/mmap11.c#L101. There's an open p0 for fixing detached thread support: #779.

SeanTAllen avatar Aug 17 '20 12:08 SeanTAllen

futex_cmp_requeue01 hangs because

while (thread_cnt < tc->num_waiters) {
    sched_yield();
}

never exits.

Here's the full test:

https://github.com/lsds/ltp/blob/sgx-lkl/testcases/kernel/syscalls/futex/futex_cmp_requeue01.c
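For readers without the test source open, the pattern behind that loop can be sketched as follows. This is a simplified stand-in, not the LTP code: `wait_for_waiters`, `waiter`, and `MAX_WAITERS` are hypothetical, and it assumes each waiter thread simply increments a shared counter on startup. The spinning thread makes progress only if `sched_yield()` actually lets the waiters run.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_WAITERS 16 /* arbitrary limit for this sketch */

/* Number of waiter threads that have started running. */
static atomic_int thread_cnt;

static void *waiter(void *arg)
{
    (void)arg;
    atomic_fetch_add(&thread_cnt, 1); /* announce "I am running" */
    return NULL;
}

/*
 * Start `num_waiters` threads, then yield in a loop until every started
 * thread has checked in. Returns 0 on success, -1 on a bad argument or
 * if a thread could not be created.
 */
int wait_for_waiters(int num_waiters)
{
    pthread_t tids[MAX_WAITERS];
    int started = 0;
    int rc = 0;

    if (num_waiters < 0 || num_waiters > MAX_WAITERS)
        return -1;

    atomic_store(&thread_cnt, 0);

    for (; started < num_waiters; started++) {
        if (pthread_create(&tids[started], NULL, waiter, NULL) != 0) {
            rc = -1;
            break;
        }
    }

    /* Same shape as the loop in the test: spin until all waiters ran. */
    while (atomic_load(&thread_cnt) < started)
        sched_yield();

    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    return rc;
}
```

If the scheduler never switches away from the spinning thread, the counter never reaches the target and the loop does not exit, which matches the hang described here.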

SeanTAllen avatar Aug 17 '20 13:08 SeanTAllen

A PR has been opened to address the write05 test:

https://github.com/lsds/ltp/pull/73

SeanTAllen avatar Aug 17 '20 13:08 SeanTAllen

Regarding futex_cmp_requeue01, sched_yield now goes via LKL instead of directly calling lthread_yield - https://github.com/lsds/sgx-lkl-musl/pull/18/files#diff-687e538b71be7b81c2d4ddf641470487.

This could be a regression. Is this a deterministic failure?

vtikoo avatar Aug 17 '20 13:08 vtikoo

Regarding futex_cmp_requeue01, sched_yield now goes via LKL instead of directly calling lthread_yield - https://github.com/lsds/sgx-lkl-musl/pull/18/files#diff-687e538b71be7b81c2d4ddf641470487.

This could be a regression. Is this a deterministic failure?

@vtikoo it is for me.

SeanTAllen avatar Aug 17 '20 13:08 SeanTAllen

Regarding futex_cmp_requeue01, sched_yield now goes via LKL instead of directly calling lthread_yield - https://github.com/lsds/sgx-lkl-musl/pull/18/files#diff-687e538b71be7b81c2d4ddf641470487.

This could be a regression. Is this a deterministic failure?

I indeed believe that this is causing problems. I think it's one of the reasons why I see shutdown issues with DotNet here: https://github.com/lsds/sgx-lkl/pull/788#issue-467907462

prp avatar Aug 17 '20 13:08 prp

getcwd04 exits early because it checks that there is more than one CPU.

if (tst_ncpus() == 1)
    tst_brk(TCONF, "This test needs two cpus at least");

If that were fixed by removing the CPU-count check (I don't think it is needed given how we patched it to use threads), the test would then fail because it relies on the delivery of an asynchronous signal. It should be possible to re-enable it once #209 is fixed.
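As a rough illustration of the guard, the sketch below reproduces its logic outside LTP. `online_cpus` and `needs_skip` are hypothetical helpers, and using `sysconf(_SC_NPROCESSORS_ONLN)` as a stand-in for `tst_ncpus()` is an assumption rather than something confirmed in this thread.

```c
#include <unistd.h>

/*
 * Hypothetical stand-in for LTP's tst_ncpus(): number of online CPUs as
 * reported by the C library, clamped to at least 1 if the query fails.
 */
long online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n < 1 ? 1 : n;
}

/*
 * Mirrors the guard in the test: returns 1 when the test would bail out
 * with TCONF (exactly one CPU), 0 otherwise.
 */
int needs_skip(long ncpus)
{
    return ncpus == 1;
}
```

Calling `needs_skip(online_cpus())` in an environment that reports a single CPU yields 1, which corresponds to the early exit seen here.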

SeanTAllen avatar Aug 17 '20 13:08 SeanTAllen

setresuid04 and setreuid07 are working and can be re-enabled.

SeanTAllen avatar Aug 17 '20 13:08 SeanTAllen

@prp there seem to be multiple ways in which cloned host tasks can hang. Could you clarify whether you think the DotNet failures are specifically related to sched_yield or to cloned host task hangs in general?

vtikoo avatar Aug 17 '20 13:08 vtikoo

In most cases, I see a deadlock in which the termination thread fails to obtain a CPU lock for syscalls but nothing else is running. The DotNet hang is different: one of the DotNet userspace threads keeps invoking sched_yield() and making futex calls, while the termination thread is waiting for the CPU lock.

prp avatar Aug 17 '20 14:08 prp

send01 is failing because it hangs. There's a call to select in a thread that never returns. I'm not sure why it was passing previously.

A couple of things that won't work as a fix right now:

  • Using pthread_cancel to cancel the thread. Currently it segfaults. pthread_cancel uses signals, so it's problematic.
  • Setting a timeout on the select call. It doesn't always time out, for reasons that I haven't looked into yet.
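For reference, the second option would conventionally look like the sketch below. `wait_readable` is a hypothetical helper, not code from send01. On a conforming system `select` returns 0 once the timeout expires; the problem reported here is that this expiry does not always happen.

```c
#include <sys/select.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to `timeout_ms` milliseconds for `fd` to
 * become readable.
 * Returns 1 if readable, 0 if the timeout expired, -1 on error.
 */
int wait_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int rc;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* With a non-NULL timeout, select() should return 0 when it expires. */
    rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}
```

A quick way to exercise it is to call it on the read end of an empty pipe, which should time out, and again after writing a byte to the write end, which should report the descriptor as readable.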

SeanTAllen avatar Aug 17 '20 14:08 SeanTAllen

write05 patched with https://github.com/lsds/ltp/pull/73 and then with lsds/ltp#74

hukoyu avatar Aug 17 '20 18:08 hukoyu

Tests fstat03 and chroot03 are also failing due to the SIGSEGV (page fault) signal issue. These two were not caught before because there were build errors and the test binaries were not generated (created issue https://github.com/lsds/sgx-lkl/issues/810 to track that). Fixed the build issue with https://github.com/lsds/ltp/pull/74 and disabled these two tests in PR https://github.com/lsds/sgx-lkl/pull/812

hukoyu avatar Aug 20 '20 17:08 hukoyu