sgx-lkl
CI is avoidably slow and inefficient
CI speed is a limiting factor for merging PRs. We require PRs to pass CI and be approved, but once one PR has merged the next PR to merge must wait for CI to re-run to ensure that it was not broken by the prior one.
We are currently running the LTP tests on two VMs. These are 4-core VMs, but we are running only one test at a time. We could split the LTP tests into more parts and run on more VMs, but then the building becomes the bottleneck.
A complete plan should:
- [x] Run the build on the non-SGX pool and produce artefacts for other stages.
- [x] Run the main test suite in software mode on the non-SGX pool.
- [x] Run the main test suite in hardware mode on the SGX pool.
- [ ] Run the LTP tests in parallel (these tests are small so we ought to be able to run 2-4 of them in parallel on a single 4-core VM).
If this is still slow, we can then split the LTP and non-LTP test suites into finer groupings to run more in parallel.
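As a sketch of the last item, the per-test runner could be fanned out with `xargs -P`. Everything here is illustrative: the test names and the `echo` stand-in are placeholders, not the repo's actual layout — in CI, the stand-in would be replaced with the real per-test runner (e.g. `tests/ltp/run_ltp_test.sh`) and the list with the actual LTP test list.

```shell
# Illustrative only: fan LTP-style test cases out across parallel jobs.
# The test names below are placeholders; swap in the real test list and
# replace the echo stand-in with the actual per-test runner script.
printf '%s\n' access01 chmod01 creat01 open01 > tests.txt
# -n 1: one test name per invocation; -P 2: run two test cases at a time
< tests.txt xargs -n 1 -P 2 sh -c 'echo "running $0"'
```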
I'm happy to take the first three items, but I would appreciate it if someone could work on the last item (parallelising the LTP tests).
For the first three, I would use the Debian package as the artifact.
@letmaik, please help with the first three.
The EPC memory size specified for the LTP tests is 512 MB. We should experiment with the threading to find out how much time parallelism will actually save in the face of EPC paging.
EPC size is fixed by hardware; I guess you mean heap size? Looking at the LTP script, it seems it's actually the LKL kernel memory size that is set: https://github.com/lsds/sgx-lkl/blob/f0d4b2a54e45478d358a4c4ec922c81b8133446c/tests/ltp/run_ltp_test.sh#L59 Does this make sense? Do the tests really need that much?
We are running the LKL kernel inside the enclave. That effectively means EPC memory, right? Per a previous discussion, that amount of memory is required when we create a loop device for an LTP test case.
@jxyang I don't think that it's a good idea to increase LKL memory in this way for the LTP tests. Essentially this will mask other problems where particular tests/syscalls would otherwise run out of LKL memory. In a normal configuration, LKL memory is only 32MB (or 64MB), and we should stick with that (until we integrate the kernel and enclave memory allocators).
The correct solution here is to update the LTP test not to use a loopback block device but instead create a regular image file on the root file system.
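A rough sketch of that approach, with illustrative paths and sizes (the real test would create the image wherever its root file system lives, and the size would come from the test's requirements):

```shell
# Illustrative only: back the test's filesystem with a regular image file on
# the root fs rather than a kernel loopback block device.
dd if=/dev/zero of=test-fs.img bs=1M count=64 2>/dev/null  # 64 MiB image file
mkfs.ext4 -q -F test-fs.img  # format the regular file directly; no losetup
```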
> We are running the LKL kernel inside the enclave. That effectively means EPC memory, right?
Not exactly. The tests have a 1 GB allocation that can be paged into EPC. EPC needs to be large enough to contain their working set if we are to get good performance. For most of the LTP tests, I would imagine that the working set is very small (considerably smaller than the total available EPC size). I expect some parallelism will show a speedup, but if we increase parallelism such that EPC is exhausted and swapping occurs, then we will see a significant slowdown (a factor of 10 or so) which will completely offset the win from parallelism. This is why I suggested 2-4 parallel jobs rather than more. We need to run some experiments to determine the sweet spot (and possibly mark some tests as very memory-intensive).
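A toy harness for that experiment: the `sleep` stand-ins model test cases, and on real SGX hardware the stand-in would be swapped for the actual LTP runner so that EPC paging shows up in the timings.

```shell
# Illustrative only: time the same batch of "tests" at several parallelism
# levels. On SGX hardware, the level where wall-clock time stops improving
# (or regresses because EPC paging kicks in) is the sweet spot.
for p in 1 2 4; do
  start=$(date +%s)
  printf '%s\n' t1 t2 t3 t4 | xargs -n 1 -P "$p" sh -c 'sleep 1'
  echo "parallelism=$p: $(( $(date +%s) - start ))s"
done
```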
> Per a previous discussion, that amount of memory is required when we create a loop device for an LTP test case.
Having a loopback filesystem will increase EPC pressure. As @prp points out, using an unencrypted external filesystem will likely have a lower memory impact.
Note that, as part of the cmake work (see the wip-cmake branch), I would eventually like to move the tests over to using ctest as a test runner. This will give us parallel test execution for free (and, in particular, will allow us to build the tests in parallel independently of running them in parallel), so it might not be worth doing anything custom. In fact, it might be worth adding cmake / ctest for the test suite independently of moving the build system over: since we want to run the tests on a system without a build, we could have a separate CMake build for the tests, which the main CMake build can eventually incorporate, and which provides paths in the build directory for running tests locally.
Points 1-3 are done in #159 ~~and I think point 4 is not necessary anymore, as it turns out running an individual leg of LTP tests is rather fast and the bottleneck is now our own tests (notably the openvino test, which takes long to build if the Docker image wasn't cached).~~ EDIT: the LTP tests weren't running, see https://github.com/lsds/sgx-lkl/pull/173. Point 4 will still be useful.