
[RFC] New interface for calling host-level syscalls

Open boryspoplawski opened this issue 4 years ago • 13 comments

Description

For reasons why we would want this, please look at the pros & cons below.

The new version would change the way blocking host-level syscalls are done: the non-blocking, or rather "fast", syscalls would still be called directly (same way as now), while the "slow" syscalls wouldn't be called directly, but via a helper thread. We would consider futex with FUTEX_WAKE to be a "fast" syscall and read to be a "slow" syscall. E.g. it could look like this (see the sketch after the list):

  1. We want to issue a host-level syscall read(fd, buf, size)
  2. Write syscall number and arguments in some shared buffer.
  3. Wake-up a dedicated host helper thread (probably using futexes).
  4. Wait for helper thread being done - sleep using special sleeping function.
  5. Helper thread issues a syscall, saves the return value, wakes up the original thread and goes back to sleep.
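To make the handshake concrete, here is a minimal C sketch of steps 2-5 (illustrative only, not actual Graphene code: struct syscall_req, the state constants and all helper names are made up; the signal-awareness of the sleep in step 4 is sketched separately below):

#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { IDLE, PENDING, DONE };

struct syscall_req {
    atomic_int state;  /* IDLE -> PENDING -> DONE; doubles as the futex word */
    long nr;           /* host syscall number */
    long args[6];      /* host syscall arguments */
    long ret;          /* return value, filled in by the helper thread */
};

static void futex_wake(atomic_int* word) {
    syscall(SYS_futex, word, FUTEX_WAKE, 1, NULL, NULL, 0);
}

static void futex_wait(atomic_int* word, int expected) {
    syscall(SYS_futex, word, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Requesting (Graphene) thread: steps 2-4. */
static long slow_syscall(struct syscall_req* req, long nr,
                         long a0, long a1, long a2) {
    req->nr = nr;                                /* step 2: fill the buffer */
    req->args[0] = a0; req->args[1] = a1; req->args[2] = a2;
    atomic_store(&req->state, PENDING);
    futex_wake(&req->state);                     /* step 3: wake the helper */
    while (atomic_load(&req->state) != DONE)     /* step 4: go to sleep */
        futex_wait(&req->state, PENDING);
    atomic_store(&req->state, IDLE);
    return req->ret;
}

/* Dedicated host helper thread: step 5. */
static void* helper_loop(void* arg) {
    struct syscall_req* req = arg;
    for (;;) {
        int s;
        while ((s = atomic_load(&req->state)) != PENDING)
            futex_wait(&req->state, s);          /* sleep until woken */
        req->ret = syscall(req->nr, req->args[0], req->args[1], req->args[2],
                           req->args[3], req->args[4], req->args[5]);
        atomic_store(&req->state, DONE);
        futex_wake(&req->state);                 /* wake the original thread */
    }
    return NULL;
}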

The assumption is that host-level threads are cheap and we could have one for each Graphene thread (which would basically double the number of host threads).

If an interrupt (signal) comes, then we notify the helper thread about it and wait for it to finish the job (it either goes back to sleep or writes the return values, if the syscall was already completed). Why even bother with all of this and not just call syscalls directly? To use the special sleeping function (basically a futex, but that's not really important), which would be aware of Graphene-level signals.
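Building on the sketch above, the special sleeping function could look roughly like this (again purely hypothetical: graphene_signal_pending and notify_helper_about_interrupt stand in for whatever mechanism flags and forwards Graphene-level signals):

/* Hypothetical per-thread flag, set when a Graphene-level signal is pending. */
extern atomic_int graphene_signal_pending;

/* Hypothetical: ask the helper to abort/finish the in-flight syscall. */
void notify_helper_about_interrupt(struct syscall_req* req);

static long signal_aware_sleep(struct syscall_req* req) {
    while (atomic_load(&req->state) != DONE) {
        if (atomic_load(&graphene_signal_pending)) {
            /* The helper either aborts the syscall or, if it already
             * completed, just reports the result; either way it moves
             * state to DONE and wakes us up. */
            notify_helper_about_interrupt(req);
        }
        /* FUTEX_WAIT returns with EINTR when a host signal is delivered to
         * this thread, so the pending-signal flag is re-checked on every
         * wakeup - this is what makes the sleep signal-aware. */
        futex_wait(&req->state, PENDING);
    }
    return req->ret;
}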

Pros:

  • We would be able to handle signals coming at any moment, even when we are inside LibOS/PAL before issuing the syscall (currently we can block in such cases).
  • This would simplify signal handling, especially in Linux-SGX PAL (e.g. no need for weird EINTR injection and losing PAL state in case of ocall interrupt).

Cons:

  • Double the number of threads.
  • Some additional overhead on each "slow" syscall. I'm not sure how much that would slow down execution, but I suspect not much - this needs to be determined empirically. Note that we would do this only on "slow" syscalls, so it shouldn't be that bad (as they are "slow" anyway). The most (only?) noticeable overhead would be for "slow" syscalls that actually do not block, e.g. read on a file descriptor with some data already available to be read.

Note that in the case of the Linux-SGX PAL we would do all of this in the untrusted part, so the overhead would probably be negligible.

Idea v2

Another approach could be wrapping each blocking syscall with something like this:

xor r11, r11                         ; r11 := 0
xchg r11, [some_per_thread_variable] ; atomically fetch-and-clear; this variable would be set to 1 by signal handling routines
cmp r11, 0                           ; was a signal flagged before we got here?
jne .skip                            ; yes -> skip the host syscall entirely
syscall                              ; no -> issue the host syscall
jmp .syscall_done
.skip:
mov rax, -EINTR                      ; pretend the syscall was interrupted
.syscall_done:

some_per_thread_variable would be set by LibOS code in an appropriate upcall iff we were interrupted inside LibOS or PAL code. What I don't like about this approach:

  • There is a 3-instruction window during which we can still miss a signal (between the xchg and the syscall). The window could probably be narrowed down to 2 instructions, e.g. by replacing xchg and cmp with sub r11, [addr], but that does not solve the issue.
  • This would require accessing an untrusted variable some_per_thread_variable from LibOS. While this can be done in a secure manner (e.g. by writing inline asm), the idea doesn't sound nice and sets a dangerous precedent. Besides, we would need to provide an interface for such access just because SGX needs it.

I personally dislike this idea even more than the first one.

boryspoplawski avatar Feb 26 '21 23:02 boryspoplawski

Not sure if I understand this correctly, so let me try to recap: This is in order to handle Graphene signals in the "special sleeping function" instead of in a signal handler, right? The main thread will get interrupted during sleep, and notify LibOS, while the helper thread will continue uninterrupted. And we can block signals in our code (Pal/LibOS) except for this function.

Apart from the performance overhead, this sounds like trouble from a complexity/maintenance point of view: more things can go wrong, it's harder to debug code because it involves multiple threads, etc.

Could the same be achieved in-thread, by making our syscall sites "Graphene signal aware"? I.e. when we receive an EINTR on something like read, we not only retry the system call in a loop (as we do now), but also notify LibOS/the application. In other words, instead of a signal-aware sleep function, we could have a signal-aware syscall wrapper.
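Roughly, such a wrapper might look like this (a hypothetical sketch using read; have_pending_graphene_signal stands in for whatever check LibOS would expose):

#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical: true if LibOS has a Graphene-level signal queued for us. */
bool have_pending_graphene_signal(void);

static long signal_aware_read(int fd, void* buf, size_t size) {
    for (;;) {
        long ret = syscall(SYS_read, fd, buf, size);
        if (ret >= 0 || errno != EINTR)
            return ret;                /* completed, or a real error */
        if (have_pending_graphene_signal())
            return -EINTR;             /* let LibOS deliver the signal */
        /* interrupted by an unrelated host signal: just retry */
    }
}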

pwmarcz avatar Feb 28 '21 13:02 pwmarcz

Not sure if I understand this correctly, so let me try to recap: This is in order to handle Graphene signals in the "special sleeping function" instead of in a signal handler, right? The main thread will get interrupted during sleep, and notify LibOS, while the helper thread will continue uninterrupted. And we can block signals in our code (Pal/LibOS) except for this function.

No, that part of signals is already reworked: we are no longer handling signals in LibOS/PAL, they are only handled when returning from Graphene to the user app. More info: https://github.com/oscarlab/graphene/blob/master/LibOS/shim/include/shim_internal.h#L156

Apart from the performance overhead, this sounds like trouble from a complexity/maintenance point of view: more things can go wrong, it's harder to debug code because it involves multiple threads, etc.

You could say that about any non-trivial change... Correctness > simplicity, I guess? Ofc only to some extent.

Could the same be achieved in-thread, by making our syscall sites "Graphene signal aware"? I.e. when we receive an EINTR on something like read, we not only retry the system call in a loop (as we do now), but also notify LibOS/the application. In other words, instead of a signal-aware sleep function, we could have a signal-aware syscall wrapper.

That is not possible. If a signal arrives before the syscall, we don't want to issue the syscall at all, and checking for a pending signal plus issuing the syscall cannot be done atomically. Note that the EINTR case you mentioned is the desirable one: we can check whether a Graphene signal arrived and then reissue the syscall or just return EINTR to user code (this we already do).

Could the same be achieved in-thread, by making our syscall sites "Graphene signal aware"?

Actually, this proposal is exactly my approach to making syscall sites "Graphene signal aware". Non-multithreaded ideas are welcome.

boryspoplawski avatar Feb 28 '21 16:02 boryspoplawski

Borys, could you provide some more background? I'm always getting lost in our signal handling...

  1. Are you talking about asynchronous signals only (SIGTERM and SIGCONT)?

  2. What happens in Graphene (in your proposal) when a "fast syscall emulation" is interrupted by a signal somewhere in PAL (before or after issuing the host syscall)? I guess the PAL just proceeds to completion of such emulation?

  3. Could you remind me why injecting -EINTR before the "host slow syscall" is a bad idea? Doesn't it just work (because there was no host-level syscall, so no real irrevocable change in the environment)?

  4. Could you remind me why inspecting the syscall results after the "host slow syscall" has finished is a bad idea? In other words, why can't we implement special cases for each of the used syscalls (OCALLs)?

Other than that, I have two notes:

  • Multi-threaded syscall emulation will be indeed very hard to debug. I agree with Pawel.
  • We already have a similar design for Exitless; I don't know if you're familiar with this feature: https://github.com/oscarlab/graphene/blob/master/Pal/src/host/Linux-SGX/rpc_queue.h and https://github.com/oscarlab/graphene/blob/master/Pal/src/host/Linux-SGX/enclave_ocalls.c#L51. With your proposed design, this feature will break (so it will need some modifications).

Do you think this comment is wrong and we can't really do this: https://github.com/oscarlab/graphene/blob/master/Pal/src/host/Linux-SGX/enclave_ocalls.c#L15-L18

dimakuv avatar Mar 01 '21 10:03 dimakuv

1. Are you talking about asynchronous signals only (SIGTERM and SIGCONT)?

Yes (basically only SIGCONT is of any interest), but note that such a signal arrives on each Graphene signal.

2. What happens in Graphene (in your proposal) when a "fast syscall emulation" is interrupted by a signal somewhere in PAL (before or after issuing the host syscall)? I guess the PAL just proceeds to completion of such emulation?

Such a signal is completely ignored, as we are returning to the user app soon anyway (and the signal will be handled then).

3. Could you remind me why injecting `-EINTR` before the "host slow syscall" is a bad idea? Doesn't it just work (because there was no host-level syscall, so no real irrevocable change in the environment)?

How do you want to inject EINTR and into what?

4. Could you remind me why inspecting the syscall results after the "host slow syscall" has finished is a bad idea? In other words, why can't we implement special cases for each of the used syscalls (OCALLs)?

I do not understand this question. The problem is a signal (both Graphene- and host-level; it's the same thing in this context) arriving e.g. just before issuing a blocking ocall.

Other than that, I have two notes:

* Multi-threaded syscall emulation will be indeed very hard to debug. I agree with Pawel.

I think it should be ok, outside of some corner cases and debugging this feature itself.

* We already have a similar design for Exitless; I don't know if you're familiar with this feature: https://github.com/oscarlab/graphene/blob/master/Pal/src/host/Linux-SGX/rpc_queue.h and https://github.com/oscarlab/graphene/blob/master/Pal/src/host/Linux-SGX/enclave_ocalls.c#L51. With your proposed design, this feature will break (so it will need some modifications).

The design is similar, but not the same:

  • Here the waiting thread will sleep normally (not busy wait).
  • Here everything happens in untrusted part.
  • I don't think it will break - why would it?

Do you think this comment is wrong and we can't really do this: https://github.com/oscarlab/graphene/blob/master/Pal/src/host/Linux-SGX/enclave_ocalls.c#L15-L18

Yes, mostly:

  • The mentioned comment is only about handling interrupts in the untrusted part (while this proposal fixes it in all parts).
  • I don't see this being possible in the way described in the comment: we cannot unwind arbitrary code. Even the mere check whether we arrived before or after the syscall instruction seems undecidable. I had some variation of such a scheme in mind; maybe I should write it down here as well (it was quite complex and required reading a variable in untrusted memory from LibOS).

boryspoplawski avatar Mar 01 '21 13:03 boryspoplawski

I don't see this being possible in the way described in the comment: we cannot unwind arbitrary code. Even the mere check whether we arrived before or after the syscall instruction seems undecidable. I had some variation of such a scheme in mind; maybe I should write it down here as well (it was quite complex and required reading a variable in untrusted memory from LibOS).

Why "arbitrary code"? We can have case-by-case logic for unwinding. In other words, for e.g. ocall_read() we memorize the RIP before the host-level read() and the RIP after it. Then in our signal-handling logic, we compare the RIP-at-which-interrupted with the memorized RIPs and figure out that we were interrupted "in ocall_read() after the host-level syscall" -- thus we need to "roll forward" with this particular ocall. We do something similar in the SGX assembly.

I understand that this implementation may be even more complex than what you're proposing, but why is this impossible?

dimakuv avatar Mar 01 '21 14:03 dimakuv

We do something similar in the SGX assembly.

We do that for 3 instructions, not for arbitrary call stacks.

I understand that this implementation may be even more complex than what you're proposing, but why is this impossible?

What if we get interrupted inside ocall_read before the syscall, while holding some lock? What if we get interrupted before issuing ocall_read in trusted PAL or even LibOS?

boryspoplawski avatar Mar 01 '21 15:03 boryspoplawski

What if we get interrupted inside ocall_read before the syscall, while holding some lock?

Ok, yes, I never actually thought this through. I was under the impression that we could "simply check if the lock is taken and unlock it". But now I understand that such an implementation would end up being an instruction-by-instruction emulation...

Thanks, at least I now understand that the naive idea of "let's try to unwind case by case" is impossible. I don't have any other ideas currently.

dimakuv avatar Mar 01 '21 16:03 dimakuv

Updated the top comment with another approach.

boryspoplawski avatar Mar 02 '21 01:03 boryspoplawski

Sorry to be late to the party; if this is still under active discussion, my main question is whether we could build this and exitless support on the same substrate?

donporter avatar Mar 26 '21 20:03 donporter

@donporter still under discussion and we don't plan to implement this soon. It's just an RFC to discuss the idea and see what others think ;)

mkow avatar Mar 26 '21 21:03 mkow

(...) my main question is whether we could build this and exitless support on the same substrate?

These are rather orthogonal features: the proposed idea works completely in the untrusted part. While it could probably be special-cased to work with Exitless (e.g. instead of sleeping, the thread would actively spin inside the enclave), it needs careful design. The original idea assumed waking up the helper thread, which probably requires exiting the enclave anyway, so I'm not sure whether this is doable; I haven't given it much thought, but maybe there is a way.

boryspoplawski avatar Mar 27 '21 17:03 boryspoplawski

@boryspoplawski Was there anything new on this issue since March? Looks like not really - we haven't worked on this problem.

dimakuv avatar Nov 25 '21 08:11 dimakuv

No, there was not, though a similar issue for the untrusted part of the Linux-SGX PAL was solved in https://github.com/gramineproject/graphene/pull/2602

Also, it seems that after the above PR we are not seeing any issues with this - probably no real workloads use signals extensively and depend on this behavior.

boryspoplawski avatar Nov 25 '21 09:11 boryspoplawski

I'll close this issue as it seems not to be a problem for Gramine.

dimakuv avatar Mar 09 '23 14:03 dimakuv