Instance::run panics if another instance trapped in the same thread
It looks like thread-local state becomes corrupted once an instance traps (e.g. by executing unreachable), and when the next instance is run on the same thread, it panics with the following backtrace:
thread '<unnamed>' panicked at 'enabling or changing the signal stack succeeds: Sys(EPERM)', src/libcore/result.rs:1084:5
stack backtrace:
0: std::sys_common::backtrace::print
1: std::panicking::default_hook::{{closure}}
2: std::panicking::default_hook
3: <std::panicking::begin_panic::PanicPayload<A> as core::panic::BoxMeUp>::get
4: std::panicking::continue_panic_fmt
5: std::panicking::try::do_call
6: <T as core::any::Any>::type_id
7: core::ptr::real_drop_in_place
8: core::result::Result<T,E>::expect
at /private/tmp/nix-build-rustc-1.38.0.drv-0/rustc-1.38.0-src/src/libcore/result.rs:879
9: lucet_runtime_internals::instance::signals::<impl lucet_runtime_internals::instance::Instance>::with_signals_on
at /Users/lifted/Projects/dfinity/rs/.cargo-home/registry/src/github.com-1ecc6299db9ec823/lucet-runtime-internals-0.1.1/src/instance/signals.rs:65
10: lucet_runtime_internals::instance::Instance::run_func
at /Users/lifted/Projects/dfinity/rs/.cargo-home/registry/src/github.com-1ecc6299db9ec823/lucet-runtime-internals-0.1.1/src/instance.rs:612
11: lucet_runtime_internals::instance::Instance::run
The panic comes from this code in lucet-runtime-internals:
impl Instance {
    pub(crate) fn with_signals_on<F, R>(&mut self, f: F) -> Result<R, Error>
    where
        F: FnOnce(&mut Instance) -> Result<R, Error>,
    {
        // Set up the signal stack for this thread. Note that because signal stacks are per-thread,
        // rather than per-process, we do this for every run, while the signal handler is installed
        // only once per process.
        let guest_sigstack = SigStack::new(
            self.alloc.slot().sigstack,
            SigStackFlags::empty(),
            libc::SIGSTKSZ,
        );
        let previous_sigstack = unsafe { sigaltstack(Some(guest_sigstack)) } // <-- PANICS
            .expect("enabling or changing the signal stack succeeds");
You can reproduce the issue using the following small crate:
src/main.rs
use lucet_runtime::{DlModule, MmapRegion, Region};
use lucet_runtime_internals::alloc::Limits;
use lucetc::Lucetc;
fn main() {
    let wasm = wabt::wat2wasm(
        r#"
        (module
          (func (export "trap") unreachable)
          (func (export "test")))
        "#,
    )
    .unwrap();
    let lucetc = Lucetc::try_from_bytes(wasm.as_slice()).unwrap();
    lucetc.shared_object_file("mod.so").unwrap();
    let module = DlModule::load("mod.so").unwrap();
    let memory: std::sync::Arc<MmapRegion> = MmapRegion::create(1, &Limits::default()).unwrap();
    // Both instances run on the same (main) thread; the second run panics.
    for func_name in &["trap", "test"] {
        let mut instance_handle = memory.new_instance_builder(module.clone()).build().unwrap();
        println!("{:?}", instance_handle.run(func_name, &[]));
    }
}
Cargo.toml
[package]
name = "min-example"
version = "0.1.0"
edition = "2018"
[dependencies]
lucet-runtime = "0.1.1"
lucetc = "0.1.1"
lucet-runtime-internals = "0.1.1"
wabt = "0.7.4"
The issue disappears if every instance is created in a separate thread:
src/main.rs
use lucet_runtime::{DlModule, MmapRegion, Region};
use lucet_runtime_internals::alloc::Limits;
use lucetc::Lucetc;
fn main() {
    let wasm = wabt::wat2wasm(
        r#"
        (module
          (func (export "trap") unreachable)
          (func (export "test")))
        "#,
    )
    .unwrap();
    let lucetc = Lucetc::try_from_bytes(wasm.as_slice()).unwrap();
    lucetc.shared_object_file("mod.so").unwrap();
    let module = DlModule::load("mod.so").unwrap();
    for func_name in &["trap", "test"] {
        let module = module.clone();
        // Running each instance on a fresh thread avoids the panic.
        let handler = std::thread::spawn(move || {
            let memory: std::sync::Arc<MmapRegion> =
                MmapRegion::create(1, &Limits::default()).unwrap();
            let mut instance_handle = memory.new_instance_builder(module).build().unwrap();
            println!("{:?}", instance_handle.run(func_name, &[]));
        });
        handler.join().unwrap();
    }
}
Hi @roman-kashitsyn, thanks for the report! Unfortunately, I've so far been unable to replicate this issue, at least on my Linux machine. Can you share more about your OS environment?
Also, the "0.1.1" versions of the Lucet packages on crates.io are pretty old at this point, so it may also be worth trying the current versions from master. I tried your example with both, though, and it succeeded both times.
Sorry, I forgot to mention that I observe the issue on macOS 10.14 and 10.15. I'll try to reproduce it on current master.
I got access to a Mac to test this out, and can confirm that it's also an issue with master. macOS is a pretty experimental platform for us; most of the developers and our deployment environment are x86-64 Linux. As a result, we may not be able to get to the bottom of this very soon, but we're happy to provide guidance if you're interested in troubleshooting.
It appears there's something strange going on with the semantics of alternate signal stacks on macOS. When a Lucet guest traps, a SIGILL is raised and handled by the runtime, which swaps back to the host context and returns an error. After the swap to the host context, the thread is no longer on the alternate signal stack, but getting EPERM from the subsequent sigaltstack call suggests that the OS believes the signal stack is still active. Perhaps there's a system call we need to make only on Macs to convince the OS that we've jumped away from the handler?
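One way to check that theory (a hypothetical diagnostic, not Lucet code; report_altstack_state is a made-up name): calling sigaltstack with a NULL new stack is a pure query, so running something like this right after the swap back to the host context would show whether the kernel still considers the thread to be on its alternate stack:

#include <signal.h>
#include <stdio.h>

/* Hypothetical diagnostic: passing NULL as the new stack makes
 * sigaltstack a pure query, reporting the kernel's view without
 * changing anything. */
void report_altstack_state(void) {
    stack_t current;
    if (sigaltstack(NULL, &current) == 0) {
        printf("on alt stack: %s\n",
               (current.ss_flags & SS_ONSTACK) ? "yes" : "no");
    } else {
        perror("sigaltstack(NULL, ...)");
    }
}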
I'll check whether the problem is still reproducible if I call sigaltstack with the SS_DISABLE flag before setting the new stack. A sketch of that experiment follows below.
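Roughly (reset_altstack is a hypothetical helper, and this assumes the kernel honors SS_DISABLE even when it believes the old stack is active):

#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical experiment: explicitly disable the current alternate
 * stack before installing the new one. ss_sp and ss_size are ignored
 * when SS_DISABLE is set. */
int reset_altstack(const stack_t *new_stack) {
    stack_t disable;
    memset(&disable, 0, sizeof(disable));
    disable.ss_flags = SS_DISABLE;
    if (sigaltstack(&disable, NULL) < 0) {
        perror("sigaltstack(SS_DISABLE)"); /* may itself fail with EPERM */
        return -1;
    }
    return sigaltstack(new_stack, NULL);
}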
It's weird that the problem is not reproducible on Linux; the man page says:
EPERM  An attempt was made to change the alternate signal stack while
       it was active (i.e., the process was already executing on the
       current alternate signal stack).
As far as I understand, that means it's illegal to change the alternate signal stack while the thread is still executing on it.
We should never be swapping out the alt stack while the signal handler is running, only after we've swapped back to the host context. Linux must correctly recognize when we've jumped away from the handler, whereas macOS apparently still believes it's running.
I tried to construct a small C program reproducing the macOS issue with sigaltstack, but I didn't manage to trigger the issue. Here is the program I tried:
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>

static volatile int G_caught_sig = 0;
static char G_sig_stack_1[SIGSTKSZ];
static char G_sig_stack_2[SIGSTKSZ];

void handler(int sig, siginfo_t *_info, void *_arg) {
    G_caught_sig = sig;
}

int main() {
    /* Install the first alternate signal stack. */
    stack_t ss, old_ss;
    ss.ss_size = SIGSTKSZ;
    ss.ss_flags = 0;
    ss.ss_sp = &G_sig_stack_1;
    if (sigaltstack(&ss, &old_ss) < 0) {
        perror("sigaltstack");
        exit(EXIT_FAILURE);
    }

    /* Handle SIGILL on the alternate stack. */
    struct sigaction sa;
    sa.sa_flags = SA_ONSTACK | SA_SIGINFO;
    sa.sa_sigaction = &handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGILL, &sa, NULL) < 0) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    raise(SIGILL);
    printf("Caught signal = %d\n", G_caught_sig);

    /* Try to swap in a different alternate stack after the handler ran. */
    ss.ss_sp = &G_sig_stack_2;
    if (sigaltstack(&ss, &old_ss) < 0) {
        perror("sigaltstack");
        exit(EXIT_FAILURE);
    }
    return 0;
}
UPD: One interesting difference I observed is that the syscall trace of this small C program contains sigreturn (a syscall that cleans up the signal stack state), while the trace of the Rust program in the issue description doesn't have this syscall.
This is more evidence that macOS behaves differently when jumping out of a signal handler rather than returning from it. Unfortunately, many of the signals that a guest raises when faulting can't be immediately resolved, so it would just keep re-raising the same signal if we returned from the handler. I wonder if there's a syscall similar to sigreturn that we could call to make it clear to the OS that we're no longer on the signal stack?
Unfortunately, many of the signals that a guest raises when faulting can't be immediately resolved, so it would just keep re-raising the same signal if we returned from the handler.
Yes, that makes a lot of sense. I wrote another small C program that uses longjmp to get out of a signal handler to avoid being stuck in an infinite loop.
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <setjmp.h>
#include <unistd.h>
#include <sys/mman.h>

static jmp_buf jump_buffer;
static volatile int G_caught_sig = 0;
static volatile char *G_memory = NULL;
static char G_sig_stack_1[SIGSTKSZ];
static char G_sig_stack_2[SIGSTKSZ];

void handler(int sig, siginfo_t *_info, void *_arg) {
    G_caught_sig = sig;
    /* Jump out of the handler instead of returning, since the
     * faulting access would otherwise be retried forever. */
    longjmp(jump_buffer, sig);
}

int main() {
    /* Map an inaccessible page to fault on. */
    G_memory = mmap(0, getpagesize(), PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (G_memory == MAP_FAILED) {
        perror("mmap");
        exit(EXIT_FAILURE);
    }

    stack_t ss, old_ss;
    ss.ss_size = SIGSTKSZ;
    ss.ss_flags = 0;
    ss.ss_sp = &G_sig_stack_1;
    if (sigaltstack(&ss, &old_ss) < 0) {
        perror("sigaltstack");
        exit(EXIT_FAILURE);
    }

    struct sigaction sa;
    sa.sa_flags = SA_ONSTACK | SA_SIGINFO;
    sa.sa_sigaction = &handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, NULL) < 0) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    while (setjmp(jump_buffer) == 0) {
        printf("Setting memory at %p\n", G_memory);
        G_memory[0] = 'z'; /* faults; macOS delivers SIGBUS here */
    }
    printf("Caught signal = %d\n", G_caught_sig);

    /* Swapping the alternate stack now succeeds. */
    ss.ss_sp = &G_sig_stack_2;
    if (sigaltstack(&ss, &old_ss) < 0) {
        perror("sigaltstack");
        exit(EXIT_FAILURE);
    }
    return 0;
}
The program terminates successfully. The syscall trace still contains sigreturn (most likely it's called by longjmp). I also took a quick look at the latest available xnu kernel sources; it appears that sigreturn is the only place where the SA_ONSTACK flag that causes the EPERM error is cleared.
Ah, it looks like the macOS sigreturn is more flexible than the one available on Linux. We might be able to use it to jump back to the host context on a fault, instead of our own Context::set_from_signal. Or, we might have to use a C trampoline that we can setjmp from, so that we can longjmp out of the handler without having to get our hands dirty with what appears to be an undocumented syscall.
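For illustration, a minimal sketch of that trampoline idea (run_guest_via_trampoline and guest_entry are hypothetical names, and this assumes longjmp indeed triggers sigreturn on macOS as the trace above suggests):

#include <setjmp.h>
#include <signal.h>

static jmp_buf trap_escape;

/* The handler must be installed with SA_ONSTACK | SA_SIGINFO, as in
 * the programs above. Escaping via longjmp goes through sigreturn on
 * macOS, clearing the kernel's on-alt-stack state for the thread. */
static void trap_handler(int sig, siginfo_t *info, void *ctx) {
    (void)info;
    (void)ctx;
    longjmp(trap_escape, sig);
}

/* Hypothetical trampoline: the runtime would enter the guest through
 * this C function instead of swapping contexts out of the handler. */
static int run_guest_via_trampoline(void (*guest_entry)(void)) {
    int sig = setjmp(trap_escape);
    if (sig == 0) {
        guest_entry(); /* may fault and land in trap_handler */
        return 0;      /* guest returned normally */
    }
    return sig;        /* guest trapped; handler longjmp'ed here */
}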
Again, we probably won't be able to dive deeply into this in the short term, but this is very helpful diagnostic information for us. Thank you!