document system requirements

Open raulk opened this issue 3 years ago • 6 comments

A Butterflynet participant reports this panic following the nv16 migration:

2022-05-31T14:46:55.403 INFO filcrypto::util::types > create_fvm_machine: start
2022-05-31T14:46:55.403 INFO filcrypto::fvm::machine > using FVM V1
2022-05-31T14:46:55.403 DEBUG fvm::machine::default > initializing a new machine, epoch=333259, base_fee=100, nv=V16, root=bafy2bzaceathvy4cdfff6stvasq6ig6wv73lg2fpoxgnjqykuyvtynbv7x326
thread '<unnamed>' panicked at 'module_cache poisoned: PoisonError { .. }', /home/lotus/.cargo/registry/src/github.com-1ecc6299db9ec823/fvm-1.0.0-rc.2/src/machine/engine.rs:207:52
2022-05-31T14:46:55.404Z        ERROR   messagepool     messagepool/messagepool.go:1521 adding local message: failed to look up actor state nonce: computing tipset state for GetActor: making vm: Rust panic: no unwind information:

Their platform is Linux on ARM64:

Linux tegra-ubuntu 4.9.253-tegra #3 SMP PREEMPT Wed Aug 18 20:13:59 CST 2021 aarch64 aarch64 aarch64 GNU/Linux

This panic is raised here, failing a lock acquisition:

https://github.com/filecoin-project/ref-fvm/blob/master/fvm/src/machine/engine.rs#L208

The Rust docs for Mutex::lock() say:

Errors: If another user of this mutex panicked while holding the mutex, then this call will return an error once the mutex is acquired.

It seems like this error itself is a red herring, as it indicates that a previous acquirer panicked while the lock was held. I'll try to dig up more details from the log, but it looks platform-specific.
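To illustrate why the PoisonError is only a symptom, here is a minimal standalone sketch of mutex poisoning using only the standard library. The function names and the `Vec<u8>` standing in for the module cache are hypothetical; this is not the ref-fvm engine code.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Poisons `cache` by panicking on another thread while that thread holds
/// the lock, then reports whether a later `lock()` call observes the poison.
/// (Hypothetical helper; `Vec<u8>` stands in for the real module cache.)
fn poison_and_check(cache: &Arc<Mutex<Vec<u8>>>) -> bool {
    let worker_cache = Arc::clone(cache);
    let worker = thread::spawn(move || {
        // The guard is alive when the panic unwinds, which poisons the mutex.
        let _guard = worker_cache.lock().unwrap();
        panic!("simulated failure while holding the lock");
    });
    // join() returns Err because the worker thread panicked.
    let worker_panicked = worker.join().is_err();
    // Every later acquirer now gets Err(PoisonError), far from the root cause.
    let poisoned = cache.lock().is_err();
    worker_panicked && poisoned
}

/// Reads the length of the protected data even if the mutex is poisoned,
/// by taking the guard out of the PoisonError.
fn len_ignoring_poison(cache: &Mutex<Vec<u8>>) -> usize {
    let guard = cache.lock().unwrap_or_else(|e| e.into_inner());
    guard.len()
}
```

The first panic (inside the worker) is the one worth finding in the logs; the poisoned-lock failures that follow only report that it happened.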

raulk avatar Jun 01 '22 11:06 raulk

There are two substantial things that happen in the Engine while holding the lock:

  • Wasm bytecode instrumentation: https://github.com/filecoin-project/ref-fvm/blob/beb88c950ceb5488c50cd9ddb503fa54b38dad1e/fvm/src/machine/engine.rs#L238 (not platform dependent AFAIK)
  • Wasm bytecode compilation (platform dependent!)

If either of these panics, the lock will be poisoned.

It would be useful to add ARM64 as a build target for conformance tests on CI.

raulk avatar Jun 01 '22 12:06 raulk

Thanks to Benjamin for providing more detailed logs.

This is the original panic that ended up poisoning the lock:

2022-05-31T14:33:00.468 INFO filcrypto::fvm::machine > using FVM V1
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 22, kind: InvalidInput, message: "Invalid argument" }', /home/lotus/.cargo/registry/src/github.com-1ecc6299db9ec823/wasmtime-jit-0.37.0/src/code_memory.rs:63:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This comes from these lines, which are specific to aarch64 Linux:

https://github.com/bytecodealliance/wasmtime/blob/0bdd8e3510ae83af1504f0c144acde168e052311/crates/jit/src/code_memory.rs#L56-L64

raulk avatar Jun 01 '22 13:06 raulk

@raulk can you get the linux version (uname -a)?

Stebalien avatar Jun 01 '22 15:06 Stebalien

Ok, the answer is 4.9. It looks like the minimum supported Linux kernel is 4.16 (https://kernelnewbies.org/Linux_4.16#membarrier.282.29_expedited_support), which was released about 4 years ago.

Unfortunately, it looks like Ubuntu 18.04 (the oldest LTS that doesn't require a support contract) still uses 4.15.
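A minimal sketch of how a startup check for the 4.16 minimum mentioned above could look. The function names and the `MIN_KERNEL` constant are hypothetical, the threshold is taken from this thread rather than from wasmtime documentation, and reading `/proc/sys/kernel/osrelease` assumes a Linux host.

```rust
/// Minimum kernel (major, minor) cited in this thread. This is an assumption
/// from the discussion, not a documented wasmtime requirement.
const MIN_KERNEL: (u32, u32) = (4, 16);

/// Parses the leading "major.minor" from a kernel release string such as
/// "4.9.253-tegra". Returns None if the string does not start that way.
fn parse_kernel_release(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.trim().split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    // Keep only the leading digits of the second field ("15-generic" -> "15").
    let minor_digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor: u32 = minor_digits.parse().ok()?;
    Some((major, minor))
}

/// Returns Some(true) if the release is at least MIN_KERNEL, Some(false) if
/// it is older, and None if the string could not be parsed.
fn meets_minimum(release: &str) -> Option<bool> {
    // Tuples compare lexicographically: major first, then minor.
    parse_kernel_release(release).map(|version| version >= MIN_KERNEL)
}

/// Reads the running kernel's release string (Linux only).
fn running_kernel_release() -> std::io::Result<String> {
    std::fs::read_to_string("/proc/sys/kernel/osrelease")
}
```

Passing the reporter's release string, `4.9.253-tegra`, to `meets_minimum` yields `Some(false)`, so a check like this would flag the host before the JIT panics.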

Stebalien avatar Jun 01 '22 18:06 Stebalien

However, Ubuntu 18.04.6 ships a much newer kernel.

Stebalien avatar Jun 01 '22 18:06 Stebalien

I've retitled this as a chore to document system requirements.

raulk avatar Jun 13 '22 10:06 raulk