Running docker containers in docker-compose consistently fails on node 4 of 4
Description
I've set up a docker-compose file in conjunction with TM nodes. Without fail, the last worker node of the last narwhal node fails to set up RocksDB and immediately shuts down after a failed startup.
docker_run-narwhal-prim3-worker0-1 | rm: cannot remove '/validators/validator-2/db-worker-0/OPTIONS-000011': No such file or directory
docker_run-narwhal-prim3-worker0-1 | Bootstrapping new worker node with id 0
docker_run-narwhal-prim3-worker0-1 | 2022-08-15T15:40:06.295497506Z ERROR open_cf{path="/validators/validator-2/db-worker-0" cf=["headers", "certificates", "payload", "batches", "last_committed", "sequence"]}:open_cf_opts{path="/validators/validator-2/db-worker-0"}: typed_store::rocks: error=rocksdb error: IO error: No such file or directory: While opening a file for sequentially reading: /validators/validator-2/db-worker-0/MANIFEST-000001: No such file or directory
docker_run-narwhal-prim3-worker0-1 | 2022-08-15T15:40:06.295522298Z ERROR open_cf{path="/validators/validator-2/db-worker-0" cf=["headers", "certificates", "payload", "batches", "last_committed", "sequence"]}: typed_store::rocks: error=rocksdb error: IO error: No such file or directory: While opening a file for sequentially reading: /validators/validator-2/db-worker-0/MANIFEST-000001: No such file or directory
docker_run-narwhal-prim3-worker0-1 | thread 'main' panicked at 'Cannot open database: RocksDBError("IO error: No such file or directory: While opening a file for sequentially reading: /validators/validator-2/db-worker-0/MANIFEST-000001: No such file or directory")', /narwhal/node/src/lib.rs:70:10
docker_run-narwhal-prim3-worker0-1 | stack backtrace:
docker_run-narwhal-prim3-worker0-1 | 0: rust_begin_unwind
docker_run-narwhal-prim3-worker0-1 | at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
docker_run-narwhal-prim3-worker0-1 | 1: core::panicking::panic_fmt
docker_run-narwhal-prim3-worker0-1 | at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
docker_run-narwhal-prim3-worker0-1 | 2: core::result::unwrap_failed
docker_run-narwhal-prim3-worker0-1 | at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/result.rs:1785:5
docker_run-narwhal-prim3-worker0-1 | 3: node::NodeStorage::reopen
docker_run-narwhal-prim3-worker0-1 | 4: node::run::{{closure}}
docker_run-narwhal-prim3-worker0-1 | 5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
docker_run-narwhal-prim3-worker0-1 | 6: node::main
docker_run-narwhal-prim3-worker0-1 | note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
docker_run-narwhal-prim3-worker0-1 exited with code 101
- Additionally, other nodes sometimes fail to start because other files temporarily do not exist, and the process/container exits because of it.
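For anyone trying to reproduce, the state of the worker's database directory can be inspected right after a failed startup. A sketch, assuming the compose service is named `narwhal-prim3-worker0` (derived from the container name above) and that the image ships a shell:

```sh
# List what is actually present in the directory rocksdb is trying to open
# (service name is hypothetical, derived from docker_run-narwhal-prim3-worker0-1)
docker compose run --rm --entrypoint sh narwhal-prim3-worker0 \
  -c 'ls -la /validators/validator-2/db-worker-0'
```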
Steps to reproduce
- First run `make build_docker_tm` from this Makefile: https://github.com/PeggyJV/tendermint/blob/feat/narwhal/mempool/narwhal/internal/Makefile#L9
- Then run `docker compose up` from the directory of this compose file: https://github.com/PeggyJV/tendermint/blob/feat/narwhal/mempool/narwhal/internal/narwhalmint/docker_run/docker-compose.yml
- Startup should then fail, without fail, after `docker compose down && docker compose up` (see the consolidated command sketch below).
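A consolidated sketch of the commands above, assuming the branch is checked out locally and the working directories match the linked Makefile and compose file:

```sh
# Build the Tendermint docker image the compose file expects
# (run from the directory containing the linked Makefile)
make build_docker_tm

# Bring up the TM nodes and the narwhal primaries/workers
# (run from narwhalmint/docker_run, next to docker-compose.yml)
docker compose up

# Restart the stack -- the last worker node reliably fails to open rocksdb here
docker compose down && docker compose up
```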
Note
I'm running on an M1 Pro MacBook Pro with Docker Desktop 4.11.1 (84025), which is currently the newest version available.
Additionally, the Tendermint integration with the worker node is using the stream API to submit txs.
Another issue reared its head with the docker containers. At times, when shutting down the containers, Docker fails to stop them because it never receives an exit event:
> docker compose down 10:27:28
[+] Running 4/8
⠿ Container docker_run-tm-node2-1 Removed 0.0s
⠿ Container docker_run-tm-node0-1 Removed 0.0s
⠿ Container docker_run-tm-node3-1 Removed 0.0s
⠿ Container docker_run-tm-node1-1 Removed 0.0s
⠿ Container docker_run-narwhal-prim3-worker0-1 Error while Stopping 24.1s
⠿ Container docker_run-narwhal-prim1-worker0-1 Error while Stopping 24.1s
⠿ Container docker_run-narwhal-prim0-worker0-1 Error while Stopping 24.1s
⠿ Container docker_run-narwhal-prim2-worker0-1 Error while Stopping 24.1s
Error response from daemon: cannot stop container: f52fc8ad73264bf27a22c78c4d4cd2f0a45d6a113da174cd21d1b260d3c248ff: tried to kill container, but did not receive an exit event
> docker rm -f $(docker ps -aq) 10:28:27
4264b4ad593b
Error response from daemon: Could not kill running container f735379bf85b7ff8ace15332ccb744990f1dbbd815e3dde15862a7ee63763977, cannot remove - tried to kill container, but did not receive an exit event
Error response from daemon: Could not kill running container 987c023626c933aa74c8c280190dfcf523d3b1427d65db9defb35b980ea222ca, cannot remove - tried to kill container, but did not receive an exit event
Error response from daemon: Could not kill running container f52fc8ad73264bf27a22c78c4d4cd2f0a45d6a113da174cd21d1b260d3c248ff, cannot remove - tried to kill container, but did not receive an exit event
Error response from daemon: Could not kill running container 488a905332a9629ae759d754ff8e5c567301b6e0ab2aa497cc2c0293920deb7f, cannot remove - tried to kill container, but did not receive an exit event
Error response from daemon: Could not kill running container 46fd11167d9a673cffe2302f35634e1f0dee2a03480ea24de7e2015f1892f6a5, cannot remove - tried to kill container, but did not receive an exit event
Error response from daemon: Could not kill running container 33ed262954b30e109b7e28244f3e74906066cc6d693313a1280b8f76eb443687, cannot remove - tried to kill container, but did not receive an exit event
Error response from daemon: Could not kill running container 9ed6e0e91bbf51271c3924692083bf7db36a2a09055efa92afe811c88f2f0bde, cannot remove - tried to kill container, but did not receive an exit event
When this happens (it is relatively infrequent, but a royal pain), you have to hard-kill Docker itself and then clean up the mess it leaves behind.
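For reference, the cleanup after restarting Docker Desktop looks something like this (a sketch; adjust to whatever containers and volumes the compose project actually left behind):

```sh
# After restarting Docker Desktop, remove whatever the stuck
# `docker compose down` left behind
docker rm -f $(docker ps -aq)

# Drop the compose project's networks and volumes as well
docker compose down -v --remove-orphans
```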
Additionally, the mounted logs are never produced. Looking at the docker compose file, we're not teeing the log output to a file; perhaps that's why?
I just rebased onto master (commit e54611ee0387735b1963eecfc4fdc120ee8ac509), and the entire docker-compose setup within narwhal no longer works. It looks like an expected env var is not present? Now all the primaries log the following and exit with status code 1 almost immediately after start:
2022-08-16T17:34:42.788472971Z INFO open_cf{path="/validators/validator-2/db-primary" cf=["headers", "certificates", "payload", "batches", "last_committed", "sequence", "temp_batches"]}: typed_store::rocks: Env var MYSTEN_DB_WRITE_BUFFER_SIZE_MB is not set: environment variable not found
2022-08-16T17:34:42.788496846Z INFO open_cf{path="/validators/validator-2/db-primary" cf=["headers", "certificates", "payload", "batches", "last_committed", "sequence", "temp_batches"]}: typed_store::rocks: Env var MYSTEN_DB_WAL_SIZE_MB is not set: environment variable not found
2022-08-16T17:34:42.793563596Z INFO open_cf{path="/validators/validator-2/db-primary" cf=["headers", "certificates", "payload", "batches", "last_committed", "sequence", "temp_batches"]}:open_cf_opts{path="/validators/validator-2/db-primary"}: typed_store::rocks: Env var MYSTEN_DB_WRITE_BUFFER_SIZE_MB is not set: environment variable not found
2022-08-16T17:34:42.795720387Z INFO open_cf{path="/validators/validator-2/db-primary" cf=["headers", "certificates", "payload", "batches", "last_committed", "sequence", "temp_batches"]}:open_cf_opts{path="/validators/validator-2/db-primary"}: typed_store::rocks: Env var MYSTEN_DB_WAL_SIZE_MB is not set: environment variable not found
It's unclear to me what's causing the failures. If those env vars are not set, perhaps assume a default value?
Moving back to commit 2de9cd02, I'm able to run TM/narwhal together again. The only problem remaining on that commit is the original issue reported here.
PR to update the logging here: https://github.com/MystenLabs/narwhal/pull/803
I have not managed to reproduce this issue, though I must admit the machine I tried it on is pretty beefy (bare-metal Linux, 16 cores, 64 GB memory). I tried commit 6103f64d0e5dbe6e16e39554e5be27de693e3501 on this repo.
Here is the `docker stats` output for all the running containers:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
b3876cde8a62 docker_run-narwhal-prim1-1 123.52% 6.559MiB / 62.78GiB 0.01% 12.6kB / 11.4kB 0B / 745kB 38
c4e3ddbd27a3 docker_run-narwhal-prim3-1 121.82% 6.484MiB / 62.78GiB 0.01% 9.75kB / 6.46kB 0B / 745kB 38
f597eb04c0ac docker_run-narwhal-prim1-worker0-1 0.02% 5.926MiB / 62.78GiB 0.01% 2.53kB / 0B 0B / 729kB 38
d77a27057bcf docker_run-narwhal-prim2-worker0-1 0.02% 5.723MiB / 62.78GiB 0.01% 1.61kB / 0B 0B / 729kB 38
1921062365b8 docker_run-narwhal-prim2-1 124.54% 6.426MiB / 62.78GiB 0.01% 10kB / 7.47kB 0B / 745kB 38
2c838b3120ab docker_run-narwhal-prim0-1 123.08% 6.496MiB / 62.78GiB 0.01% 11.5kB / 11.2kB 16.4kB / 737kB 38
46e4f43bebbe docker_run-narwhal-prim3-worker0-1 0.03% 5.797MiB / 62.78GiB 0.01% 1.44kB / 0B 0B / 721kB 38
0558a0aec49b docker_run-narwhal-prim0-worker0-1 0.02% 5.812MiB / 62.78GiB 0.01% 1.61kB / 0B 0B / 737kB 38
Full logs here. I have also not observed any issues in stopping the containers (through a simple Ctrl-C).
I note that macOS on the new M1s is pretty conservative about real concurrency: using more than the available number of real threads (8 or 10 depending on the configuration) can get a process killed. I believe cgroups allocate real threads to each container, and I don't know the underlying details of the virtualization that underpins Docker Desktop on macOS.
Have you considered running on a slightly beefier Linux machine on EC2?
Here is one relevant log line:
docker_run-narwhal-prim0-1 | 2022-08-22T14:58:36.423629542Z WARN primary::core: Authority /iIye00hpCVaUQlM5AQ3s7OtO3Fc4Z2SPzzmVVHIQbM= appears in quorum more than once
That looks like an issue in our certificate production, and I'll try to chase it down.
Re: `MYSTEN_DB_WRITE_BUFFER_SIZE_MB` and `MYSTEN_DB_WAL_SIZE_MB`:
- Those env vars are read here: https://github.com/MystenLabs/mysten-infra/blob/8d090689be14078f2ca41c356e7bbc0af21f73ab/crates/typed-store/src/rocks/mod.rs#L441-L478
- Though the messages are scary, those warnings are not fatal; they will just lead to your system adopting the default values (see the sketch below if you want to set the variables explicitly).
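Those variables are read from the node's own process environment, so to override the defaults they have to be set inside the containers (e.g. under each narwhal service's `environment:` key in docker-compose.yml) rather than in the host shell. A quick way to check whether they reached the process, assuming a service name like `narwhal-prim0`:

```sh
# Confirm the variables are visible inside a running primary's container
docker compose exec narwhal-prim0 env | grep MYSTEN_DB
```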
> Additionally, the mounted logs are never produced. Looking at the docker compose file, we're not teeing the log output to a file; perhaps that's why?
The logs go to stdout/stderr for collection by the Loki (promtail) log collection process.
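Since everything goes to stdout/stderr, the per-container logs can still be pulled locally without any extra mounts. A sketch, assuming the compose service names follow the container names above:

```sh
# Follow the failing worker's output
docker compose logs -f narwhal-prim3-worker0

# Or dump it to a file to attach to the issue
docker compose logs --no-color narwhal-prim3-worker0 > worker0.log
```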
Is this still a problem? The initially reported error seems to have been related to disk issues. Did you try `docker compose down -v` to delete the volumes before bringing it back up?
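For completeness, the suggested reset looks something like this (`-v` removes the volumes declared by the compose file, which wipes the on-disk rocksdb state):

```sh
# Tear down the stack and delete its volumes (rocksdb data included)
docker compose down -v

# Bring it back up from a clean slate
docker compose up
```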