
Running docker containers in docker-compose consistently fails on node 4 of 4

Open jsteenb2 opened this issue 2 years ago • 7 comments

Description

I've set up a docker-compose file in conjunction with TM nodes. Without fail, the last worker node of the last narwhal node fails to set up RocksDB and immediately shuts down after the failed startup.

docker_run-narwhal-prim3-worker0-1  | rm: cannot remove '/validators/validator-2/db-worker-0/OPTIONS-000011': No such file or directory
docker_run-narwhal-prim3-worker0-1  | Bootstrapping new worker node with id 0
docker_run-narwhal-prim3-worker0-1  | 2022-08-15T15:40:06.295497506Z ERROR open_cf{path="/validators/validator-2/db-worker-0" cf=["headers", "certificates", "payload", "batches", "last_committed", "sequence"]}:open_cf_opts{path="/validators/validator-2/db-worker-0"}: typed_store::rocks: error=rocksdb error: IO error: No such file or directory: While opening a file for sequentially reading: /validators/validator-2/db-worker-0/MANIFEST-000001: No such file or directory
docker_run-narwhal-prim3-worker0-1  | 2022-08-15T15:40:06.295522298Z ERROR open_cf{path="/validators/validator-2/db-worker-0" cf=["headers", "certificates", "payload", "batches", "last_committed", "sequence"]}: typed_store::rocks: error=rocksdb error: IO error: No such file or directory: While opening a file for sequentially reading: /validators/validator-2/db-worker-0/MANIFEST-000001: No such file or directory
docker_run-narwhal-prim3-worker0-1  | thread 'main' panicked at 'Cannot open database: RocksDBError("IO error: No such file or directory: While opening a file for sequentially reading: /validators/validator-2/db-worker-0/MANIFEST-000001: No such file or directory")', /narwhal/node/src/lib.rs:70:10
docker_run-narwhal-prim3-worker0-1  | stack backtrace:
docker_run-narwhal-prim3-worker0-1  |    0: rust_begin_unwind
docker_run-narwhal-prim3-worker0-1  |              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
docker_run-narwhal-prim3-worker0-1  |    1: core::panicking::panic_fmt
docker_run-narwhal-prim3-worker0-1  |              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
docker_run-narwhal-prim3-worker0-1  |    2: core::result::unwrap_failed
docker_run-narwhal-prim3-worker0-1  |              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/result.rs:1785:5
docker_run-narwhal-prim3-worker0-1  |    3: node::NodeStorage::reopen
docker_run-narwhal-prim3-worker0-1  |    4: node::run::{{closure}}
docker_run-narwhal-prim3-worker0-1  |    5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
docker_run-narwhal-prim3-worker0-1  |    6: node::main
docker_run-narwhal-prim3-worker0-1  | note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
docker_run-narwhal-prim3-worker0-1 exited with code 101
  • additionally, other nodes fail to start because other files are temporarily missing, and the process/container exits because of it
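
For context, the "rm: cannot remove ... OPTIONS-000011" line suggests the startup script deletes individual RocksDB files before bootstrapping. The actual entrypoint isn't shown in this thread, so the following is only an illustrative sketch of a cleanup that recreates the whole database directory rather than deleting individual files (path taken from the log above):

# sketch only, not the real entrypoint: wipe and recreate the worker DB directory
# so the node starts from a consistent, empty directory instead of one with some
# files deleted and others (like MANIFEST-000001) expected but missing
DB_DIR=/validators/validator-2/db-worker-0
rm -rf "$DB_DIR" && mkdir -p "$DB_DIR"
echo "Bootstrapping new worker node with id 0"
# ...then exec the worker process, as the real script presumably does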

Steps to reproduce

  • first, run make build_docker_tm from this Makefile: https://github.com/PeggyJV/tendermint/blob/feat/narwhal/mempool/narwhal/internal/Makefile#L9
  • then run docker compose up from the directory containing this compose file: https://github.com/PeggyJV/tendermint/blob/feat/narwhal/mempool/narwhal/internal/narwhalmint/docker_run/docker-compose.yml
  • after docker compose down && docker compose up, startup fails every time (a condensed sketch of the sequence follows this list)
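
A condensed sketch of that sequence; the directory paths are inferred from the two links above, so treat them as assumptions:

# in a checkout of PeggyJV/tendermint on the feat/narwhal branch
cd mempool/narwhal/internal
make build_docker_tm             # builds the TM docker image (target from the linked Makefile)

cd narwhalmint/docker_run
docker compose up                # first start generally comes up fine
docker compose down
docker compose up                # second start: prim3's worker 0 panics opening RocksDB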

Note

I'm running on an M1 Pro MacBook Pro with Docker Desktop 4.11.1 (84025), which is currently the newest version available.

Additionally, the tendermint integration with the worker node is using the stream API to submit txs.

jsteenb2 avatar Aug 15 '22 15:08 jsteenb2

Another issue reared its head with the docker containers. There are times when, on shutting down the containers, docker fails to stop them, failing to receive an exit event:

  > docker compose down                                                                                                                                                                              10:27:28
[+] Running 4/8
 ⠿ Container docker_run-tm-node2-1               Removed                                                                                                                                                 0.0s
 ⠿ Container docker_run-tm-node0-1               Removed                                                                                                                                                 0.0s
 ⠿ Container docker_run-tm-node3-1               Removed                                                                                                                                                 0.0s
 ⠿ Container docker_run-tm-node1-1               Removed                                                                                                                                                 0.0s
 ⠿ Container docker_run-narwhal-prim3-worker0-1  Error while Stopping                                                                                                                                   24.1s
 ⠿ Container docker_run-narwhal-prim1-worker0-1  Error while Stopping                                                                                                                                   24.1s
 ⠿ Container docker_run-narwhal-prim0-worker0-1  Error while Stopping                                                                                                                                   24.1s
 ⠿ Container docker_run-narwhal-prim2-worker0-1  Error while Stopping                                                                                                                                   24.1s
Error response from daemon: cannot stop container: f52fc8ad73264bf27a22c78c4d4cd2f0a45d6a113da174cd21d1b260d3c248ff: tried to kill container, but did not receive an exit event
> docker rm -f $(docker ps -aq)                                                                                                                                                                    10:28:27
4264b4ad593b
Error response from daemon: Could not kill running container f735379bf85b7ff8ace15332ccb744990f1dbbd815e3dde15862a7ee63763977, cannot remove - tried to kill container, but did not receive an exit event
Error response from daemon: Could not kill running container 987c023626c933aa74c8c280190dfcf523d3b1427d65db9defb35b980ea222ca, cannot remove - tried to kill container, but did not receive an exit event
Error response from daemon: Could not kill running container f52fc8ad73264bf27a22c78c4d4cd2f0a45d6a113da174cd21d1b260d3c248ff, cannot remove - tried to kill container, but did not receive an exit event
Error response from daemon: Could not kill running container 488a905332a9629ae759d754ff8e5c567301b6e0ab2aa497cc2c0293920deb7f, cannot remove - tried to kill container, but did not receive an exit event
Error response from daemon: Could not kill running container 46fd11167d9a673cffe2302f35634e1f0dee2a03480ea24de7e2015f1892f6a5, cannot remove - tried to kill container, but did not receive an exit event
Error response from daemon: Could not kill running container 33ed262954b30e109b7e28244f3e74906066cc6d693313a1280b8f76eb443687, cannot remove - tried to kill container, but did not receive an exit event
Error response from daemon: Could not kill running container 9ed6e0e91bbf51271c3924692083bf7db36a2a09055efa92afe811c88f2f0bde, cannot remove - tried to kill container, but did not receive an exit event

When this happens (it is relatively infrequent, but a royal pain), you have to hard-kill Docker itself and then clean up the mess around it.
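
For reference, a rough sketch of that manual cleanup; the exact commands are an assumption, since the thread only says to hard-kill Docker itself:

# on macOS, hard-stop Docker Desktop and bring it back, since the daemon is no longer delivering exit events
osascript -e 'quit app "Docker"'
open -a Docker
# once the daemon is back, the leftover containers can actually be removed
docker rm -f $(docker ps -aq)
docker system prune -f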

jsteenb2 avatar Aug 16 '22 15:08 jsteenb2

Additionally, the mounted logs are never produced. Looking at the docker compose file, we're not teeing the log output to a file; perhaps that's why?
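
If the compose file only bind-mounts a log directory, something like the following would be needed to actually write files; this is a sketch, and the service name is inferred from the container names above:

docker compose up 2>&1 | tee docker_run.log                            # capture everything compose prints
docker compose logs -f narwhal-prim3-worker0 | tee prim3-worker0.log   # or follow a single service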

jsteenb2 avatar Aug 16 '22 15:08 jsteenb2

Just rebased from master (commit e54611ee0387735b1963eecfc4fdc120ee8ac509), and the entire docker-compose setup within narwhal no longer works. It looks like an expected env var is not present? Now all the primaries exhibit this and exit with status code 1 almost immediately after starting:

2022-08-16T17:34:42.788472971Z  INFO open_cf{path="/validators/validator-2/db-primary" cf=["headers", "certificates", "payload", "batches", "last_committed", "sequence", "temp_batches"]}: typed_store::rocks: Env var MYSTEN_DB_WRITE_BUFFER_SIZE_MB is not set: environment variable not found
2022-08-16T17:34:42.788496846Z  INFO open_cf{path="/validators/validator-2/db-primary" cf=["headers", "certificates", "payload", "batches", "last_committed", "sequence", "temp_batches"]}: typed_store::rocks: Env var MYSTEN_DB_WAL_SIZE_MB is not set: environment variable not found
2022-08-16T17:34:42.793563596Z  INFO open_cf{path="/validators/validator-2/db-primary" cf=["headers", "certificates", "payload", "batches", "last_committed", "sequence", "temp_batches"]}:open_cf_opts{path="/validators/validator-2/db-primary"}: typed_store::rocks: Env var MYSTEN_DB_WRITE_BUFFER_SIZE_MB is not set: environment variable not found
2022-08-16T17:34:42.795720387Z  INFO open_cf{path="/validators/validator-2/db-primary" cf=["headers", "certificates", "payload", "batches", "last_committed", "sequence", "temp_batches"]}:open_cf_opts{path="/validators/validator-2/db-primary"}: typed_store::rocks: Env var MYSTEN_DB_WAL_SIZE_MB is not set: environment variable not found

It's unclear to me what's causing the failures. If those env vars are not set, perhaps assume a default value?

Moving back to commit 2de9cd02, I'm able to run TM/narwhal together again. The only issue on that branch is the original issue reported at the top.

jsteenb2 avatar Aug 16 '22 17:08 jsteenb2

PR to update the logging here: https://github.com/MystenLabs/narwhal/pull/803

jsteenb2 avatar Aug 16 '22 18:08 jsteenb2

I have not managed to reproduce this issue, though I must admit the machine I tried it on is pretty beefy (bare-metal Linux, 16 cores, 64 GB memory). I tried commit 6103f64d0e5dbe6e16e39554e5be27de693e3501 of this repo.

Here's the docker stats output for all the running containers:

CONTAINER ID   NAME                                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
b3876cde8a62   docker_run-narwhal-prim1-1           123.52%   6.559MiB / 62.78GiB   0.01%     12.6kB / 11.4kB   0B / 745kB       38
c4e3ddbd27a3   docker_run-narwhal-prim3-1           121.82%   6.484MiB / 62.78GiB   0.01%     9.75kB / 6.46kB   0B / 745kB       38
f597eb04c0ac   docker_run-narwhal-prim1-worker0-1   0.02%     5.926MiB / 62.78GiB   0.01%     2.53kB / 0B       0B / 729kB       38
d77a27057bcf   docker_run-narwhal-prim2-worker0-1   0.02%     5.723MiB / 62.78GiB   0.01%     1.61kB / 0B       0B / 729kB       38
1921062365b8   docker_run-narwhal-prim2-1           124.54%   6.426MiB / 62.78GiB   0.01%     10kB / 7.47kB     0B / 745kB       38
2c838b3120ab   docker_run-narwhal-prim0-1           123.08%   6.496MiB / 62.78GiB   0.01%     11.5kB / 11.2kB   16.4kB / 737kB   38
46e4f43bebbe   docker_run-narwhal-prim3-worker0-1   0.03%     5.797MiB / 62.78GiB   0.01%     1.44kB / 0B       0B / 721kB       38
0558a0aec49b   docker_run-narwhal-prim0-worker0-1   0.02%     5.812MiB / 62.78GiB   0.01%     1.61kB / 0B       0B / 737kB       38

Full logs here. I have also not observed any issues in stopping the containers (via a simple Ctrl-C).

I note that Mac OSX on the new M1s is pretty conservative about real concurrency: using more than the available number of hardware threads (8 or 10 depending on the configuration) can get a process killed. I believe cgroups allocate hardware threads to each container, and I don't know the underlying details of the virtualization that underpins Docker Desktop on OSX.

Have you considered running on a slightly beefier Linux machine on EC2?
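
If over-subscription of those hardware threads is the suspect, one quick experiment is to cap CPU per container (illustrative only; the compose file may or may not already set limits):

# cap each narwhal container to 2 CPUs and see whether the behaviour changes
docker update --cpus=2 $(docker ps -q --filter "name=docker_run-narwhal")
# the same limit can be made permanent with a cpus: entry on each service in docker-compose.yml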

Here is one relevant log line:

docker_run-narwhal-prim0-1          | 2022-08-22T14:58:36.423629542Z  WARN primary::core: Authority /iIye00hpCVaUQlM5AQ3s7OtO3Fc4Z2SPzzmVVHIQbM= appears in quorum more than once

That looks like an issue in our certificate production, and I'll try to chase it down.

Re: MYSTEN_DB_WRITE_BUFFER_SIZE_MB and MYSTEN_DB_WAL_SIZE_MB:

  • Those env vars are read here: https://github.com/MystenLabs/mysten-infra/blob/8d090689be14078f2ca41c356e7bbc0af21f73ab/crates/typed-store/src/rocks/mod.rs#L441-L478
  • Though the messages are scary, those warnings are not fatal; the system will just fall back to the default values (see the sketch below).
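
To silence the messages entirely, the variables can simply be set and passed into the containers via the compose environment: section; the values below are placeholders, not recommendations:

# placeholder values; typed-store falls back to its own defaults when these are unset,
# so setting them only removes the "is not set" log lines
export MYSTEN_DB_WRITE_BUFFER_SIZE_MB=1024
export MYSTEN_DB_WAL_SIZE_MB=1024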

huitseeker avatar Aug 22 '22 15:08 huitseeker

Additionally, the mounted logs are never produced. Looking at the docker compose file, we're not teeing the log output to a file; perhaps that's why?

The logs are going to stdout/stderr for collection by the Loki (promtail) log collection process.

allan-bailey avatar Sep 19 '22 20:09 allan-bailey

Is this still a problem? The initially reported error seems to have been related to disk issues. Did you try docker compose down -v to delete the volumes before bringing it back up?
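
That is, something like this to start over from a clean slate:

docker compose down -v    # removes the containers plus the volumes holding the RocksDB state
docker compose up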

allan-bailey avatar Sep 19 '22 20:09 allan-bailey