
Example raft-kv-memstore hangs after printing change-membership

Open vishwaspai opened this issue 3 years ago • 17 comments

I took a fresh checkout of main (commit hash: 347aca11c913b814bba77cfee6f9635c03b353e3) and ran ./test-cluster.sh in raft-kv-memstore.

But it hangs after reaching this point:

... Changing membership from [1] to 3 nodes cluster: [1, 2, 3]

--- rpc(:21001/change-membership, [1, 2, 3])

I see that 3 processes are running; the first process is using 100% CPU and the other 2 are almost idle. I could not access the Discord channel (for some reason it does not open), hence I am reporting it here. If required, I can share any specific logs. P.S.: the raft-kv-rocksdb example works fine.

vishwaspai avatar Sep 09 '22 10:09 vishwaspai

👋 Thanks for opening this issue!

Get help or engage by:

  • /help : to print help messages.
  • /assignme : to assign this issue to you.

github-actions[bot] avatar Sep 09 '22 10:09 github-actions[bot]

I did some debugging. As soon as node-2 gets added (via add-learner), the leader (node-1) starts sending append-entries RPCs to node-2 aggressively. At this time the cluster has only 1 leader and 1 learner. I think this overwhelms either node-1 or node-2, and things do not work after that.

If I change the config in this example (see below) to the same as rocksdb, things work fine. I still see append-entries RPCs on node-2, but only a few messages per second.

let config = Arc::new(Config {
    heartbeat_interval: 250,
    ..Default::default()
});

Why does it go wrong with the default heartbeat_interval of 50?

vishwaspai avatar Sep 10 '22 01:09 vishwaspai

The logs will be helpful. After running ./test-cluster.sh, there should be 3 log files in the dir examples/raft-kv-memstore: n1.log, n2.log, n3.log.

Please attach these logs and let me look at what's happening.

Thanks for letting me know about this issue :D

drmingdrmer avatar Sep 10 '22 13:09 drmingdrmer

Attached logs

raft-kv-memstore-logs.tar.gz

vishwaspai avatar Sep 10 '22 13:09 vishwaspai

Well... I did not see anything wrong in the logs, except a lot of timeouts:

  • Node-1 keeps failing to send append-entries RPCs; you can find logs like handle_update_matched target=2 result=Err("deadline has elapsed") since 2022-09-10T13:50:19Z in n1.log (the leader).
  • Node-2 received append-entries requests and dealt with them correctly, but it failed to respond before the 50ms timeout. You can find logs like append-entries request vote=vote:1-1 prev_log_id=1-1-63 entries=1-1-64:blank ... 1-1-175:blank leader_committed=1-1-138 in n2.log (a learner).

It looks like just a simple timeout issue. Were you running this test on a slow computer? 🤔

And as far as I know, extending the heartbeat interval should solve this issue, as you mentioned. :)

drmingdrmer avatar Sep 12 '22 03:09 drmingdrmer

Heartbeat interval affects the timeout: an append-entries RPC should not take longer than the interval between 2 heartbeats.

drmingdrmer avatar Sep 12 '22 03:09 drmingdrmer

It looks like just a simple timeout issue. Were you running this test on a slow computer?

No. I'm running on i7-12700H with 14 cores.

vishwaspai avatar Sep 12 '22 04:09 vishwaspai

I have not yet figured out what is going on from the logs.

I updated the main branch to make the examples output a more detailed log. Could you rerun the test with the latest main branch, ac4830923bb828288f8a33538991a2964658821a? Then let's see what's going on.

And what are your OS and rust-toolchain?

And could you attach the examples/raft-kv-memstore/Cargo.lock that is built when running test-cluster.sh, so that I can check the dependency crate versions?

drmingdrmer avatar Sep 13 '22 06:09 drmingdrmer

Added all the information in the attached tar.

raft-kv-memstore-logs-ac48309.tar.gz

vishwaspai avatar Sep 13 '22 06:09 vishwaspai

Do you have a proxy installed on your machine? 🤔

2022-09-13T06:27:43.448747Z DEBUG ThreadId(01) new{id=1 cluster=foo}:RaftCore{id=1 cluster=foo}:replication{id=1 target=2}:main{vote=vote:1-1 target=2 cluster=foo}:line_rate_loop:send_append_entries: start sending append_entries, timeout: 50 payload=vote=vote:1-1, prev_log_id=1-1-82, leader_commit=1-1-83, entries=1-1-83:blank

2022-09-13T06:27:43.482663Z DEBUG ThreadId(01) new{id=1 cluster=foo}:RaftCore{id=1 cluster=foo}:replication{id=1 target=2}:main{vote=vote:1-1 target=2 cluster=foo}:line_rate_loop:send_append_entries: starting new connection: http://127.0.0.1:21002/    

The first log is printed by openraft at: https://github.com/datafuselabs/openraft/blob/feb9ef7c463e78ab8b3014aeb61e2b117257dca2/openraft/src/replication/mod.rs#L326-L335

The second log is printed by reqwest: https://github.com/seanmonstar/reqwest/blob/8b37ae4b15f668e3e3f9b8cf43b547fc4f95e444/src/connect.rs#L560

With the following command you can see that in n1.log, every starting new connection... log line is delayed by about 35ms:

cat n1.log | grep 'starting new connection' -B1 | grep -v '^\-\-' | awk '{v=$1; gsub("2022-09-13T..:..:",   "",   v); gsub("Z",  "",  v); print (v-a)*1000 "ms "; a=v}'

Output:

...
1.318ms
35.363ms
1.182ms
34.744ms
1.313ms
35.149ms
1.329ms
35.211ms

reqwest::Connector tries every proxy. Can you help me check whether there is a proxy on your machine that might delay every RPC message by 35ms?
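
For reference, here is a minimal sketch (assuming reqwest and tokio as dependencies; the endpoint and the 50ms timeout are placeholders) that times one request with proxy resolution disabled via reqwest's no_proxy() builder option, to rule the proxy lookup in or out:

use std::time::{Duration, Instant};

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Disable proxy resolution entirely, to check whether the ~35ms
    // per-connection delay comes from the system proxy lookup.
    let client = reqwest::Client::builder()
        .no_proxy()
        .timeout(Duration::from_millis(50))
        .build()?;

    let start = Instant::now();
    // Placeholder endpoint: any running node's metrics URL works.
    let resp = client.get("http://127.0.0.1:21001/metrics").send().await?;
    println!("status={} elapsed={:?}", resp.status(), start.elapsed());
    Ok(())
}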

drmingdrmer avatar Sep 13 '22 10:09 drmingdrmer

Do you have a proxy installed on your machine?

No.

vishwaspai avatar Sep 13 '22 10:09 vishwaspai

It looks like it is not an issue with openraft, but an issue at the network layer.

I'm not an expert on reqwest and can not figure out what's possibly wrong. :(

Can you do some single-step debugging of it on your machine? This test does almost the same thing as test-cluster.sh; you can get started with it: https://github.com/datafuselabs/openraft/blob/main/examples/raft-kv-memstore/tests/cluster/test_cluster.rs

Another concern is whether a normal curl will be delayed by 35ms on your machine. To confirm that, try the latest main branch; I added timing to every curl in test-cluster.sh: 59cc9e86e92364334f95587440665aeef947aec2

drmingdrmer avatar Sep 13 '22 14:09 drmingdrmer

For now, I ran curl with -w '\nConnect: %{time_connect}s\nXfer : %{time_starttransfer}s\nTotal : %{time_total}s\n' (instead of time) and the following is the result.

Based on the results, I see that add-learner is taking time; the other curl requests are fast. At least, I don't see this as a networking issue. Anyway, I'll see if I can spend time on single-stepping.

Initialize server 1 as a single-node cluster

--- rpc(:21001/init, {})  => 3ms
Connect: 0.000067s
Xfer   : 0.003601s
Total  : 0.003638s


Server 1 is a leader now
Get metrics from the leader

--- rpc(:21001/metrics, ) => 0.9ms
Connect: 0.000067s
Xfer   : 0.000851s
Total  : 0.000866s


Adding node 2 and node 3 as learners, to receive log from leader node 1

--- rpc(:21001/add-learner, [2, "127.0.0.1:21002"]) =>  157ms
Connect: 0.000203s
Xfer   : 0.157091s
Total  : 0.157123s


Node 2 added as learner

--- rpc(:21001/add-learner, [3, "127.0.0.1:21003"]) => 540ms
Xfer   : 0.539554s
Total  : 0.539581s


Node 3 added as learner
Get metrics from the leader, after adding 2 learners

--- rpc(:21001/metrics, ) => 0.7 ms
Connect: 0.000045s
Xfer   : 0.000678s
Total  : 0.000693s

Changing membership from [1] to 3 nodes cluster: [1, 2, 3]

--- rpc(:21001/change-membership, [1, 2, 3])
...

vishwaspai avatar Sep 14 '22 01:09 vishwaspai

Based on the results, I see that add-learner is taking time

When adding a learner, the leader will block until the new learner catches up with the logs, thus it takes a longer time, which should be expected behavior.

drmingdrmer avatar Sep 14 '22 01:09 drmingdrmer

I could not find much with additional debugging. With a heartbeat above 100ms, things work fine. Below this, I see that the append-entries RPC times out. I tried some tweaking (number of workers, etc.) on actix_web, but I do not see much change in behavior.

vishwaspai avatar Sep 15 '22 14:09 vishwaspai

Can I do something else to help? Is this problem specific to my environment?

vishwaspai avatar Sep 20 '22 09:09 vishwaspai

Can I do something else to help? Is this problem specific to my environment?

The problem happens after calling send_append_entries() and before calling reqwest: https://github.com/seanmonstar/reqwest/blob/8b37ae4b15f668e3e3f9b8cf43b547fc4f95e444/src/connect.rs#L560

As I mentioned in a previous comment: https://github.com/datafuselabs/openraft/issues/550#issuecomment-1245189859

I still cannot reproduce the 35ms delay on my laptop or in CI. Maybe you can find out what's going on on your machine.

Thank you in advance!

drmingdrmer avatar Sep 21 '22 01:09 drmingdrmer

I am experiencing the same issue: the test just hangs after showing the membership-change message. There is 0ms of proxy delay, as I have no proxies or anything like that configured on this computer. The test_cluster.rs test also hangs at the same point.

As with @vishwaspai, if I increase the heartbeat_interval to some value higher than 100 it works (the test fails because the last read from node 3 returns an empty string instead of bar, but it works fine in the shell script). I have no clue what is going on, but the 100 value barrier feels quite suspicious to me; I will try to test in other environments.

Attached are the logs of the three nodes plus my Cargo.lock, in case they can be of any help.

raft-kv-memstore-logs-and-lock.tar.gz

DamnWidget avatar Nov 21 '22 12:11 DamnWidget

There is definitely something wrong with the environment. I ran it in another environment with zero issues; these are my details:

Legend:

  • Env1: Pop_OS! 22.04 LTS with Linux Kernel 6.0.6, AMD Ryzen 7 4800H CPU
  • Env2: Artix Linux with Linux Kernel 5.15.26-lts, Intel i7-6700K CPU

I will try on another computer, also running Artix Linux, with kernel 6.0.x and an AMD 3950X CPU, in a few minutes.

DamnWidget avatar Nov 21 '22 12:11 DamnWidget

Third test environment:

  • Artix Linux with Kernel 6.0.6 (zen flavor) and AMD 3950X CPU

No issues. It is probably something with Ubuntu; I will try to run on this last env with a live distro or a VM if I find the time.

DamnWidget avatar Nov 21 '22 12:11 DamnWidget

Thank you so much @DamnWidget !

drmingdrmer avatar Nov 21 '22 12:11 drmingdrmer

It happened recently in a CI session: https://github.com/datafuselabs/openraft/actions/runs/3649785637/jobs/6164911737

~This issue affects the memstore example, but the rocksdb example works fine. I think the problem is with actix_web, which is used by the memstore example, while rocksdb uses toy_rpc as its server.~

Update: My mistake. It's because the rocksdb example is configured with a greater heartbeat timeout.

drmingdrmer avatar Dec 09 '22 02:12 drmingdrmer

In all of my tests it happened if the heartbeat was configured to any value lower than 101 milliseconds. It also only happened on Pop_OS! when I tried; unfortunately I didn't find the time to try other Ubuntu- or Debian-based distributions. It definitely wasn't a problem on Artix.

DamnWidget avatar Dec 09 '22 09:12 DamnWidget

This problem is caused by an issue with cargo that affects the openssl crate.

  1. The example examples/raft-kv-memstore uses reqwest to send RPCs to the raft service, and reqwest depends on native-tls, which in turn depends on openssl.

  2. When openssl::SslConnector::builder() is called, it will load certificate files if the env variable SSL_CERT_FILE is set: https://github.com/sfackler/rust-openssl/blob/eaee383429d156bd91c4a188ba57cf1747c2e440/openssl/src/ssl/mod.rs#L889-L896

    strace shows the time spent on loading certificates:

    1670588693.413004 (+     0.000084) openat(AT_FDCWD, "/usr/lib/ssl/certs/ca-certificates.crt", O_RDONLY) = 3
    1670588693.413054 (+     0.000049) newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=190243, ...}, AT_EMPTY_PATH) = 0
    1670588693.413093 (+     0.000038) read(3, "-----BEGIN CERTIFICATE-----\nMIIH"..., 4096) = 4096
    1670588693.413790 (+     0.000697) read(3, "8B****************************vJw"..., 4096) = 4096
    ...
    1670588693.474642 (+     0.001588) read(3, "D\n*****************************T"..., 4096) = 4096
    1670588693.474848 (+     0.000205) brk(0x55e121d5d000) = 0x55e121d5d000
    1670588693.474888 (+     0.000039) brk(0x55e121d5c000) = 0x55e121d5c000
    1670588693.478875 (+     0.001058) read(3, "", 4096) = 0
    
  3. And on Linux (at least on Ubuntu), the command cargo run pollutes environment variables: it leaks SSL_CERT_FILE and SSL_CERT_DIR to the testing sub-process it runs.

Finally, every time a reqwest::Client is created, it spends several dozen milliseconds loading a bunch of certificates, which times out a raft RPC (50 ms).
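
If the example builds a new reqwest::Client for every RPC, a hedged mitigation sketch (assuming once_cell, serde_json and reqwest's json feature; send_rpc is a hypothetical helper, not the example's actual code) is to build the client once, so the certificate bundle is loaded a single time:

use once_cell::sync::Lazy;

// One shared client for the whole process; the openssl certificate
// bundle is then loaded once at first use instead of on every RPC.
static HTTP_CLIENT: Lazy<reqwest::Client> = Lazy::new(|| {
    reqwest::Client::builder()
        .build()
        .expect("failed to build reqwest client")
});

// Hypothetical helper: send a JSON RPC through the shared client.
async fn send_rpc(url: &str, body: &serde_json::Value) -> Result<serde_json::Value, reqwest::Error> {
    HTTP_CLIENT.post(url).json(body).send().await?.json().await
}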

I created a mini demo showing that on Linux, cargo xxx will slow down an openssl-based program: https://github.com/drmingdrmer/test-native-tls-rs

On Linux:

cargo run

SSL_CERT_DIR: /usr/lib/ssl/certs
SSL_CERT_FILE: /usr/lib/ssl/certs/ca-certificates.crt
...
=== : 66.382982ms

On my m1 mac:

cargo run

...
=== : 9.23432ms
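
For comparison, a minimal sketch of that kind of measurement (assuming the openssl crate; the actual demo code may differ): it times SslConnector::builder(), which is where the certificate bundle gets loaded when SSL_CERT_FILE is set:

use std::time::Instant;
use openssl::ssl::{SslConnector, SslMethod};

fn main() {
    // cargo run leaks SSL_CERT_FILE / SSL_CERT_DIR into the child process;
    // print them to see whether certificate loading will be triggered.
    for var in ["SSL_CERT_DIR", "SSL_CERT_FILE"] {
        println!("{}: {:?}", var, std::env::var(var).ok());
    }

    // Building the connector reads the certificate bundle when
    // SSL_CERT_FILE is set, which is where the extra milliseconds go.
    let start = Instant::now();
    let _connector = SslConnector::builder(SslMethod::tls()).expect("openssl init failed");
    println!("=== : {:?}", start.elapsed());
}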

Related issues:

  • Works under cargo test, but not when harness is executed directly? https://github.com/sfackler/rust-openssl/issues/575

  • Cargo appears to leak SSL_CERT_FILE and SSL_CERT_DIR to subprocesses https://github.com/rust-lang/cargo/issues/3676

  • Cargo binaries are not run in pristine environment https://github.com/rust-lang/cargo/issues/2888

drmingdrmer avatar Dec 09 '22 15:12 drmingdrmer

Nice catch. It probably doesn't happen on Artix because it uses a more modern version of OpenSSL. I will try to check as soon as I can get to that computer.

DamnWidget avatar Dec 09 '22 17:12 DamnWidget