
bor - sync stops and resumes after ~30 seconds

Open andreicupaciu opened this issue 1 year ago • 9 comments

Hi!

I'm having sync issues on 3 bor nodes. Every 3-4 minutes, each node stops syncing for a couple of seconds. The issue appeared ~2 weeks ago; the nodes have been up for ~2 years with no problems until now. All nodes are in the EU and have identical hardware:

NVMe drives are used for the bor database, with strong AMD CPUs and 256 GB RAM each. bor runs inside a Docker container with 32 CPUs / 128 GB RAM allocated. The NVMe drives are not shared with other projects.

Client versions:

- bor: 1.1.0 (2 nodes), 1.2.3 (1 node)
- heimdall: 1.0.3

Errors seen in the bor logs during the sync stop:

    WARN [01-30|07:18:30.964] an error while trying fetching from Heimdall path="/milestone/ID/9839a926-38ee-460c-9b19-05d7bbbc31cd - 0xa1dff281c5b731b6da4d384055de73f683902000" attempt=1 error="error while fetching data from Heimdall: response code 500"
    WARN [01-30|07:18:50.966] an error while trying fetching from Heimdall path="/milestone/ID/9839a926-38ee-460c-9b19-05d7bbbc31cd - 0xa1dff281c5b731b6da4d384055de73f683902000" attempt=5 error="error while fetching data from Heimdall: response code 500"
    WARN [01-30|07:19:34.309] an error while trying fetching from Heimdall path="/milestone/ID/53007ce6-46c3-4905-9778-bd278cd71623 - 0x1ab64376e7369d35b3921d0d7114a61a8114b9a7" attempt=1 error="error while fetching data from Heimdall: response code 500"
    WARN [01-30|07:19:54.311] an error while trying fetching from Heimdall path="/milestone/ID/53007ce6-46c3-4905-9778-bd278cd71623 - 0x1ab64376e7369d35b3921d0d7114a61a8114b9a7" attempt=5 error="error while fetching data from Heimdall: response code 500"
    WARN [01-30|07:20:04.211] an error while trying fetching from Heimdall path="/milestone/ID/73e21045-3b94-4be1-a1ee-c2bd2682b6a6 - 0x01805ff9d0a87650f52de7d12a80d75a995cf3b7" attempt=1 error="error while fetching data from Heimdall: response code 500"
    WARN [01-30|07:20:22.876] an error while trying fetching from Heimdall path="/milestone/ID/73e21045-3b94-4be1-a1ee-c2bd2682b6a6 - 0x01805ff9d0a87650f52de7d12a80d75a995cf3b7" attempt=1 error="error while fetching data from Heimdall: response code 500"
    WARN [01-30|07:20:51.873] an error while trying fetching from Heimdall path="/milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521 - 0x4099f90e3f700b78bf98fa43617b2ce8f40f48a8" attempt=1 error="error while fetching data from Heimdall: response code 500"
    WARN [01-30|07:21:11.876] an error while trying fetching from Heimdall path="/milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521 - 0x4099f90e3f700b78bf98fa43617b2ce8f40f48a8" attempt=5 error="error while fetching data from Heimdall: response code 500"

During this time, bor stops syncing and the following logs appear:

    WARN [01-30|07:25:16.510] Served bor_getVoteOnHash conn=192.168.0.3:46018 reqid=6309 duration=29.90012752s err="Milestone ID doesn't exist in Heimdall"
    INFO [01-30|07:25:16.513] Got new milestone from heimdall start=52,915,439 end=52,915,458 hash=0xb30a388c47a0422624b9ec3ba72d03a16560e8f258d287f544d28c3527dc7ce0

After the logs above, bor's eth.syncing returns true and it catches up (~20-30 blocks behind). This happens every 3-5 minutes, and each time bor falls behind by ~20-30 blocks.

On heimdall, I see GETs like:

    x.x.x.x - - [30/Jan/2024:07:20:51 +0000] "GET /milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521%20-%200x4099f90e3f700b78bf98fa43617b2ce8f40f48a8 HTTP/1.1" 500 173 "-" "Go-http-client/1.1" "-"
    x.x.x.x - - [30/Jan/2024:07:20:56 +0000] "GET /milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521%20-%200x4099f90e3f700b78bf98fa43617b2ce8f40f48a8 HTTP/1.1" 500 173 "-" "Go-http-client/1.1" "-"
    x.x.x.x - - [30/Jan/2024:07:21:01 +0000] "GET /milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521%20-%200x4099f90e3f700b78bf98fa43617b2ce8f40f48a8 HTTP/1.1" 500 173 "-" "Go-http-client/1.1" "-"
    x.x.x.x - - [30/Jan/2024:07:21:06 +0000] "GET /milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521%20-%200x4099f90e3f700b78bf98fa43617b2ce8f40f48a8 HTTP/1.1" 500 173 "-" "Go-http-client/1.1" "-"
    x.x.x.x - - [30/Jan/2024:07:21:11 +0000] "GET /milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521%20-%200x4099f90e3f700b78bf98fa43617b2ce8f40f48a8 HTTP/1.1" 500 173 "-" "Go-http-client/1.1" "-"
    x.x.x.x - - [30/Jan/2024:07:21:16 +0000] "GET /milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521%20-%200x4099f90e3f700b78bf98fa43617b2ce8f40f48a8 HTTP/1.1" 500 173 "-" "Go-http-client/1.1" "-"

andreicupaciu avatar Jan 30 '24 07:01 andreicupaciu

Can you please check whether the Heimdall node is working fine, and that the Heimdall URL value is correct in the Bor config?

VAIBHAVJINDAL3012 avatar Feb 01 '24 08:02 VAIBHAVJINDAL3012

Yes, everything is correct in the bor config / heimdall config. Again, the issue is temporary:

After this last error:

    WARN [01-30|07:25:16.510] Served bor_getVoteOnHash conn=192.168.0.3:46018 reqid=6309 duration=29.90012752s err="Milestone ID doesn't exist in Heimdall"

Bor gets a new milestone from heimdall after sending this GET request to heimdall:

    x.x.x.x - - [01/Feb/2024:11:38:43 +0000] "GET /milestone/latest HTTP/1.1" 200 351 "-" "Go-http-client/1.1" "-"

Bor logs:

    INFO [01-30|07:25:16.513] Got new milestone from heimdall start=52,915,439 end=52,915,458 hash=0xb30a388c47a0422624b9ec3ba72d03a16560e8f258d287f544d28c3527dc7ce0

Bor starts to sync again (eth.syncing returns true), being 20-30 blocks behind.

andreicupaciu avatar Feb 01 '24 11:02 andreicupaciu

Heimdall error observed:

    INFO [2024-02-02|06:48:56.498] Served RPC HTTP response module=rest-server method=GET url=/milestone/ID/4d57f1d5-1eb2-4925-a299-9a0c7308ffaa%20-%200xed83ea13b553951c6522a22e87d65766dea4d3d3 status=500 duration=0 remoteAddr=x.x.x.x:57696
    ERROR[2024-02-02|06:48:56.508] Panic in RPC HTTP handler module=rest-server err="interface conversion: interface {} is nil, not string" stack=
    goroutine 841 [running]:
    runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x65
    github.com/maticnetwork/heimdall/server.recoverAndLog.func1.1()
        /go/src/github.com/maticnetwork/heimdall/server/root.go:146 +0x14f
    panic({0x1a27c00, 0xc000641b00})
        /usr/local/go/src/runtime/panic.go:884 +0x213
    github.com/maticnetwork/heimdall/checkpoint/types.GetMilestoneID(...)
        /go/src/github.com/maticnetwork/heimdall/checkpoint/types/cache.go:10
    github.com/maticnetwork/heimdall/checkpoint/client/rest.milestoneByIDHandlerFn.func1({0x26aa2b0, 0xc0005d12d8}, 0xc000641a70?)
        /go/src/github.com/maticnetwork/heimdall/checkpoint/client/rest/query_milestone.go:255 +0x405
    net/http.HandlerFunc.ServeHTTP(0xc000ada500?, {0x26aa2b0?, 0xc0005d12d8?}, 0x7f98c9b4a168?)
        /usr/local/go/src/net/http/server.go:2122 +0x2f
    github.com/gorilla/mux.(*Router).ServeHTTP(0xc000001140, {0x26aa2b0, 0xc0005d12d8}, 0xc000ada400)
        /root/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210 +0x1cf
    github.com/maticnetwork/heimdall/server.maxBytesHandler.ServeHTTP({{0x269a560?, 0xc000001140?}, 0x40fe27?}, {0x26aa2b0?, 0xc0005d12d8}, 0xc000ada400)
        /go/src/github.com/maticnetwork/heimdall/server/root.go:120 +0x124
    github.com/maticnetwork/heimdall/server.recoverAndLog.func1({0x26ac080?, 0xc00067a700}, 0xc000ada400)
        /go/src/github.com/maticnetwork/heimdall/server/root.go:165 +0x326
    github.com/maticnetwork/heimdall/server.startRPCServer.func1({0x26ac080, 0xc00067a700}, 0xc000ada400)
        /go/src/github.com/maticnetwork/heimdall/server/root.go:188 +0xe2
    net/http.HandlerFunc.ServeHTTP(0x0?, {0x26ac080?, 0xc00067a700?}, 0x47010e?)
        /usr/local/go/src/net/http/server.go:2122 +0x2f
    net/http.serverHandler.ServeHTTP({0xc000641920?}, {0x26ac080, 0xc00067a700}, 0xc000ada400)
        /usr/local/go/src/net/http/server.go:2936 +0x316
    net/http.(*conn).serve(0xc000a9f170, {0x26ad168, 0xc000c7e0c0})
        /usr/local/go/src/net/http/server.go:1995 +0x612
    created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:3089 +0x5ed

andreicupaciu avatar Feb 02 '24 06:02 andreicupaciu

Are you running a validator?

VAIBHAVJINDAL3012 avatar Feb 08 '24 11:02 VAIBHAVJINDAL3012

No, full nodes for RPC purposes.

andreicupaciu avatar Feb 08 '24 11:02 andreicupaciu

Additional info: when using the public heimdall endpoint (https://heimdall-api.polygon.technology), sync works as expected.

heimdall config.toml:

    proxy_app = "tcp://127.0.0.1:26658"
    moniker = "e6b94a5a8355"
    fast_sync = true
    db_backend = "goleveldb"
    db_dir = "data"
    log_level = "main:info,state:info,statesync:info,:error"
    log_format = "plain"
    genesis_file = "config/genesis.json"
    priv_validator_key_file = "config/priv_validator_key.json"
    priv_validator_state_file = "data/priv_validator_state.json"
    priv_validator_laddr = ""
    node_key_file = "config/node_key.json"
    abci = "socket"
    filter_peers = false

    [rpc]
    laddr = "tcp://127.0.0.1:26657"
    cors_allowed_origins = [""]
    cors_allowed_methods = ["HEAD", "GET", "POST", ]
    cors_allowed_headers = ["Origin", "Accept", "Content-Type", "X-Requested-With", "X-Server-Time", ]
    grpc_laddr = ""
    grpc_max_open_connections = 900
    unsafe = false
    max_open_connections = 900
    max_subscription_clients = 100
    max_subscriptions_per_client = 5
    timeout_broadcast_tx_commit = "10s"
    max_body_bytes = 1000000
    max_header_bytes = 1048576
    tls_cert_file = ""
    tls_key_file = ""
    pprof_laddr = "localhost:6060"

    [p2p]
    laddr = "tcp://0.0.0.0:26656"
    external_address = "tcp://x.x.x.x:26656"
    seeds = "[email protected]:26656,[email protected]:26656"
    persistent_peers = ""
    upnp = false
    addr_book_file = "config/addrbook.json"
    addr_book_strict = true
    max_num_inbound_peers = 40
    max_num_outbound_peers = 10
    unconditional_peer_ids = ""
    persistent_peers_max_dial_period = "0s"
    flush_throttle_timeout = "100ms"
    max_packet_msg_payload_size = 1024
    send_rate = 5120000
    recv_rate = 5120000
    pex = true
    seed_mode = false
    private_peer_ids = ""
    allow_duplicate_ip = false
    handshake_timeout = "20s"
    dial_timeout = "3s"

    [mempool]
    recheck = true
    broadcast = true
    wal_dir = ""
    size = 5000
    max_txs_bytes = 1073741824
    cache_size = 10000
    max_tx_bytes = 1048576
    max_batch_bytes = 10485760

    [statesync]
    enable = false
    rpc_servers = ""
    trust_height = 0
    trust_hash = ""
    trust_period = "168h0m0s"
    discovery_time = "15s"
    temp_dir = ""

    [fastsync]
    version = "v0"

    [consensus]
    wal_file = "data/cs.wal/wal"
    timeout_propose = "3s"
    timeout_propose_delta = "500ms"
    timeout_prevote = "1s"
    timeout_prevote_delta = "500ms"
    timeout_precommit = "1s"
    timeout_precommit_delta = "500ms"
    timeout_commit = "5s"
    double_sign_check_height = 0
    skip_timeout_commit = false
    create_empty_blocks = true
    create_empty_blocks_interval = "0s"
    peer_gossip_sleep_duration = "100ms"
    peer_query_maj23_sleep_duration = "2s"

    [tx_index]
    indexer = "kv"

    [instrumentation]
    prometheus = false
    prometheus_listen_addr = ":26660"
    max_open_connections = 3
    namespace = "tendermint"

heimdall-config.toml:

    eth_rpc_url = "< http th endpoint >"
    bor_rpc_url = "<nginx <-> bor enpoint>"
    tendermint_rpc_url = "http://0.0.0.0:36657"
    heimdall_rest_server = "http://0.0.0.0:1317"
    amqp_url = "amqp://guest:guest@rabbitmq:5672/"
    checkpoint_poll_interval = "5m0s"
    syncer_poll_interval = "1m0s"
    noack_poll_interval = "16m50s"
    clerk_poll_interval = "10s"
    span_poll_interval = "1m0s"
    main_chain_gas_limit = "5000000"
    no_ack_wait_time = "30m0s"

andreicupaciu avatar Feb 08 '24 11:02 andreicupaciu

Hey @andreicupaciu, we'll have a look at this. Also, the fix for the panic is present in heimdall v1.0.4-beta. We will soon release a stable tag, and then you can upgrade your heimdall to it.

Raneet10 avatar Feb 15 '24 10:02 Raneet10

This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions[bot] avatar Mar 01 '24 00:03 github-actions[bot]

Hey @andreicupaciu, just following up here. Did you manage to update to the latest version which includes the fix for the panic? Thanks!

manav2401 avatar Mar 14 '24 05:03 manav2401

This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions[bot] avatar Mar 29 '24 00:03 github-actions[bot]

I hope the issue is resolved on your end; closing it for now.

VAIBHAVJINDAL3012 avatar Apr 04 '24 13:04 VAIBHAVJINDAL3012