bor - sync stops and resumes after ~30 seconds
Hi!
I'm having sync issues on 3 bor nodes. Every 3-4 minutes, each node stops syncing for a couple of seconds. The issue appeared ~2 weeks ago; the nodes have been up for ~2 years with no issues until now. All nodes are in the EU and have identical hardware:
NVMe drives are used for the bor database and are not shared with other projects. Strong AMD CPUs, 256 GB RAM each. bor runs inside a Docker container with 32 CPUs / 128 GB RAM allocated.
Client versions: bor 1.1.0 (2 nodes), bor 1.2.3 (1 node); heimdall 1.0.3
Errors seen on bor during the sync stall:
WARN [01-30|07:18:30.964] an error while trying fetching from Heimdall path="/milestone/ID/9839a926-38ee-460c-9b19-05d7bbbc31cd - 0xa1dff281c5b731b6da4d384055de73f683902000" attempt=1 error="error while fetching data from Heimdall: response code 500"
WARN [01-30|07:18:50.966] an error while trying fetching from Heimdall path="/milestone/ID/9839a926-38ee-460c-9b19-05d7bbbc31cd - 0xa1dff281c5b731b6da4d384055de73f683902000" attempt=5 error="error while fetching data from Heimdall: response code 500"
WARN [01-30|07:19:34.309] an error while trying fetching from Heimdall path="/milestone/ID/53007ce6-46c3-4905-9778-bd278cd71623 - 0x1ab64376e7369d35b3921d0d7114a61a8114b9a7" attempt=1 error="error while fetching data from Heimdall: response code 500"
WARN [01-30|07:19:54.311] an error while trying fetching from Heimdall path="/milestone/ID/53007ce6-46c3-4905-9778-bd278cd71623 - 0x1ab64376e7369d35b3921d0d7114a61a8114b9a7" attempt=5 error="error while fetching data from Heimdall: response code 500"
WARN [01-30|07:20:04.211] an error while trying fetching from Heimdall path="/milestone/ID/73e21045-3b94-4be1-a1ee-c2bd2682b6a6 - 0x01805ff9d0a87650f52de7d12a80d75a995cf3b7" attempt=1 error="error while fetching data from Heimdall: response code 500"
WARN [01-30|07:20:22.876] an error while trying fetching from Heimdall path="/milestone/ID/73e21045-3b94-4be1-a1ee-c2bd2682b6a6 - 0x01805ff9d0a87650f52de7d12a80d75a995cf3b7" attempt=1 error="error while fetching data from Heimdall: response code 500"
WARN [01-30|07:20:51.873] an error while trying fetching from Heimdall path="/milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521 - 0x4099f90e3f700b78bf98fa43617b2ce8f40f48a8" attempt=1 error="error while fetching data from Heimdall: response code 500"
WARN [01-30|07:21:11.876] an error while trying fetching from Heimdall path="/milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521 - 0x4099f90e3f700b78bf98fa43617b2ce8f40f48a8" attempt=5 error="error while fetching data from Heimdall: response code 500"
During this time, bor stops syncing and the following log appears:
WARN [01-30|07:25:16.510] Served bor_getVoteOnHash conn=192.168.0.3:46018 reqid=6309 duration=29.90012752s err="Milestone ID doesn't exist in Heimdall"
INFO [01-30|07:25:16.513] Got new milestone from heimdall start=52,915,439 end=52,915,458 hash=0xb30a388c47a0422624b9ec3ba72d03a16560e8f258d287f544d28c3527dc7ce0
After the logs above, bor's eth.syncing returns true and it catches up (it ends up ~20-30 blocks behind). This happens every 3-5 minutes, and each time bor falls ~20-30 blocks behind.
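For reference, I'm timing these stall windows with a small poller against bor's JSON-RPC. A minimal sketch (Go; the endpoint below assumes bor's default HTTP RPC port 8545, adjust for your setup):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// borRPC is an assumption: bor's default HTTP JSON-RPC endpoint.
const borRPC = "http://127.0.0.1:8545"

func main() {
	payload := []byte(`{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}`)
	for {
		resp, err := http.Post(borRPC, "application/json", bytes.NewReader(payload))
		if err != nil {
			fmt.Println(time.Now().Format(time.RFC3339), "rpc error:", err)
		} else {
			var out struct {
				Result json.RawMessage `json:"result"`
			}
			json.NewDecoder(resp.Body).Decode(&out)
			resp.Body.Close()
			// "false" means in sync; during a stall this flips to the
			// syncing-progress object until the node catches up again.
			fmt.Println(time.Now().Format(time.RFC3339), "eth_syncing:", string(out.Result))
		}
		time.Sleep(5 * time.Second)
	}
}
```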
On the heimdall REST server, I see GETs like:
x.x.x.x - - [30/Jan/2024:07:20:51 +0000] "GET /milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521%20-%200x4099f90e3f700b78bf98fa43617b2ce8f40f48a8 HTTP/1.1" 500 173 "-" "Go-http-client/1.1" "-"
x.x.x.x - - [30/Jan/2024:07:20:56 +0000] "GET /milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521%20-%200x4099f90e3f700b78bf98fa43617b2ce8f40f48a8 HTTP/1.1" 500 173 "-" "Go-http-client/1.1" "-"
x.x.x.x - - [30/Jan/2024:07:21:01 +0000] "GET /milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521%20-%200x4099f90e3f700b78bf98fa43617b2ce8f40f48a8 HTTP/1.1" 500 173 "-" "Go-http-client/1.1" "-"
x.x.x.x - - [30/Jan/2024:07:21:06 +0000] "GET /milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521%20-%200x4099f90e3f700b78bf98fa43617b2ce8f40f48a8 HTTP/1.1" 500 173 "-" "Go-http-client/1.1" "-"
x.x.x.x - - [30/Jan/2024:07:21:11 +0000] "GET /milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521%20-%200x4099f90e3f700b78bf98fa43617b2ce8f40f48a8 HTTP/1.1" 500 173 "-" "Go-http-client/1.1" "-"
x.x.x.x - - [30/Jan/2024:07:21:16 +0000] "GET /milestone/ID/914ad250-c2a7-4161-ba9f-ea771da23521%20-%200x4099f90e3f700b78bf98fa43617b2ce8f40f48a8 HTTP/1.1" 500 173 "-" "Go-http-client/1.1" "-"
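To rule bor out, the failing request can be reproduced directly against heimdall's REST server. A minimal Go sketch (port 1317 is taken from my heimdall-config.toml below; the milestone ID is just one copied from the logs above and is only meaningful around the time it was issued):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// heimdallREST is an assumption: the local heimdall REST server
// (heimdall_rest_server = "http://0.0.0.0:1317" in heimdall-config.toml).
const heimdallREST = "http://127.0.0.1:1317"

func get(path string) {
	resp, err := http.Get(heimdallREST + path)
	if err != nil {
		fmt.Println(path, "error:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(path, "->", resp.StatusCode, string(body))
}

func main() {
	// /milestone/latest keeps returning 200 while ...
	get("/milestone/latest")
	// ... the /milestone/ID/ lookup returns 500. The ID segment is
	// "<uuid> - <address>", so the client URL-escapes the spaces,
	// which is where the %20-%20 in the access log comes from.
	id := "914ad250-c2a7-4161-ba9f-ea771da23521 - 0x4099f90e3f700b78bf98fa43617b2ce8f40f48a8"
	get("/milestone/ID/" + url.PathEscape(id))
}
```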
Can you please check whether the Heimdall node is working fine or not, and whether the Heimdall URL value is correct in the Bor config?
Yes, everything is correct in the bor config / heimdall config. Again, the issue is temporary:
After this last error:
WARN [01-30|07:25:16.510] Served bor_getVoteOnHash conn=192.168.0.3:46018 reqid=6309 duration=29.90012752s err="Milestone ID doesn't exist in Heimdall"
Bor gets a new milestone from heimdall after sending this GET request to heimdall:
x.x.x.x - - [01/Feb/2024:11:38:43 +0000] "GET /milestone/latest HTTP/1.1" 200 351 "-" "Go-http-client/1.1" "-"
Bor logs:
INFO [01-30|07:25:16.513] Got new milestone from heimdall start=52,915,439 end=52,915,458 hash=0xb30a388c47a0422624b9ec3ba72d03a16560e8f258d287f544d28c3527dc7ce0
Bor starts to sync again (eth.syncing = true), being 20-30 blocks behind.
Heimdall REST server error observed:
INFO [2024-02-02|06:48:56.498] Served RPC HTTP response module=rest-server method=GET url=/milestone/ID/4d57f1d5-1eb2-4925-a299-9a0c7308ffaa%20-%200xed83ea13b553951c6522a22e87d65766dea4d3d3 status=500 duration=0 remoteAddr=x.x.x.x:57696
ERROR[2024-02-02|06:48:56.508] Panic in RPC HTTP handler module=rest-server err="interface conversion: interface {} is nil, not string" stack="goroutine 841 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x65\ngithub.com/maticnetwork/heimdall/server.recoverAndLog.func1.1()\n\t/go/src/github.com/maticnetwork/heimdall/server/root.go:146 +0x14f\npanic({0x1a27c00, 0xc000641b00})\n\t/usr/local/go/src/runtime/panic.go:884 +0x213\ngithub.com/maticnetwork/heimdall/checkpoint/types.GetMilestoneID(...)\n\t/go/src/github.com/maticnetwork/heimdall/checkpoint/types/cache.go:10\ngithub.com/maticnetwork/heimdall/checkpoint/client/rest.milestoneByIDHandlerFn.func1({0x26aa2b0, 0xc0005d12d8}, 0xc000641a70?)\n\t/go/src/github.com/maticnetwork/heimdall/checkpoint/client/rest/query_milestone.go:255 +0x405\nnet/http.HandlerFunc.ServeHTTP(0xc000ada500?, {0x26aa2b0?, 0xc0005d12d8?}, 0x7f98c9b4a168?)\n\t/usr/local/go/src/net/http/server.go:2122 +0x2f\ngithub.com/gorilla/mux.(*Router).ServeHTTP(0xc000001140, {0x26aa2b0, 0xc0005d12d8}, 0xc000ada400)\n\t/root/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210 +0x1cf\ngithub.com/maticnetwork/heimdall/server.maxBytesHandler.ServeHTTP({{0x269a560?, 0xc000001140?}, 0x40fe27?}, {0x26aa2b0?, 0xc0005d12d8}, 0xc000ada400)\n\t/go/src/github.com/maticnetwork/heimdall/server/root.go:120 +0x124\ngithub.com/maticnetwork/heimdall/server.recoverAndLog.func1({0x26ac080?, 0xc00067a700}, 0xc000ada400)\n\t/go/src/github.com/maticnetwork/heimdall/server/root.go:165 +0x326\ngithub.com/maticnetwork/heimdall/server.startRPCServer.func1({0x26ac080, 0xc00067a700}, 0xc000ada400)\n\t/go/src/github.com/maticnetwork/heimdall/server/root.go:188 +0xe2\nnet/http.HandlerFunc.ServeHTTP(0x0?, {0x26ac080?, 0xc00067a700?}, 0x47010e?)\n\t/usr/local/go/src/net/http/server.go:2122 +0x2f\nnet/http.serverHandler.ServeHTTP({0xc000641920?}, {0x26ac080, 0xc00067a700}, 0xc000ada400)\n\t/usr/local/go/src/net/http/server.go:2936 +0x316\nnet/http.(*conn).serve(0xc000a9f170, {0x26ad168, 0xc000c7e0c0})\n\t/usr/local/go/src/net/http/server.go:1995 +0x612\ncreated by net/http.(*Server).Serve\n\t/usr/local/go/src/net/http/server.go:3089 +0x5ed\n"
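From the stack trace, the 500s look like an unchecked type assertion: GetMilestoneID (checkpoint/types/cache.go:10) reads an interface{} from a cache and asserts it to string, and when nothing is cached the value is nil, which panics ("interface conversion: interface {} is nil, not string"). That's my reading of the trace, not heimdall's actual code; a minimal Go reproduction of the pattern:

```go
package main

import "fmt"

// Hypothetical stand-in for heimdall's milestone ID cache: empty,
// i.e. no milestone ID has been stored yet.
var milestoneCache = map[string]interface{}{}

// getMilestoneIDUnsafe mirrors the pattern the trace suggests:
// an unchecked assertion on a value that can be a nil interface{}.
func getMilestoneIDUnsafe(key string) string {
	v := milestoneCache[key] // nil when the key is absent
	return v.(string)        // panics: interface conversion: interface {} is nil, not string
}

// getMilestoneIDSafe uses the comma-ok form, which never panics.
func getMilestoneIDSafe(key string) (string, bool) {
	v, ok := milestoneCache[key].(string)
	return v, ok
}

func main() {
	if id, ok := getMilestoneIDSafe("milestoneID"); ok {
		fmt.Println("milestone ID:", id)
	} else {
		fmt.Println("no milestone ID cached") // a clean error path instead of a panic
	}
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // mirrors heimdall's recoverAndLog wrapper
		}
	}()
	_ = getMilestoneIDUnsafe("milestoneID")
}
```

The comma-ok form of the assertion is the usual fix, letting the handler return a proper error response instead of panicking per request.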
Are you running a validator?
No, full nodes for RPC purposes.
Additional info: when using the public heimdall endpoint (https://heimdall-api.polygon.technology), sync works as expected.
heimdall config.toml:
proxy_app = "tcp://127.0.0.1:26658"
moniker = "e6b94a5a8355"
fast_sync = true
db_backend = "goleveldb"
db_dir = "data"
log_level = "main:info,state:info,statesync:info,*:error"
log_format = "plain"
genesis_file = "config/genesis.json"
priv_validator_key_file = "config/priv_validator_key.json"
priv_validator_state_file = "data/priv_validator_state.json"
priv_validator_laddr = ""
node_key_file = "config/node_key.json"
abci = "socket"
filter_peers = false

[rpc]
laddr = "tcp://127.0.0.1:26657"
cors_allowed_origins = [""]
cors_allowed_methods = ["HEAD", "GET", "POST", ]
cors_allowed_headers = ["Origin", "Accept", "Content-Type", "X-Requested-With", "X-Server-Time", ]
grpc_laddr = ""
grpc_max_open_connections = 900
unsafe = false
max_open_connections = 900
max_subscription_clients = 100
max_subscriptions_per_client = 5
timeout_broadcast_tx_commit = "10s"
max_body_bytes = 1000000
max_header_bytes = 1048576
tls_cert_file = ""
tls_key_file = ""
pprof_laddr = "localhost:6060"

[p2p]
laddr = "tcp://0.0.0.0:26656"
external_address = "tcp://x.x.x.x:26656"
seeds = "[email protected]:26656,[email protected]:26656"
persistent_peers = ""
upnp = false
addr_book_file = "config/addrbook.json"
addr_book_strict = true
max_num_inbound_peers = 40
max_num_outbound_peers = 10
unconditional_peer_ids = ""
persistent_peers_max_dial_period = "0s"
flush_throttle_timeout = "100ms"
max_packet_msg_payload_size = 1024
send_rate = 5120000
recv_rate = 5120000
pex = true
seed_mode = false
private_peer_ids = ""
allow_duplicate_ip = false
handshake_timeout = "20s"
dial_timeout = "3s"

[mempool]
recheck = true
broadcast = true
wal_dir = ""
size = 5000
max_txs_bytes = 1073741824
cache_size = 10000
max_tx_bytes = 1048576
max_batch_bytes = 10485760

[statesync]
enable = false
rpc_servers = ""
trust_height = 0
trust_hash = ""
trust_period = "168h0m0s"
discovery_time = "15s"
temp_dir = ""

[fastsync]
version = "v0"

[consensus]
wal_file = "data/cs.wal/wal"
timeout_propose = "3s"
timeout_propose_delta = "500ms"
timeout_prevote = "1s"
timeout_prevote_delta = "500ms"
timeout_precommit = "1s"
timeout_precommit_delta = "500ms"
timeout_commit = "5s"
double_sign_check_height = 0
skip_timeout_commit = false
create_empty_blocks = true
create_empty_blocks_interval = "0s"
peer_gossip_sleep_duration = "100ms"
peer_query_maj23_sleep_duration = "2s"

[tx_index]
indexer = "kv"

[instrumentation]
prometheus = false
prometheus_listen_addr = ":26660"
max_open_connections = 3
namespace = "tendermint"
heimdall-config.toml:

eth_rpc_url = "<http eth endpoint>"
bor_rpc_url = "<nginx <-> bor endpoint>"
tendermint_rpc_url = "http://0.0.0.0:36657"
heimdall_rest_server = "http://0.0.0.0:1317"
amqp_url = "amqp://guest:guest@rabbitmq:5672/"
checkpoint_poll_interval = "5m0s"
syncer_poll_interval = "1m0s"
noack_poll_interval = "16m50s"
clerk_poll_interval = "10s"
span_poll_interval = "1m0s"
main_chain_gas_limit = "5000000"
no_ack_wait_time = "30m0s"
Hey @andreicupaciu, we'll have a look at this. Also, the fix for the panic is present in heimdall v1.0.4-beta. We will soon release a stable tag, and then you can upgrade your heimdall to that.
This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 14 days.
Hey @andreicupaciu, just following up here. Did you manage to update to the latest version which includes the fix for the panic? Thanks!
This issue is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 14 days.
I hope the issue is resolved on your end; closing it for now.