aptos-ait2
Disappeared from the dashboard and sync issues
My account disappeared from the dashboard and is no longer listed in the active validator set: 91251d1518590a4d36a4f160236e570b6bfb2c4c96081978e6f7c470e527db65
Also, it syncs to around 99%, then slows down significantly and can't get to 100%. Log is attached.
Plenty of peers:
```
root@aptos2:~# aptos node show-validator-set --profile ait2 | jq -r '.Result.active_validators' | grep 91251d1518590a4d36a4f160236e570b6bfb2c4c96081978e6f7c470e527db65

root@aptos2:~# curl 127.0.0.1:9101/metrics 2> /dev/null | grep "aptos_network_peer_connected{.*remote_peer_id=\"83424ccb\".*}"
aptos_network_peer_connected{network_id="Validator",peer_id="91251d15",remote_peer_id="83424ccb",role_type="validator"} 1

root@aptos2:~# curl 127.0.0.1:9101/metrics 2> /dev/null | grep "aptos_connections{.*\"Validator\".*}"
aptos_connections{direction="inbound",network_id="Validator",peer_id="91251d15",role_type="validator"} 17
aptos_connections{direction="outbound",network_id="Validator",peer_id="91251d15",role_type="validator"} 189

root@aptos2:~# curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 778568
aptos_state_sync_version{type="executed_transactions"} 769319
aptos_state_sync_version{type="synced"} 779274
aptos_state_sync_version{type="synced_epoch"} 57

root@aptos2:~# curl 127.0.0.1:9101/metrics 2> /dev/null | grep "aptos_consensus_current_round"
# HELP aptos_consensus_current_round This counter is set to the last round reported by the local round_state.
# TYPE aptos_consensus_current_round gauge
aptos_consensus_current_round 2877

root@aptos2:~# curl 127.0.0.1:9101/metrics 2> /dev/null | grep "aptos_consensus_proposals_count"
# HELP aptos_consensus_proposals_count Count of the block proposals sent by this validator since last restart (both primary and secondary)
# TYPE aptos_consensus_proposals_count counter
aptos_consensus_proposals_count 0
```
OK, now it's worse. It crashed twice after restart. Log is attached. Thanks. log2.txt
@vl8062 could you please give us more logs before the crash? thank you!
@sherry-x where is the full log located? Running a source-code node. Thanks.
Updated the file. Attached is all I can see in the terminal window: log2.txt
Crashed again after less than 5 min. Log is attached: log3.txt
@vl8062 since you're running from source, can you update to latest testnet branch and retry? https://github.com/aptos-labs/aptos-core/commits/testnet
This one? commit 93d36517ed42b798c9402ae26288a36790933ebe Thanks @zekun000
@vl8062 yup, let me know if you still see the crash
Done.
What about other issues?
- not able to sync to 100% (attached) Uploading sync1.txt…
- no proposals (could be related to the sync issue?)
- gone from the dashboard
I'm using my own VM with 20 vCPUs (10 GHz) and 16 GB RAM assigned. Thank you @zekun000
Slowly updating again, but as I mentioned, it never reaches 100%. Another log taken after the node update is attached: log4.txt
Is the version increasing? It probably needs a little time to stabilize; if there's no panic, we can wait for a while and check.
It's stuck. 2-3min between the below queries. New log is attached. log5.txt
```
aptos_state_sync_version{type="applied_transaction_outputs"} 801507
aptos_state_sync_version{type="executed_transactions"} 797662
aptos_state_sync_version{type="synced"} 801639
aptos_state_sync_version{type="synced_epoch"} 58
root@aptos2:/aptos-core# curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 801507
aptos_state_sync_version{type="executed_transactions"} 797662
aptos_state_sync_version{type="synced"} 801639
aptos_state_sync_version{type="synced_epoch"} 58
root@aptos2:/aptos-core# curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 801507
aptos_state_sync_version{type="executed_transactions"} 797662
aptos_state_sync_version{type="synced"} 801639
aptos_state_sync_version{type="synced_epoch"} 58
root@aptos2:/aptos-core# curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 801507
aptos_state_sync_version{type="executed_transactions"} 797662
aptos_state_sync_version{type="synced"} 801639
aptos_state_sync_version{type="synced_epoch"} 58
```
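Since the question is whether the synced version is advancing, the repeated curl calls above can be reduced to a single comparison. A minimal sketch; the two sample lines below are hypothetical stand-ins in the metric format shown above, and on a live node each would come from `curl -s 127.0.0.1:9101/metrics | grep 'aptos_state_sync_version{type="synced"}'` taken a minute or two apart:

```shell
#!/bin/sh
# Compare two samples of the "synced" version metric to see whether
# the node is making progress. The samples here are hard-coded for
# illustration; in practice each would be captured from the metrics port.
sample1='aptos_state_sync_version{type="synced"} 801639'
sample2='aptos_state_sync_version{type="synced"} 801639'

# The version number is the second whitespace-separated field.
v1=$(echo "$sample1" | awk '{print $2}')
v2=$(echo "$sample2" | awk '{print $2}')

if [ "$v2" -gt "$v1" ]; then
  echo "syncing: $v1 -> $v2"
else
  echo "stuck at $v2"
fi
```

With the two equal samples above it prints `stuck at 801639`, which is exactly the situation the transcript shows.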
the new log shows the network is not very stable, how many connections do you have?
```
root@aptos2:/aptos-core# curl 127.0.0.1:9101/metrics 2> /dev/null | grep "aptos_connections{.*\"Validator\".*}"
aptos_connections{direction="inbound",network_id="Validator",peer_id="91251d15",role_type="validator"} 86
aptos_connections{direction="outbound",network_id="Validator",peer_id="91251d15",role_type="validator"} 115
```
Internet connection is 50Mbps/25Mbps (Down/Up)
Usually when sync is stuck I can reboot the node, but it soon gets stuck again, never reaching 100%. It freezes only around 98-99% sync; syncing from 0 to 99% is fine and never freezes.
I rebuilt the node 3 times and also tried using Docker, but the result is always the same. Could this be because of the geolocation?
Hmm, cc @JoshLind in case he has some insights; it looks to me like there are a lot of RPC timeouts in the logs.
As soon as I restarted the node, it started syncing again. It'll stop when it gets close to 100%.
```
root@aptos2:~# curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 808820
aptos_state_sync_version{type="executed_transactions"} 801639
aptos_state_sync_version{type="synced"} 807712
aptos_state_sync_version{type="synced_epoch"} 59
```
Crashed when it reached 811829. Log attached: log6.txt
```
root@aptos2:~# curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 810943
aptos_state_sync_version{type="executed_transactions"} 801857
aptos_state_sync_version{type="synced"} 811161
aptos_state_sync_version{type="synced_epoch"} 60
root@aptos2:~# curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 811121
aptos_state_sync_version{type="executed_transactions"} 802321
aptos_state_sync_version{type="synced"} 811803
aptos_state_sync_version{type="synced_epoch"} 60
root@aptos2:~# curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 811121
aptos_state_sync_version{type="executed_transactions"} 802347
aptos_state_sync_version{type="synced"} 811829
aptos_state_sync_version{type="synced_epoch"} 60
root@aptos2:~# curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
root@aptos2:~# curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
root@aptos2:~#
```
Crashed again. log7.txt
@vl8062 can you redirect logs to a file so we can get more logs next time it crashes?
Is this the way to do it? cargo run -p aptos-node --release -- -f ~/$WORKSPACE/testnet/validator.yaml -a >> /path/to/file Thanks @zekun000
cargo run -p aptos-node --release -- -f ~/$WORKSPACE/testnet/validator.yaml 2>&1 > /path/to/file
also it seems you don't have debug logs, can you run with RUST_LOG=debug cargo run -p aptos-node --release -- -f ~/$WORKSPACE/testnet/validator.yaml 2>&1 > /path/to/file
Hm, looks like the file is not growing. I can see logs on the screen. RUST_LOG=debug cargo run -p aptos-node --release -- -f ~/$WORKSPACE/testnet/validator.yaml 2>&1 >/root/logfile.log logfile.log
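The empty file is likely a redirection-ordering issue rather than a logging problem: in `2>&1 > file` the shell duplicates stderr onto the current stdout (the terminal) first, and only then points stdout at the file, so anything the node writes to stderr stays on screen. A small demonstration of the ordering, using a stand-in command that writes to stderr:

```shell
#!/bin/sh
tmp=$(mktemp)

# Wrong order: stderr is duplicated onto the current stdout (the
# terminal) BEFORE stdout is redirected, so the stderr message
# never reaches the file.
sh -c 'echo from-stderr >&2' 2>&1 > "$tmp"
wrong=$(cat "$tmp")    # empty

# Correct order: point stdout at the file first, then send stderr
# to wherever stdout now goes.
sh -c 'echo from-stderr >&2' > "$tmp" 2>&1
right=$(cat "$tmp")    # "from-stderr"

echo "wrong='$wrong' right='$right'"
rm -f "$tmp"
```

So the node command would presumably need the redirects swapped, along the lines of `RUST_LOG=debug cargo run -p aptos-node --release -- -f ~/$WORKSPACE/testnet/validator.yaml > /root/logfile.log 2>&1`, to capture both streams in the file.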
@zekun000 Done. Using script command logfile.zip
Thanks @vl8062, this is much more helpful. 😄
@zekun000, what I can see:
- The node is moving between active state sync (to catch up), and then consensus (to participate), but it does seem to fall behind in consensus (so it ends up moving between state sync and consensus continuously).
- There's a panic at the end of the logs:
2022-07-14T09:06:37.702394Z [consensus] ERROR crates/crash-handler/src/lib.rs:38 details = '''panicked at 'Failed to commit blocks: InternalError { error: "Cannot find speculation result for block id e96324f5" }', consensus/src/state_computer.rs:175:18'''
- Before the panic, I see consensus asks state sync to sync to 845295 from 845237:
2022-07-14T09:05:58.295009Z [state-sync-driver] INFO state-sync/state-sync-v2/state-sync-driver/src/driver.rs:324 {"message":"Received a consensus sync notification! Target version: V0(LedgerInfoWithV0 { ledger_info: LedgerInfo { commit_info: BlockInfo { epoch: 65, round: 706, id: HashValue(626ca173c6968fbffb83cc28ea014e6b9543b4eb3ba64cac7902022ff96b4e4c), executed_state_id: HashValue(aa6f7d6a87f0c7d4854551347c89fd9675827a239318d3618f7d557609e94cec), version: 845295, timestamp_usecs: 1657789549394757, next_epoch_state: None }, consensus_data_hash: HashValue(28f1905b333f7ca445794dd0541c8443d8c0d0c55b3683044de53bf7f38698b9) }, ...)} }). Latest synced version: 845237","name":"consensus_notification"}
But, then I immediately see logs like:
2022-07-14T09:05:58.327848Z [consensus] DEBUG consensus/src/state_computer.rs:110 Executing block {"block_id":"f63258e0ec76b21e540679e13bbfff1851a88cbf6275c06bda9dc25e3e13bd59","parent_id":"e96324f5d63ea27356eadc988b135f310d785154c1885c57286b54b5dea6ed83"}
2022-07-14T09:05:58.327922Z [consensus] INFO execution/executor/src/block_executor.rs:104 execute_block {"block_id":"f63258e0ec76b21e540679e13bbfff1851a88cbf6275c06bda9dc25e3e13bd59","name":"block_executor"}
2022-07-14T09:05:58.357561Z [consensus] DEBUG consensus/src/state_computer.rs:110 Executing block {"block_id":"c666985bcceeb476092d3c0ddac14c2be1856a3f5665b03b226ae17903fa0ab1","parent_id":"f63258e0ec76b21e540679e13bbfff1851a88cbf6275c06bda9dc25e3e13bd59"}
2022-07-14T09:05:58.357607Z [consensus] INFO execution/executor/src/block_executor.rs:104 execute_block {"block_id":"c666985bcceeb476092d3c0ddac14c2be1856a3f5665b03b226ae17903fa0ab1","name":"block_executor"}
2022-07-14T09:05:58.385430Z [consensus] DEBUG consensus/src/state_computer.rs:110 Executing block {"block_id":"4eff39cdcadeba96ddb02ff56c072f666f8549d1ee594066def0ce0410c91afe","parent_id":"c666985bcceeb476092d3c0ddac14c2be1856a3f5665b03b226ae17903fa0ab1"}
2022-07-14T09:05:58.385503Z [consensus] INFO execution/executor/src/block_executor.rs:104 execute_block {"block_id":"4eff39cdcadeba96ddb02ff56c072f666f8549d1ee594066def0ce0410c91afe","name":"block_executor"}
Is consensus still executing here? Afterwards I see that state sync responds Ok() to consensus, but before that I see many consensus logs, so I'm wondering if consensus is still running despite asking state sync to sync. Thoughts?
@vl8062 this is very helpful, we now understand the problem, will send out a fix today, stay tuned!
pushed new fix to testnet branch, please update
Thank you @zekun000!
Also, lots of testers are reporting tremendous traffic generated by the nodes. I can see around 500 GB received by my node over the last 8 hours. Is this on the roadmap to fix as well?
you mean the network traffic or the log size?
network traffic.
it’s definitely on our roadmap to look into, thanks for sharing!
also can you share what metric you look at for the network traffic? and did you delete the db and restart the node multiple times?
I can see Internet traffic stats on my ISP portal. I restarted my node 5-6 times yesterday because it was crashing, but I didn't delete the DB or anything. Interestingly, the node didn't crash overnight and is still running.
aptos node testers are reporting 45Mb/s average with spikes up to 290Mb/s
is this upload or download traffic or both?
Download, but it's not much different from upload. I think it depends on how many peers are connected inbound/outbound.
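For a node-side number (rather than the ISP portal), Linux keeps cumulative per-interface byte counters in /proc/net/dev: on an interface line, field 2 is total received bytes and field 10 is total transmitted bytes. A minimal sketch using a hypothetical sample line; on a live host it would come from something like `grep eth0 /proc/net/dev`, with the interface name adjusted:

```shell
#!/bin/sh
# Hypothetical /proc/net/dev interface line for illustration only.
# Layout: iface, 8 receive fields (bytes first), 8 transmit fields
# (bytes first), so fields 2 and 10 are rx and tx byte totals.
line='eth0: 536870912000 123456 0 0 0 0 0 0 1073741824 654321 0 0 0 0 0 0'

rx_bytes=$(echo "$line" | awk '{print $2}')
tx_bytes=$(echo "$line" | awk '{print $10}')

# Convert cumulative bytes to whole GiB.
rx_gb=$((rx_bytes / 1024 / 1024 / 1024))
tx_gb=$((tx_bytes / 1024 / 1024 / 1024))

echo "received: ${rx_gb} GiB, sent: ${tx_gb} GiB"
```

The counters are cumulative since boot, so sampling twice and taking the difference gives actual throughput over the interval; tools like vnstat do the same bookkeeping continuously.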
@vl8062 do you have your metrics port open? We've seen people get attacked via their metrics port before, with attackers pulling tons of data from it.
@sherry-x ports 9101 and 80 are not forwarded on the router so there is no access from the Internet.
BTW, synchronization stopped again at some point... but the node is still running. Yesterday it was crashing often.
@vl8062 we have pushed the testnet branch to reduce the network traffic, if you want to try it out a0290ec4b34d916ab51e874ae210cf6175a4fa72
not syncing after the update
log8.txt
It looks like it's just trying to sync; if it's still not syncing like this, @JoshLind will take a look tomorrow.
updated to the latest, still no sync
Looks like it's not syncing because of the slow Internet connection. The node used all my monthly data (1 TB) in 2 days, and my speed is now shaped down to 1.5 Mbps. I deleted the Data folder, and this is what has synced after more than 10 min of waiting:
The node can't work with such limited bandwidth; even after our optimization, 1.5 Mbps is not sufficient.
@sherry-x just an update: I think the issue was my ISP. I switched to a different one yesterday, and the node ran with no problems, moving from 213 to 204 on the dashboard overnight. That's very strange, because I didn't have any issues with the first ISP during AIT1...