aptos-ait2

disappeared from the dashboard and sync issues

Open vl8062 opened this issue 2 years ago • 45 comments

My account disappeared from the dashboard and is no longer listed in the active validator set: 91251d1518590a4d36a4f160236e570b6bfb2c4c96081978e6f7c470e527db65

Also, it syncs to around 99%, then slows down significantly and can't reach 100%. Log is attached.

Plenty of peers:

root@aptos2:~ aptos node show-validator-set --profile ait2 | jq -r '.Result.active_validators' | grep 91251d1518590a4d36a4f160236e570b6bfb2c4c96081978e6f7c470e527db65
root@aptos2:~ curl 127.0.0.1:9101/metrics 2> /dev/null | grep "aptos_network_peer_connected{.*remote_peer_id=\"83424ccb\".*}"
aptos_network_peer_connected{network_id="Validator",peer_id="91251d15",remote_peer_id="83424ccb",role_type="validator"} 1
root@aptos2:~ curl 127.0.0.1:9101/metrics 2> /dev/null | grep "aptos_connections{.*\"Validator\".*}"
aptos_connections{direction="inbound",network_id="Validator",peer_id="91251d15",role_type="validator"} 17
aptos_connections{direction="outbound",network_id="Validator",peer_id="91251d15",role_type="validator"} 189
root@aptos2:~ curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 778568
aptos_state_sync_version{type="executed_transactions"} 769319
aptos_state_sync_version{type="synced"} 779274
aptos_state_sync_version{type="synced_epoch"} 57
root@aptos2:~ curl 127.0.0.1:9101/metrics 2> /dev/null | grep "aptos_consensus_current_round"
# HELP aptos_consensus_current_round This counter is set to the last round reported by the local round_state.
# TYPE aptos_consensus_current_round gauge
aptos_consensus_current_round 2877
root@aptos2:~ curl 127.0.0.1:9101/metrics 2> /dev/null | grep "aptos_consensus_proposals_count"
# HELP aptos_consensus_proposals_count Count of the block proposals sent by this validator since last restart (both primary and secondary)
# TYPE aptos_consensus_proposals_count counter
aptos_consensus_proposals_count 0

logait2.txt

vl8062 avatar Jul 14 '22 01:07 vl8062
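A minimal sketch of that membership check as a small reusable script, assuming the aptos CLI (ait2 profile) and jq used in the commands above; the address is the one reported in this issue:

```bash
#!/usr/bin/env bash
# Check whether this validator address appears in the current active set.
ADDR=91251d1518590a4d36a4f160236e570b6bfb2c4c96081978e6f7c470e527db65

if aptos node show-validator-set --profile ait2 \
     | jq -r '.Result.active_validators' \
     | grep -q "$ADDR"; then
  echo "$ADDR is in the active validator set"
else
  echo "$ADDR is NOT in the active validator set"
fi
```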

OK, now it's worse. It crashed twice after a restart. Log is attached. Thanks. log2.txt

vl8062 avatar Jul 14 '22 01:07 vl8062

@vl8062 could you please give us more logs before the crash? thank you!

sherry-x avatar Jul 14 '22 01:07 sherry-x

@sherry-x where is the full log located? I'm running the node from source. Thanks.

vl8062 avatar Jul 14 '22 02:07 vl8062

Updated the file. Attached is all I can see in the terminal window: log2.txt

vl8062 avatar Jul 14 '22 02:07 vl8062

Crashed again after less than 5 minutes. The log is attached: log3.txt

vl8062 avatar Jul 14 '22 02:07 vl8062

@vl8062 since you're running from source, can you update to the latest testnet branch and retry? https://github.com/aptos-labs/aptos-core/commits/testnet

zekun000 avatar Jul 14 '22 02:07 zekun000
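A minimal sketch of that update for a source-built node, assuming a local clone of aptos-core (the clone path is an assumption):

```bash
# Update to the tip of the testnet branch and rebuild the node binary.
cd ~/aptos-core                      # assumed clone location
git fetch origin
git checkout testnet
git pull origin testnet
git log -1 --format=%H               # print the commit hash that will be built
cargo build -p aptos-node --release
```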

This one? commit 93d36517ed42b798c9402ae26288a36790933ebe Thanks @zekun000

vl8062 avatar Jul 14 '22 02:07 vl8062

@vl8062 yup, let me know if you still see the crash

zekun000 avatar Jul 14 '22 02:07 zekun000

Done.

What about other issues?

  • not able to sync to 100% (attached) Uploading sync1.txt…

  • no proposals (could be related to the sync issue?)

  • gone from the dashboard

I'm using my own VM with 20 vCPUs (10 GHz) and 16 GB RAM assigned. Thank you @zekun000

vl8062 avatar Jul 14 '22 02:07 vl8062

Slowly syncing again, but as I mentioned it never reaches 100%. Another log taken after the node update is attached: log4.txt

vl8062 avatar Jul 14 '22 02:07 vl8062

Is the version increasing? It probably needs a little time to stabilize; if there's no panic we can wait a while and check.

zekun000 avatar Jul 14 '22 02:07 zekun000

It's stuck. There are 2-3 minutes between the queries below. New log is attached: log5.txt

aptos_state_sync_version{type="applied_transaction_outputs"} 801507
aptos_state_sync_version{type="executed_transactions"} 797662
aptos_state_sync_version{type="synced"} 801639
aptos_state_sync_version{type="synced_epoch"} 58
root@aptos2:/aptos-core curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 801507
aptos_state_sync_version{type="executed_transactions"} 797662
aptos_state_sync_version{type="synced"} 801639
aptos_state_sync_version{type="synced_epoch"} 58
root@aptos2:/aptos-core curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 801507
aptos_state_sync_version{type="executed_transactions"} 797662
aptos_state_sync_version{type="synced"} 801639
aptos_state_sync_version{type="synced_epoch"} 58
root@aptos2:/aptos-core curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 801507
aptos_state_sync_version{type="executed_transactions"} 797662
aptos_state_sync_version{type="synced"} 801639
aptos_state_sync_version{type="synced_epoch"} 58

vl8062 avatar Jul 14 '22 03:07 vl8062

the new log shows the network is not very stable, how many connections do you have?

zekun000 avatar Jul 14 '22 03:07 zekun000

root@aptos2:/aptos-core curl 127.0.0.1:9101/metrics 2> /dev/null | grep "aptos_connections{.*\"Validator\".*}"
aptos_connections{direction="inbound",network_id="Validator",peer_id="91251d15",role_type="validator"} 86
aptos_connections{direction="outbound",network_id="Validator",peer_id="91251d15",role_type="validator"} 115

Internet connection is 50Mbps/25Mbps (Down/Up)

Usually when sync is stuck I can reboot the node, but it soon gets stuck again without reaching 100%. It only freezes when it gets to around 98-99% sync; syncing from 0 to 99% is fine and never freezes.

I rebuilt the node 3 times and also tried using Docker, but the result is always the same. Could this be because of geolocation?

vl8062 avatar Jul 14 '22 03:07 vl8062
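A minimal sketch for confirming whether the synced version is still advancing, built from the same metrics query used throughout this thread; the 60-second interval is an arbitrary choice:

```bash
#!/usr/bin/env bash
# Print the synced version once a minute; if the number stops increasing,
# state sync has stalled.
while true; do
  synced=$(curl -s 127.0.0.1:9101/metrics \
             | grep 'aptos_state_sync_version{type="synced"}' \
             | awk '{print $2}')
  echo "$(date -u +%FT%TZ) synced_version=${synced:-unavailable}"
  sleep 60
done
```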

Hmm, cc @JoshLind in case he has some insights; it looks to me like there are a lot of RPC timeouts in the logs.

zekun000 avatar Jul 14 '22 03:07 zekun000

As soon as I restarted the node, it started syncing again. It will stop once it gets close to 100%.

root@aptos2: curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 808820
aptos_state_sync_version{type="executed_transactions"} 801639
aptos_state_sync_version{type="synced"} 807712
aptos_state_sync_version{type="synced_epoch"} 59

vl8062 avatar Jul 14 '22 04:07 vl8062

Crashed when it reached 811829. Log attached: log6.txt

root@aptos2: curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 810943
aptos_state_sync_version{type="executed_transactions"} 801857
aptos_state_sync_version{type="synced"} 811161
aptos_state_sync_version{type="synced_epoch"} 60
root@aptos2: curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 811121
aptos_state_sync_version{type="executed_transactions"} 802321
aptos_state_sync_version{type="synced"} 811803
aptos_state_sync_version{type="synced_epoch"} 60
root@aptos2: curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
aptos_state_sync_version{type="applied_transaction_outputs"} 811121
aptos_state_sync_version{type="executed_transactions"} 802347
aptos_state_sync_version{type="synced"} 811829
aptos_state_sync_version{type="synced_epoch"} 60
root@aptos2: curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
root@aptos2: curl 127.0.0.1:9101/metrics 2> /dev/null | grep aptos_state_sync_version | grep type
root@aptos2:

vl8062 avatar Jul 14 '22 04:07 vl8062

Crashed again: log7.txt

vl8062 avatar Jul 14 '22 05:07 vl8062

@vl8062 can you redirect logs to a file so we can get more logs next time it crashes?

zekun000 avatar Jul 14 '22 05:07 zekun000

is this the way to do this? cargo run -p aptos-node --release -- -f ~/$WORKSPACE/testnet/validator.yaml -a >> /path/to/file Thanks @zekun000

vl8062 avatar Jul 14 '22 05:07 vl8062

cargo run -p aptos-node --release -- -f ~/$WORKSPACE/testnet/validator.yaml 2>&1 > /path/to/file

zekun000 avatar Jul 14 '22 06:07 zekun000

also it seems you don't have debug logs, can you run with RUST_LOG=debug cargo run -p aptos-node --release -- -f ~/$WORKSPACE/testnet/validator.yaml 2>&1 > /path/to/file

zekun000 avatar Jul 14 '22 06:07 zekun000
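One caveat with that form: in bash, 2>&1 > /path/to/file duplicates stderr onto the terminal before stdout is redirected, so a node that logs to stderr keeps printing to the screen while the file stays empty. A minimal sketch that captures both streams, assuming the same validator.yaml path as above:

```bash
# Redirect stdout to the file first, then send stderr to the same place,
# so both of the node's output streams end up in the log file.
RUST_LOG=debug cargo run -p aptos-node --release -- \
  -f ~/$WORKSPACE/testnet/validator.yaml \
  > /path/to/file 2>&1

# Alternatively, keep the logs on screen while also writing them to a file:
RUST_LOG=debug cargo run -p aptos-node --release -- \
  -f ~/$WORKSPACE/testnet/validator.yaml 2>&1 | tee /path/to/file
```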

Hm, it looks like the file is not growing; I can see the logs on the screen. RUST_LOG=debug cargo run -p aptos-node --release -- -f ~/$WORKSPACE/testnet/validator.yaml 2>&1 >/root/logfile.log logfile.log

vl8062 avatar Jul 14 '22 07:07 vl8062

@zekun000 Done, using the script command. logfile.zip

vl8062 avatar Jul 14 '22 09:07 vl8062
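For reference, a minimal sketch of capturing a whole session with the util-linux script utility, the approach mentioned above; the output path and wrapped command are illustrative:

```bash
# Record everything printed to the terminal (stdout and stderr) into a typescript file.
script -c "RUST_LOG=debug cargo run -p aptos-node --release -- -f ~/$WORKSPACE/testnet/validator.yaml" /root/logfile.log
```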

Thanks @vl8062, this is much more helpful. 😄

@zekun000, what I can see:

  • The node is alternating between active state sync (to catch up) and consensus (to participate), but it seems to keep falling behind in consensus, so it bounces between state sync and consensus continuously.
  • There's a panic at the end of the logs:
2022-07-14T09:06:37.702394Z [consensus] ERROR crates/crash-handler/src/lib.rs:38 details = '''panicked at 'Failed to commit blocks: InternalError { error: "Cannot find speculation result for block id e96324f5" }', consensus/src/state_computer.rs:175:18'''
  • Before the panic, I see consensus asks state sync to sync to 845295 from 845237:
2022-07-14T09:05:58.295009Z [state-sync-driver] INFO state-sync/state-sync-v2/state-sync-driver/src/driver.rs:324 {"message":"Received a consensus sync notification! Target version: V0(LedgerInfoWithV0 { ledger_info: LedgerInfo { commit_info: BlockInfo { epoch: 65, round: 706, id: HashValue(626ca173c6968fbffb83cc28ea014e6b9543b4eb3ba64cac7902022ff96b4e4c), executed_state_id: HashValue(aa6f7d6a87f0c7d4854551347c89fd9675827a239318d3618f7d557609e94cec), version: 845295, timestamp_usecs: 1657789549394757, next_epoch_state: None }, consensus_data_hash: HashValue(28f1905b333f7ca445794dd0541c8443d8c0d0c55b3683044de53bf7f38698b9) }, ...)} }). Latest synced version: 845237","name":"consensus_notification"}

But, then I immediately see logs like:

2022-07-14T09:05:58.327848Z [consensus] DEBUG consensus/src/state_computer.rs:110 Executing block {"block_id":"f63258e0ec76b21e540679e13bbfff1851a88cbf6275c06bda9dc25e3e13bd59","parent_id":"e96324f5d63ea27356eadc988b135f310d785154c1885c57286b54b5dea6ed83"}
2022-07-14T09:05:58.327922Z [consensus] INFO execution/executor/src/block_executor.rs:104 execute_block {"block_id":"f63258e0ec76b21e540679e13bbfff1851a88cbf6275c06bda9dc25e3e13bd59","name":"block_executor"}
2022-07-14T09:05:58.357561Z [consensus] DEBUG consensus/src/state_computer.rs:110 Executing block {"block_id":"c666985bcceeb476092d3c0ddac14c2be1856a3f5665b03b226ae17903fa0ab1","parent_id":"f63258e0ec76b21e540679e13bbfff1851a88cbf6275c06bda9dc25e3e13bd59"}
2022-07-14T09:05:58.357607Z [consensus] INFO execution/executor/src/block_executor.rs:104 execute_block {"block_id":"c666985bcceeb476092d3c0ddac14c2be1856a3f5665b03b226ae17903fa0ab1","name":"block_executor"}
2022-07-14T09:05:58.385430Z [consensus] DEBUG consensus/src/state_computer.rs:110 Executing block {"block_id":"4eff39cdcadeba96ddb02ff56c072f666f8549d1ee594066def0ce0410c91afe","parent_id":"c666985bcceeb476092d3c0ddac14c2be1856a3f5665b03b226ae17903fa0ab1"}
2022-07-14T09:05:58.385503Z [consensus] INFO execution/executor/src/block_executor.rs:104 execute_block {"block_id":"4eff39cdcadeba96ddb02ff56c072f666f8549d1ee594066def0ce0410c91afe","name":"block_executor"}

Is consensus still executing here? Afterwards I see that state sync responds to consensus Ok(), but before that I see many consensus logs, so I'm wondering if consensus is still running despite asking state sync to sync. Thoughts?

JoshLind avatar Jul 14 '22 12:07 JoshLind
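A small sketch for pulling those events out of a captured log in timestamp order, using only message strings quoted above; the log path is illustrative:

```bash
# Show sync notifications, block executions, and the commit panic together,
# to check whether consensus kept executing after asking state sync to catch up.
grep -E "Received a consensus sync notification|Executing block|Failed to commit blocks" \
  /root/logfile.log
```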

@vl8062 this is very helpful, we now understand the problem, will send out a fix today, stay tuned!

zekun000 avatar Jul 14 '22 16:07 zekun000

pushed new fix to testnet branch, please update

zekun000 avatar Jul 14 '22 22:07 zekun000

Thank you @zekun000!

Also, lots of testers are reporting very heavy traffic generated by the nodes. I can see around 500 GB received by my node over the last 8 hours. Is this on the roadmap to fix as well?

vl8062 avatar Jul 14 '22 23:07 vl8062

you mean the network traffic or the log size?

zekun000 avatar Jul 14 '22 23:07 zekun000

network traffic.

vl8062 avatar Jul 14 '22 23:07 vl8062

it’s definitely on our roadmap to look into, thanks for sharing!

zekun000 avatar Jul 14 '22 23:07 zekun000

Also, can you share which metric you're looking at for the network traffic? And did you delete the DB and restart the node multiple times?

zekun000 avatar Jul 15 '22 00:07 zekun000

I can see Internet traffic stats on my ISP portal. I restarted my node 5-6 times yesterday because it was crashing, but I didn't delete the DB or anything. Interestingly, the node didn't crash overnight and is still running.

vl8062 avatar Jul 15 '22 00:07 vl8062

Aptos node testers are reporting 45 Mb/s on average, with spikes up to 290 Mb/s.

vl8062 avatar Jul 15 '22 00:07 vl8062

is this upload or download traffic or both?

zekun000 avatar Jul 15 '22 00:07 zekun000

download. But not much difference with upload. I think it depends on how many peers are connected in/out

vl8062 avatar Jul 15 '22 02:07 vl8062
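One way to measure the node's actual download/upload volume on the host itself, rather than relying on the ISP portal, is to sample the kernel's per-interface byte counters. A minimal sketch; the interface name is an assumption:

```bash
#!/usr/bin/env bash
# Sample /proc/net/dev over one minute to estimate how much the node is
# downloading and uploading.
IFACE=eth0   # assumption: replace with the node's actual network interface

sample() {
  # Print "<rx_bytes> <tx_bytes>" for $IFACE.
  grep "$IFACE:" /proc/net/dev | tr ':' ' ' | awk '{print $2, $10}'
}

read -r rx1 tx1 < <(sample)
sleep 60
read -r rx2 tx2 < <(sample)

echo "download: $(( (rx2 - rx1) / 1024 / 1024 )) MiB in the last minute"
echo "upload:   $(( (tx2 - tx1) / 1024 / 1024 )) MiB in the last minute"
```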

@vl8062 do you have your metrics port open? We've seen people get attacked on their metrics port before, with attackers pulling tons of data from it.

sherry-x avatar Jul 15 '22 02:07 sherry-x

@sherry-x ports 9101 and 80 are not forwarded on the router, so there is no access from the Internet.

BTW, synchronization stopped again at some stage... but the node is still running. Yesterday it was crashing often.

(screenshot attached)

vl8062 avatar Jul 15 '22 04:07 vl8062

@vl8062 we have pushed a change to the testnet branch to reduce the network traffic, if you want to try it out: a0290ec4b34d916ab51e874ae210cf6175a4fa72

zekun000 avatar Jul 15 '22 04:07 zekun000

Not syncing after the update (screenshot attached). log8.txt

vl8062 avatar Jul 15 '22 04:07 vl8062

It looks like it's just trying to sync; if it's still not syncing like this, @JoshLind will take a look tomorrow.

zekun000 avatar Jul 15 '22 04:07 zekun000

Updated to the latest, still no sync (screenshot attached).

vl8062 avatar Jul 16 '22 04:07 vl8062

Looks like it's not syncing because of the slow Internet connection. The node used all my monthly data (1 TB) in 2 days, and my speed is now shaped down to 1.5 Mbps. I deleted the data folder, and this is all that has synced after more than 10 minutes of waiting:

(screenshot attached)

vl8062 avatar Jul 16 '22 06:07 vl8062

The node can't work with such limited bandwidth; even after our optimization, 1.5 Mbps is not sufficient.

zekun000 avatar Jul 16 '22 06:07 zekun000

@sherry-x just an update. I think the issue was my ISP. I switched to a different one yesterday, and the node ran with no problems, moving from 213 to 204 in the dashboard overnight. That's very strange, because I didn't have any issues using the first ISP during AIT1...

vl8062 avatar Jul 23 '22 00:07 vl8062