aptos-core
[network] Add REST API discovery for Validator full nodes
Description
Now, if a node's view of the network has fallen too far behind, a user can use REST API discovery to read the validator set directly and get a more up-to-date view of it.
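As a rough illustration of the idea (not the code in this PR), the sketch below turns a freshly fetched validator set into a peer set and reports which peers differ from a stale local view. `ValidatorInfo`, `PeerSet`, `peers_from_validator_set`, and `changed_accounts` are hypothetical names chosen for this example, and the REST call itself is left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified view of one validator entry as a REST client
/// might decode it. Not the real on-chain type.
#[derive(Clone, Debug, PartialEq)]
struct ValidatorInfo {
    account: String,
    /// Network addresses advertised for the validator's full node (VFN).
    fullnode_addresses: Vec<String>,
}

/// Peer set keyed by validator account (hypothetical alias).
type PeerSet = BTreeMap<String, Vec<String>>;

/// Turn a freshly fetched validator set into a peer set, skipping
/// validators that advertise no full node address.
fn peers_from_validator_set(set: &[ValidatorInfo]) -> PeerSet {
    set.iter()
        .filter(|v| !v.fullnode_addresses.is_empty())
        .map(|v| (v.account.clone(), v.fullnode_addresses.clone()))
        .collect()
}

/// Accounts whose addresses in `fresh` are new or differ from `stale`.
fn changed_accounts(stale: &PeerSet, fresh: &PeerSet) -> Vec<String> {
    fresh
        .iter()
        .filter(|(account, addrs)| stale.get(*account) != Some(*addrs))
        .map(|(account, _)| account.clone())
        .collect()
}
```

A node holding an old address for a VFN would see that account reported by `changed_accounts` once the fresh set arrives, which is the situation this discovery method is meant to repair.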
Test Plan
Tested locally with on-chain discovery off. It does get the nodes, and syncs from on-chain.
Forge is running suite land_blocking on 3d3abd19f220078536b49f0e5b94edbc1f31ef33
- Grafana dashboard (auto-refresh)
- Humio Logs
- (Deprecated) OpenSearch Logs
- Test runner output
- Test run is land-blocking
Forge is running suite compat on 843b204dce971d98449b82624f4f684c7a18b991 ==> 3d3abd19f220078536b49f0e5b94edbc1f31ef33
:white_check_mark: Forge suite compat success on 843b204dce971d98449b82624f4f684c7a18b991 ==> 3d3abd19f220078536b49f0e5b94edbc1f31ef33
Compatibility test results for 843b204dce971d98449b82624f4f684c7a18b991 ==> 3d3abd19f220078536b49f0e5b94edbc1f31ef33 (PR)
1. Check liveness of validators at old version: 843b204dce971d98449b82624f4f684c7a18b991
compatibility::simple-validator-upgrade::liveness-check : 7334 TPS, 5186 ms latency, 8300 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 3d3abd19f220078536b49f0e5b94edbc1f31ef33
compatibility::simple-validator-upgrade::single-validator-upgrade : 5503 TPS, 6865 ms latency, 9100 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 3d3abd19f220078536b49f0e5b94edbc1f31ef33
compatibility::simple-validator-upgrade::half-validator-upgrade : 4667 TPS, 8303 ms latency, 12300 ms p99 latency,no expired txns
4. upgrading second batch to new version: 3d3abd19f220078536b49f0e5b94edbc1f31ef33
compatibility::simple-validator-upgrade::rest-validator-upgrade : 7668 TPS, 4778 ms latency, 8500 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for 843b204dce971d98449b82624f4f684c7a18b991 ==> 3d3abd19f220078536b49f0e5b94edbc1f31ef33 passed
Test Ok
:white_check_mark: Forge suite land_blocking success on 3d3abd19f220078536b49f0e5b94edbc1f31ef33
performance benchmark with full nodes : 7839 TPS, 5061 ms latency, 7500 ms p99 latency,(!) expired 180 out of 3386740 txns
Test Ok
Forge is running suite compat on 843b204dce971d98449b82624f4f684c7a18b991 ==> 9385c03ab28146414428f45556081bb5ec854cb4
Forge is running suite land_blocking on 9385c03ab28146414428f45556081bb5ec854cb4
:white_check_mark: Forge suite land_blocking success on 9385c03ab28146414428f45556081bb5ec854cb4
performance benchmark with full nodes : 6624 TPS, 5995 ms latency, 10100 ms p99 latency,(!) expired 920 out of 2829740 txns
Test Ok
:white_check_mark: Forge suite compat success on 843b204dce971d98449b82624f4f684c7a18b991 ==> 9385c03ab28146414428f45556081bb5ec854cb4
Compatibility test results for 843b204dce971d98449b82624f4f684c7a18b991 ==> 9385c03ab28146414428f45556081bb5ec854cb4 (PR)
1. Check liveness of validators at old version: 843b204dce971d98449b82624f4f684c7a18b991
compatibility::simple-validator-upgrade::liveness-check : 7829 TPS, 4916 ms latency, 6600 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 9385c03ab28146414428f45556081bb5ec854cb4
compatibility::simple-validator-upgrade::single-validator-upgrade : 4308 TPS, 10549 ms latency, 15900 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 9385c03ab28146414428f45556081bb5ec854cb4
compatibility::simple-validator-upgrade::half-validator-upgrade : 4924 TPS, 8436 ms latency, 11300 ms p99 latency,no expired txns
4. upgrading second batch to new version: 9385c03ab28146414428f45556081bb5ec854cb4
compatibility::simple-validator-upgrade::rest-validator-upgrade : 7438 TPS, 5174 ms latency, 8900 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for 843b204dce971d98449b82624f4f684c7a18b991 ==> 9385c03ab28146414428f45556081bb5ec854cb4 passed
Test Ok
Given the recent ping on this, a couple thoughts:
- IIUC, this only solves the case where a node holds a version of the validator set in which every VFN has since rotated its network addresses (e.g., think of a genesis blob from 2 years ago where the VFNs have all relocated)? It does not solve the issue of saturation, i.e., all VFNs being at maximum capacity.
- To discover from a REST URL, you need to put a REST URL into the config, which raises the question of where we get it from. We should probably update the documentation to call this out. Note: it's probably the same amount of work as adding a seed to the config.
- It would be great if we could add some tests. There is a smoke test for file discovery, right? Could we modify it?
- This is only meant to solve the case where the VFNs that haven't rotated their network addresses since genesis are saturated. That is highly likely, since this set only ever shrinks while the number of connecting nodes only ever grows. So even if there are 100 VFNs that aren't saturated, the 5 that haven't rotated their keys or addresses since genesis would carry a significantly higher load. And if you don't restart the node, it will never disconnect from those VFNs.
- I would argue this is fine; you can simply use the public good API endpoint for this.
- The smoke test for file discovery is flawed and flaky. This would require a bunch of work around the testing framework. I'd be happy to work with you on getting that in, but my focus is mostly elsewhere.
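The load argument above can be made concrete with a toy model. This is only a sketch under simplifying assumptions that are not from the PR: each connecting node opens one connection, "fresh" nodes pick uniformly among all VFNs, and "stale" nodes can only reach VFNs whose addresses have not changed since genesis. The function name and the node counts used with it are hypothetical; the 100 and 5 figures come from the comment above.

```rust
/// Expected connections per VFN in a toy model.
///
/// Returns `(load_on_unrotated_vfn, load_on_rotated_vfn)`.
/// Assumes `total_vfns` and `unrotated_vfns` are both non-zero.
fn expected_load(
    total_vfns: u32,
    unrotated_vfns: u32,
    stale_nodes: u32,
    fresh_nodes: u32,
) -> (f64, f64) {
    // Fresh nodes know every VFN and spread evenly over all of them.
    let from_fresh = f64::from(fresh_nodes) / f64::from(total_vfns);
    // Stale nodes can only reach the VFNs that never rotated.
    let from_stale = f64::from(stale_nodes) / f64::from(unrotated_vfns);
    (from_fresh + from_stale, from_fresh)
}
```

With 100 VFNs, 5 of them unrotated, and a hypothetical 1000 stale plus 1000 fresh nodes, `expected_load(100, 5, 1000, 1000)` gives 210 expected connections on each unrotated VFN against 10 on each of the others. As the unrotated set shrinks and the stale population grows, the gap widens further.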
I argue that we probably don't want to use the addresses in genesis or in storage to connect a node; those are undoubtedly out of date. Those that haven't changed their addresses are now out of date. Considering that we still have a waypoint to get kick-started, I want to push back against all of Josh's arguments as nice-to-haves. The one thing I would like to see, however, is some sort of test. Not necessarily an e2e test, but at least something that verifies that this can't get accidentally broken.
@JoshLind, thoughts?
I'm hopeful that we'll be able to address most of these by AIT4. But if we want to land this PR for now, that makes sense to me. The only blocker is a test of some form :)
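One possible shape for such a test, short of an e2e run, is to put the REST fetch behind a trait and exercise the update logic with a fixed test double. Everything below is a hypothetical sketch: `ValidatorSetSource`, `refresh_peers`, and `FixedSource` are names invented for this example and are not the types used in this PR.

```rust
use std::collections::BTreeSet;

/// Hypothetical abstraction over "ask a REST endpoint for the current
/// set of VFN addresses".
trait ValidatorSetSource {
    fn fetch_addresses(&self) -> Result<BTreeSet<String>, String>;
}

/// One discovery round: replace the peer set on success, keep the
/// previous peers on failure. Returns whether the fetch succeeded.
fn refresh_peers(source: &dyn ValidatorSetSource, peers: &mut BTreeSet<String>) -> bool {
    match source.fetch_addresses() {
        Ok(fresh) => {
            *peers = fresh;
            true
        }
        Err(_) => false,
    }
}

/// Test double that always returns the same canned result.
struct FixedSource(Result<BTreeSet<String>, String>);

impl ValidatorSetSource for FixedSource {
    fn fetch_addresses(&self) -> Result<BTreeSet<String>, String> {
        self.0.clone()
    }
}

#[test]
fn successful_fetch_replaces_stale_peers() {
    let mut peers = BTreeSet::from(["old-addr".to_string()]);
    let source = FixedSource(Ok(BTreeSet::from(["new-addr".to_string()])));
    assert!(refresh_peers(&source, &mut peers));
    assert_eq!(peers, BTreeSet::from(["new-addr".to_string()]));
}

#[test]
fn failed_fetch_keeps_previous_peers() {
    let mut peers = BTreeSet::from(["old-addr".to_string()]);
    let source = FixedSource(Err("endpoint unreachable".to_string()));
    assert!(!refresh_peers(&source, &mut peers));
    assert_eq!(peers, BTreeSet::from(["old-addr".to_string()]));
}
```

A test like this does not cover the HTTP layer, but it would catch an accidental change in how a fetched set is applied to the peer list.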
We'll probably get some feedback from the community about this. Can we add this to a guide somewhere?
I'll make a follow-up PR: "Discovery of nodes" or something like that.
Forge is running suite compat on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 46222aaac0ff497fdbaefea3235a8e253d8279bb
Forge is running suite land_blocking on 46222aaac0ff497fdbaefea3235a8e253d8279bb
:white_check_mark: Forge suite land_blocking success on 46222aaac0ff497fdbaefea3235a8e253d8279bb
performance benchmark with full nodes : 5545 TPS, 7162 ms latency, 14700 ms p99 latency,(!) expired 420 out of 2368560 txns
Test Ok
:white_check_mark: Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 46222aaac0ff497fdbaefea3235a8e253d8279bb
Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 46222aaac0ff497fdbaefea3235a8e253d8279bb (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7223 TPS, 5388 ms latency, 8200 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 46222aaac0ff497fdbaefea3235a8e253d8279bb
compatibility::simple-validator-upgrade::single-validator-upgrade : 4206 TPS, 9665 ms latency, 13000 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 46222aaac0ff497fdbaefea3235a8e253d8279bb
compatibility::simple-validator-upgrade::half-validator-upgrade : 4519 TPS, 9269 ms latency, 12800 ms p99 latency,no expired txns
4. upgrading second batch to new version: 46222aaac0ff497fdbaefea3235a8e253d8279bb
compatibility::simple-validator-upgrade::rest-validator-upgrade : 5988 TPS, 6622 ms latency, 10000 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 46222aaac0ff497fdbaefea3235a8e253d8279bb passed
Test Ok