hedera-services

(HashConf) run deep reconnect tests 10+ nodes

Open alex-kuzmin-hg opened this issue 5 months ago • 0 comments

Per HashConf 2024 brainstorming sessions:

Artem Ananev 8:17 AM
Hi Alex and the team. Let me try to summarize what we discussed about reconnect testing in the last few days:

Step 0: configure a new network in latitude to run consensus nodes and Oleg’s load generator. The more nodes the better, but more than 10-11 probably wouldn’t add much value. At this step the load generator can be used to generate a state (accounts and tokens) and, after that, to start NFT transfers.

Step 1: run develop with 40/40 state and TPS limited to 5K. Only a small fraction of these 40M accounts should be hot, like 1M or even less; the other accounts will not be very active. Please check with Oleg how to configure that part. The nodes should be running stably at this point.

Step 1a: after the state is fully generated and NFT transfers have been running at 5K TPS for a few minutes (e.g. 10 mins), shut down one node and start it back up 10 mins later. This will make the node start a reconnect process. It would be great to have this stop/restart process automated, since it is a crucial part of reconnect testing.

Step 1b: if the reconnect is successful, repeat step 1a a few times.

The next steps will depend on the results of steps 1/1a. If the node is able to reconnect, we will increase the TPS (ideally to 10K) and/or increase the state size (to 100M, ideally to 1B) and/or increase the node shutdown period (15 mins, 30 mins, 1 hour, 3 hours). If reconnects fail, we will need to find out why. It could be because of the reconnects themselves, because of the health monitor, or because of something entirely different.

Once Oleg prepares a small fix for the health monitor to make its resolution finer (to run every 1ms instead of every 100ms) as we discussed, it will make sense to use that branch for testing. It should help a little bit with the final “catching up” part of the reconnect process.

Once my changes for QueueNode and in-memory virtual maps are available, it will also make sense to test them, since we expect them to have a positive impact on reconnects (the reconnect part).

Does it look like a good plan? Any comments? Thanks!
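The stop/restart automation from step 1a and the escalation described after step 1b could be sketched roughly as below. This is only an illustration, not existing tooling: `stop_node`, `start_node` and `is_active` are hypothetical callables that whoever runs the test would wire to the real node controls and status checks. The names `Scenario`, `escalation_plan` and `run_reconnect_cycles` are invented for this sketch, as are the timeout and polling defaults. Reading "and/or" as a full cross product of TPS, state size and downtime is also an assumption.

```python
import time
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterator


@dataclass(frozen=True)
class Scenario:
    tps: int             # target transactions per second
    state_millions: int  # state size, in millions of entities
    downtime_min: int    # how long the node stays down, in minutes


def escalation_plan() -> Iterator[Scenario]:
    """Yield the baseline (5K TPS, 40M, 10 mins) first, then larger settings."""
    tps_levels = [5_000, 10_000]
    state_levels = [40, 100, 1_000]
    downtimes = [10, 15, 30, 60, 180]
    # Assumption: "and/or" is treated as every combination of the three knobs.
    for state, tps, down in product(state_levels, tps_levels, downtimes):
        yield Scenario(tps=tps, state_millions=state, downtime_min=down)


def run_reconnect_cycles(
    stop_node: Callable[[], None],    # hypothetical: shuts the node down
    start_node: Callable[[], None],   # hypothetical: starts the node again
    is_active: Callable[[], bool],    # hypothetical: True once the node caught up
    downtime_s: float,
    repeats: int = 3,
    reconnect_timeout_s: float = 3600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run up to `repeats` stop/restart cycles; return how many reconnected.

    Stops at the first cycle that does not become active before the timeout,
    so a failed reconnect can be investigated in place.
    """
    successes = 0
    for _ in range(repeats):
        stop_node()
        sleep(downtime_s)            # keep the node down for the chosen period
        start_node()
        deadline = clock() + reconnect_timeout_s
        while not is_active():       # poll until the node reports it is active
            if clock() >= deadline:
                return successes
            sleep(poll_s)
        successes += 1
    return successes
```

A run of the baseline would then be something like `run_reconnect_cycles(stop, start, check, downtime_s=10 * 60)`. The sketch does not wait between cycles, so a pause for traffic to settle before the next shutdown would have to be added by the caller.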

alex-kuzmin-hg, Sep 30 '24 14:09