
Excessive resource utilization when ethstats reporting is enabled

Open nicexe opened this issue 3 years ago • 15 comments

Describe the bug
Nethermind is using excessive resources when ethstats reporting is enabled.

To Reproduce
Steps to reproduce the behavior:

  1. Set up an AuRa network such as xdaichain/posdao-testnet-prepare
  2. Before launching the nodes for the first time (`./run_all.sh`), split the nodes so that each one runs on a separate server:
    1. Modify each node's docker-compose.yml and update the NETHERMIND_NETWORKCONFIG_EXTERNALIP and NETHERMIND_ETHSTATSCONFIG_SERVER environment variables (see the sketch after this list)
    2. Optionally modify run_all.sh and stop_all.sh to only run/stop the specific node or ethstats for each server
  3. Start nodes
  4. On the archive server, deploy the staking contract: `cd ..` (go back to the `compose` directory), then `docker-compose start && docker-compose logs -f`
  5. Observe system load. This is the baseline with ethstats reporting enabled
  6. Stop the nodes
  7. Modify each node's docker-compose.yml and update NETHERMIND_ETHSTATSCONFIG_ENABLED to false
  8. Observe system load. This is with ethstats reporting disabled
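For illustration, the relevant environment entries (shown here in plain KEY=value form) might look roughly like this in each node's docker-compose.yml; the IP addresses and the ethstats endpoint below are made-up placeholders, not values from the posdao-testnet-prepare repo:

```sh
# Hypothetical values - substitute each node's real external IP and your ethstats endpoint.
NETHERMIND_NETWORKCONFIG_EXTERNALIP=203.0.113.11
NETHERMIND_ETHSTATSCONFIG_SERVER=ws://203.0.113.10:3000/api
NETHERMIND_ETHSTATSCONFIG_ENABLED=true   # set to false for the ethstats-disabled run (step 7)
```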

Screenshots
Here you can see some system metrics with ethstats reporting enabled; focus on the system load average. The only outlier is validator1, which seems to report the wrong information yet has the least resource utilization. [Screenshot 2022-03-22 at 4 40 14 PM]
Here you can see some system metrics with ethstats reporting disabled. [Screenshot 2022-03-22 at 5 01 07 PM]

System Information:

Additional notes
I cannot really explain the discrepancy between validator1 and the other validators in the ethstats-enabled scenario. All validators have almost identical environment variables set; they only differ in node name, keys, and external IP. Each node lives on its own VM, each VM on a different host machine, and some host machines are in different datacenters. All VMs have identical specs.

nicexe commented Mar 22 '22 15:03

In the ethstats-enabled scenario, validators are starting to struggle to parse/produce blocks on time, and this results in lots of benign AuRa reports.

nicexe commented Mar 22 '22 15:03

Logs in the ethstats-disabled scenario: [Screenshot 2022-03-22 at 5 33 09 PM]

Logs in the ethstats-enabled scenario: [Screenshot 2022-03-22 at 5 47 10 PM]

nicexe commented Mar 22 '22 16:03

I forgot to mention the layout of the VMs shown in my screenshots:

archive validator1 validator2
validator3 validator4 validator5

nicexe commented Mar 23 '22 08:03

@nicexe Did you try that with v1.12.3?

varasev commented Mar 24 '22 10:03

Hi @varasev, thanks for the reply. I just tried 1.12.3, but the nodes can no longer start and I'm getting these two errors:

2022-03-24 13:55:05.4288|ERROR|13|Step InitializeBlockchainAuRa failed after 151ms Nethermind.Trie.TrieException: Node 0x3adfe602f80cfa36db0ea3b41a64819071037a685399bf9381508d255e8d9f71 is missing from the DB
   at Nethermind.Trie.Pruning.TrieStore.LoadRlp(Keccak keccak, IKeyValueStore keyValueStore) in /src/Nethermind/Nethermind.Trie/Pruning/TrieStore.cs:line 331
   at Nethermind.Trie.Pruning.ReadOnlyTrieStore.LoadRlp(Keccak hash) in /src/Nethermind/Nethermind.Trie/Pruning/ReadOnlyTrieStore.cs:line 41
   at Nethermind.Trie.TrieNode.ResolveNode(ITrieNodeResolver tree) in /src/Nethermind/Nethermind.Trie/TrieNode.cs:line 234
   at Nethermind.Trie.PatriciaTree.Run(Span`1 updatePath, Int32 nibblesCount, Byte[] updateValue, Boolean isUpdate, Boolean ignoreMissingDelete, Keccak startRootHash) in /src/Nethermind/Nethermind.Trie/PatriciaTree.cs:line 401
   at Nethermind.Trie.PatriciaTree.Get(Span`1 rawKey, Keccak rootHash) in /src/Nethermind/Nethermind.Trie/PatriciaTree.cs:line 319
   at Nethermind.State.StateTree.Get(Address address, Keccak rootHash) in /src/Nethermind/Nethermind.State/StateTree.cs:line 49
   at Nethermind.State.StateProvider.GetState(Address address) in /src/Nethermind/Nethermind.State/StateProvider.cs:line 659
   at Nethermind.State.StateProvider.GetAndAddToCache(Address address) in /src/Nethermind/Nethermind.State/StateProvider.cs:line 674
   at Nethermind.State.StateProvider.GetThroughCache(Address address) in /src/Nethermind/Nethermind.State/StateProvider.cs:line 695
   at Nethermind.State.StateProvider.GetCodeHash(Address address) in /src/Nethermind/Nethermind.State/StateProvider.cs:line 311
   at Nethermind.State.IReadOnlyStateProvider.IsContract(Address address) in /src/Nethermind/Nethermind.State/IReadOnlyStateProvider.cs:line 39
   at Nethermind.Evm.TransactionProcessing.ReadOnlyTransactionProcessor.IsContractDeployed(Address address) in /src/Nethermind/Nethermind.Evm/TransactionProcessing/ReadOnlyTransactionProcessor.cs:line 55
   at Nethermind.Blockchain.Contracts.Contract.ConstantContract.CallRaw(CallInfo callInfo, IReadOnlyTransactionProcessor readOnlyTransactionProcessor) in /src/Nethermind/Nethermind.Blockchain/Contracts/Contract.ConstantContract.cs:line 111
   at Nethermind.Blockchain.Contracts.Contract.ConstantContract.Call(CallInfo callInfo) in /src/Nethermind/Nethermind.Blockchain/Contracts/Contract.ConstantContract.cs:line 104
   at Nethermind.Blockchain.Contracts.Contract.IConstantContract.Call[T](CallInfo callInfo) in /src/Nethermind/Nethermind.Blockchain/Contracts/Contract.ConstantContract.cs:line 43
   at Nethermind.Blockchain.Contracts.Contract.IConstantContract.Call[T](BlockHeader parentHeader, String functionName, Address sender, Object[] arguments) in /src/Nethermind/Nethermind.Blockchain/Contracts/Contract.ConstantContract.cs:line 52
   at Nethermind.Consensus.AuRa.Contracts.ValidatorContract.GetValidators(BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Contracts/ValidatorContract.cs:line 104
   at Nethermind.Consensus.AuRa.Validators.ContractBasedValidator.LoadValidatorsFromContract(BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/ContractBasedValidator.cs:line 246
   at Nethermind.Consensus.AuRa.Validators.ContractBasedValidator.SetFinalizationManager(IAuRaBlockFinalizationManager finalizationManager, BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/ContractBasedValidator.cs:line 84
   at Nethermind.Consensus.AuRa.Validators.ContractBasedValidator..ctor(IValidatorContract validatorContract, IBlockTree blockTree, IReceiptFinder receiptFinder, IValidatorStore validatorStore, IValidSealerStrategy validSealerStrategy, IAuRaBlockFinalizationManager finalizationManager, BlockHeader parentHeader, ILogManager logManager, Int64 startBlockNumber, Int64 posdaoTransition, Boolean forSealing) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/ContractBasedValidator.cs:line 71
   at Nethermind.Consensus.AuRa.AuRaValidatorFactory.<CreateValidatorProcessor>g__GetContractBasedValidator|19_2(<>c__DisplayClass19_0& ) in /src/Nethermind/Nethermind.Consensus.AuRa/AuRaValidatorFactory.cs:line 104
   at Nethermind.Consensus.AuRa.AuRaValidatorFactory.CreateValidatorProcessor(Validator validator, BlockHeader parentHeader, Nullable`1 startBlock) in /src/Nethermind/Nethermind.Consensus.AuRa/AuRaValidatorFactory.cs:line 131
   at Nethermind.Consensus.AuRa.Validators.MultiValidator.CreateValidator(Int64 finalizedAtBlockNumber, Validator validatorPrototype, BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/MultiValidator.cs:line 222
   at Nethermind.Consensus.AuRa.Validators.MultiValidator.SetCurrentValidator(Int64 finalizedAtBlockNumber, Validator validatorPrototype, BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/MultiValidator.cs:line 204
   at Nethermind.Consensus.AuRa.Validators.MultiValidator.SetCurrentValidator(KeyValuePair`2 validatorInfo, BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/MultiValidator.cs:line 196
   at Nethermind.Consensus.AuRa.Validators.MultiValidator.InitCurrentValidator(Int64 blockNumber, BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/MultiValidator.cs:line 75
   at Nethermind.Consensus.AuRa.Validators.MultiValidator.SetFinalizationManager(IAuRaBlockFinalizationManager finalizationManager, BlockHeader parentHeader) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/MultiValidator.cs:line 186
   at Nethermind.Consensus.AuRa.Validators.MultiValidator..ctor(Validator validator, IAuRaValidatorFactory validatorFactory, IBlockTree blockTree, IValidatorStore validatorStore, IAuRaBlockFinalizationManager finalizationManager, BlockHeader parentHeader, ILogManager logManager, Boolean forSealing) in /src/Nethermind/Nethermind.Consensus.AuRa/Validators/MultiValidator.cs:line 66
   at Nethermind.Consensus.AuRa.AuRaValidatorFactory.CreateValidatorProcessor(Validator validator, BlockHeader parentHeader, Nullable`1 startBlock) in /src/Nethermind/Nethermind.Consensus.AuRa/AuRaValidatorFactory.cs:line 145
   at Nethermind.Consensus.AuRa.InitializationSteps.InitializeBlockchainAuRa.CreateAuRaValidator(IBlockProcessor processor, IReadOnlyTxProcessorSource readOnlyTxProcessorSource) in /src/Nethermind/Nethermind.Consensus.AuRa/InitializationSteps/InitializeBlockchainAuRa.cs:line 131
   at Nethermind.Consensus.AuRa.InitializationSteps.InitializeBlockchainAuRa.CreateBlockProcessor() in /src/Nethermind/Nethermind.Consensus.AuRa/InitializationSteps/InitializeBlockchainAuRa.cs:line 94
   at Nethermind.Init.Steps.InitializeBlockchain.InitBlockchain() in /src/Nethermind/Nethermind.Init/Steps/InitializeBlockchain.cs:line 220
   at Nethermind.Init.Steps.InitializeBlockchain.Execute(CancellationToken _) in /src/Nethermind/Nethermind.Init/Steps/InitializeBlockchain.cs:line 69
2022-03-24 13:55:05.4476|ERROR|10|Error during ethereum runner start Nethermind.Init.Steps.StepDependencyException: BlockchainProcessor
   at Nethermind.Init.Steps.StartBlockProcessor.Execute(CancellationToken _) in /src/Nethermind/Nethermind.Init/Steps/StartBlockProcessor.cs:line 37
   at Nethermind.Init.Steps.EthereumStepsManager.RunOneRoundOfInitialization(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Init/Steps/EthereumStepsManager.cs:line 140
   at Nethermind.Init.Steps.EthereumStepsManager.InitializeAll(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Init/Steps/EthereumStepsManager.cs:line 108
   at Nethermind.Runner.Ethereum.EthereumRunner.Start(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Runner/Ethereum/EthereumRunner.cs:line 51

I'll try keeping all nodes on the latest version except one, which I'll configure to use 1.12.3 and sync from 0.

nicexe commented Mar 24 '22 14:03

@varasev 1.12.3 definitely helped

System load: [screenshot]

ethstats: [screenshot]

I restored the archive node and validators 1 through 4 and configured them to use 1.12.6. I removed all data for validator 5 and configured it to use 1.12.3.

After validator 5 synced with the rest of the nodes I let it rest for a few minutes and then checked the system load as seen in the screenshot.

Validator 5 has a system load average of about 1 on a 4-core VM, while the other nodes have a system load average of about 8-10 on a 4-core VM each.

Again, the only outlier is validator 1, which has significantly lower CPU utilization and a system load average of about 2 on a 4-core VM, but it is also the only node reporting invalid stats to the ethstats server.

nicexe commented Mar 24 '22 14:03

Hi @nicexe, looking into your issue: if 1.12.3 solves it, then there is a slight chance that the actual reason is different and related to https://github.com/NethermindEth/nethermind/pull/3901

LukaszRozmej commented Mar 24 '22 21:03

@nicexe Can you try to run the setup using the nethermindeth/nethermind:local_txs_broadcasting docker image? It is the build from https://github.com/NethermindEth/nethermind/pull/3901
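For illustration, a rough way to point a single node at that image; the service name validator5 below is a placeholder, and it assumes the compose file's image line can simply be swapped:

```sh
# Pull the PR build (placeholder service name "validator5").
docker pull nethermindeth/nethermind:local_txs_broadcasting
# In that node's docker-compose.yml, change the service's image to:
#   nethermindeth/nethermind:local_txs_broadcasting
# then recreate the container:
docker-compose up -d validator5
```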

varasev commented Mar 27 '22 06:03

Also pushed #3906

LukaszRozmej commented Mar 28 '22 10:03

I tried using nethermindeth/nethermind:local_txs_broadcasting on validator 5. This brings the VM that validator 5 runs on to a system load of about 9 and 100% CPU utilization, just like the rest of the nodes, as opposed to a system load of about 1 when using 1.12.3.

nicexe commented Mar 28 '22 12:03

> Also pushed #3906

@LukaszRozmej Can you please make a docker image for this?

varasev commented Mar 29 '22 08:03

@nicexe I fail to actually reproduce this: [screenshots]

Would you be able to provide a performance trace from one of the misbehaving nodes? I am asking our devops how to do that on nodes running inside Docker, as that is more complicated.

@matilote would you be able to provide a docker image + instructions on how to use dottrace here?

LukaszRozmej commented Mar 30 '22 07:03

@nicexe can you help us gather data here?

  1. Please use image nethermindeth/nethermind:test
  2. After running it on one or multiple misbehaving nodes, please do the following (see the consolidated sketch after this list): `sudo docker exec -it xdaitestnet-validatorX /bin/bash`, then `mkdir dumps`, then `/opt/tools/dottrace attach PID --save-to=dumps`
  3. After some time, like a minute or two, close dottrace with Ctrl+C; it will take the snapshot and save it to the dumps folder
  4. Now you need to send us the dump; you can do it in different ways. One option is: `apt update`, `apt install curl`, `apt install zip`, `zip -9 -r perf.zip dump`, `curl --upload-file ./perf.zip https://transfer.sh/perf.zip`. The last command gives you a URL where the file is available for download.
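For convenience, the same steps consolidated into one sketch; xdaitestnet-validatorX and PID are placeholders for the actual container name and Nethermind process id:

```sh
# Enter the misbehaving node's container (placeholder name).
sudo docker exec -it xdaitestnet-validatorX /bin/bash

# Inside the container: create a folder for the snapshot and attach dottrace
# (PID is the Nethermind process id, e.g. from ps aux); stop it with Ctrl+C after a minute or two.
mkdir dumps
/opt/tools/dottrace attach PID --save-to=dumps

# Package and upload the snapshot ("dumps" here; step 4 above zips "dump", so adjust to whichever folder exists).
apt update && apt install -y zip curl
zip -9 -r perf.zip dumps
curl --upload-file ./perf.zip https://transfer.sh/perf.zip
```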

LukaszRozmej commented Mar 31 '22 09:03

@nicexe we found one potential issue with our ethstats integration that could be related to your problem. We will release the fix soon; it is merged to master.

MarekM25 commented Jul 08 '22 08:07

#4249

LukaszRozmej commented Jul 11 '22 12:07