hedera-services
Health monitor is not efficient
Problem
Below are metrics collected from a single node processing NftTransferLoadTest at full speed.
Ideally, transaction handling, being the actual bottleneck, should be 100% busy all the time, and the unhandled task queue should fluctuate around a steady number. That is not the case in this example: the health monitor discovers an unhealthy state only after a time lag, and its reaction is too harsh, emptying the incoming queue for a prolonged period and allowing the handling thread to go idle.
time trans_per_sec unhealthyDuration TransactionHandler_unhandled_task_count TransactionHandler_busy_fraction
--------------------------------------------------------------------------------------------------------------------------------
2024-07-25 15:25:06 UTC 6441.49 0.200 3 0.743
2024-07-25 15:25:09 UTC 6309.36 0.000 1 0.720
2024-07-25 15:25:12 UTC 6711.45 1.100 63 0.951
2024-07-25 15:25:15 UTC 5758.57 0.200 38 1.000
2024-07-25 15:25:18 UTC 5132.63 3.200 47 1.000
2024-07-25 15:25:21 UTC 4167.24 6.200 37 1.000
2024-07-25 15:25:24 UTC 3537.85 2.300 31 1.000
2024-07-25 15:25:27 UTC 3083.83 0.000 19 0.928
2024-07-25 15:25:30 UTC 3916.83 0.000 1 0.916
2024-07-25 15:25:33 UTC 4591.21 0.000 14 0.895
2024-07-25 15:25:36 UTC 4770.38 0.000 5 0.884
2024-07-25 15:25:39 UTC 4898.66 0.000 11 0.912
2024-07-25 15:25:42 UTC 5509.99 0.500 36 1.000
2024-07-25 15:25:45 UTC 5189.37 0.000 2 0.917
2024-07-25 15:25:48 UTC 5616.71 0.000 9 0.886
2024-07-25 15:25:51 UTC 6059.27 0.000 0 0.998
2024-07-25 15:25:54 UTC 6000.89 0.000 0 0.624
2024-07-25 15:25:57 UTC 5880.47 0.000 0 0.445
2024-07-25 15:26:00 UTC 6030.48 0.000 8 0.843
2024-07-25 15:26:03 UTC 6029.67 0.000 1 0.956
2024-07-25 15:26:06 UTC 6273.28 0.000 2 0.953
2024-07-25 15:26:09 UTC 6356.39 0.000 1 0.852
2024-07-25 15:26:12 UTC 6229.92 0.300 43 0.888
2024-07-25 15:26:15 UTC 5957.35 0.000 18 1.000
2024-07-25 15:26:18 UTC 5992.18 0.000 1 0.917
2024-07-25 15:26:21 UTC 6115.42 0.000 7 0.732
2024-07-25 15:26:24 UTC 6087.66 0.000 1 0.828
2024-07-25 15:26:27 UTC 6297.59 0.000 1 0.861
2024-07-25 15:26:30 UTC 6276.00 0.000 1 0.578
2024-07-25 15:26:33 UTC 6208.90 0.000 1 0.679
2024-07-25 15:26:36 UTC 6344.00 0.000 17 0.755
2024-07-25 15:26:39 UTC 5551.75 2.400 40 1.000
2024-07-25 15:26:42 UTC 4988.60 1.900 50 1.000
2024-07-25 15:26:45 UTC 4575.24 0.000 30 1.000
2024-07-25 15:26:48 UTC 4900.30 0.000 28 1.000
2024-07-25 15:26:51 UTC 5115.38 0.000 19 1.000
2024-07-25 15:26:54 UTC 5181.09 0.900 60 1.000
2024-07-25 15:26:57 UTC 4645.69 0.000 8 1.000
2024-07-25 15:27:00 UTC 4971.97 0.000 2 0.606
2024-07-25 15:27:03 UTC 5178.77 0.000 1 0.562
2024-07-25 15:27:06 UTC 5336.48 1.000 62 0.853
2024-07-25 15:27:09 UTC 4618.90 0.000 20 1.000
2024-07-25 15:27:12 UTC 4871.70 0.000 17 0.915
2024-07-25 15:27:15 UTC 4871.62 0.000 20 1.000
2024-07-25 15:27:18 UTC 4990.09 0.000 1 0.761
2024-07-25 15:27:21 UTC 5079.62 0.000 1 0.574
2024-07-25 15:27:24 UTC 5145.22 0.000 0 0.657
2024-07-25 15:27:27 UTC 5203.69 0.000 1 0.549
2024-07-25 15:27:30 UTC 5249.63 0.000 0 0.528
2024-07-25 15:27:33 UTC 5160.38 0.000 1 0.454
2024-07-25 15:27:36 UTC 5267.93 0.000 0 0.894
2024-07-25 15:27:39 UTC 5360.08 0.000 0 0.618
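To put a number on the idle periods, the rows above can be summarized programmatically. The following is a minimal sketch, not part of the platform: the class `BusyFractionSummary`, its methods, and the "starved" criterion (a nearly empty queue while the handler is below 100% busy) are all hypothetical and exist only for illustration. It assumes each row is whitespace-separated in the column order shown in the table header.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for summarizing the metric rows above; not platform code.
public class BusyFractionSummary {

    /** One sample row, reduced to the two columns this sketch needs. */
    record Sample(int unhandledTaskCount, double busyFraction) {}

    /** Parses a row such as "2024-07-25 15:25:06 UTC 6441.49 0.200 3 0.743". */
    static Sample parse(String row) {
        String[] f = row.trim().split("\\s+");
        // f[0]=date, f[1]=time, f[2]=zone, f[3]=trans_per_sec,
        // f[4]=unhealthyDuration, f[5]=unhandled task count, f[6]=busy fraction
        return new Sample(Integer.parseInt(f[5]), Double.parseDouble(f[6]));
    }

    /** Mean busy fraction over all samples (NaN for an empty list). */
    static double meanBusyFraction(List<Sample> samples) {
        return samples.stream()
                .mapToDouble(Sample::busyFraction)
                .average()
                .orElse(Double.NaN);
    }

    /** Counts samples with at most maxQueued tasks waiting and a handler below 100% busy. */
    static long starvedSamples(List<Sample> samples, int maxQueued) {
        return samples.stream()
                .filter(s -> s.unhandledTaskCount() <= maxQueued && s.busyFraction() < 1.0)
                .count();
    }

    public static void main(String[] args) {
        // Four consecutive rows copied from the table above.
        List<String> rows = List.of(
                "2024-07-25 15:25:51 UTC 6059.27 0.000 0 0.998",
                "2024-07-25 15:25:54 UTC 6000.89 0.000 0 0.624",
                "2024-07-25 15:25:57 UTC 5880.47 0.000 0 0.445",
                "2024-07-25 15:26:00 UTC 6030.48 0.000 8 0.843");
        List<Sample> samples = rows.stream()
                .map(BusyFractionSummary::parse)
                .collect(Collectors.toList());
        System.out.printf("mean busy fraction: %.3f%n", meanBusyFraction(samples));
        System.out.println("starved samples: " + starvedSamples(samples, 1));
    }
}
```

Feeding the full table through `parse` would give the overall mean busy fraction, which is the headroom that better flow control could in principle recover.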
Solution
Tuning the health monitor may help increase throughput in the short run. A longer-term solution requires a better mechanism for limiting the buffering of unhandled tasks, both for transaction handling and for the entire pipeline.
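One possible shape for such a mechanism is a queue-depth throttle with hysteresis, which reacts to the unhandled task count directly instead of to a lagging unhealthy duration. The sketch below is an assumption made for illustration only, not the platform's design or API: the class `WatermarkThrottle` and the watermark values 20 and 60 are hypothetical. Intake pauses once the queue reaches the high watermark and resumes as soon as it drains to the low watermark, so the handler should still have work queued when intake restarts.

```java
// Hypothetical sketch of watermark-based backpressure; not platform code.
public class WatermarkThrottle {
    private final int lowWatermark;
    private final int highWatermark;
    private boolean throttled;

    public WatermarkThrottle(int lowWatermark, int highWatermark) {
        if (lowWatermark < 0 || lowWatermark >= highWatermark) {
            throw new IllegalArgumentException("require 0 <= low < high");
        }
        this.lowWatermark = lowWatermark;
        this.highWatermark = highWatermark;
    }

    /**
     * Returns true if intake should accept new work at the given queue depth.
     * Throttling starts at the high watermark and ends at the low watermark.
     */
    public synchronized boolean shouldAccept(int unhandledTaskCount) {
        if (throttled) {
            if (unhandledTaskCount <= lowWatermark) {
                throttled = false; // queue has drained enough, resume intake
            }
        } else if (unhandledTaskCount >= highWatermark) {
            throttled = true; // queue is too deep, pause intake
        }
        return !throttled;
    }

    public static void main(String[] args) {
        // Illustrative thresholds; real values would need tuning under load.
        WatermarkThrottle throttle = new WatermarkThrottle(20, 60);
        int[] depths = {3, 63, 47, 31, 19, 1};
        for (int depth : depths) {
            System.out.println("depth=" + depth + " accept=" + throttle.shouldAccept(depth));
        }
    }
}
```

Keeping the low watermark above zero is the design choice that matters here: intake restarts before the queue is empty, which targets the idle periods visible in the metrics.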
Alternatives
No response