casper-node
casper-node copied to clipboard
Spike: Review non-histogram metrics to deteremine if they should be histograms
Correlating different time-related metrics can provide insights into how to tune certain parameters or indicate issues. We need to have a review off all time-related metrics and make sure we use histograms for them. Have a pass through all these time related metrics that are related and decide if we should normalize their parameters (number of buckets, bucket witdths). Look for correlation between this metrics and add some determinations or other compound metrics that would show anomalies given system invariants. For example block time should be constant (or at least within a certain range of lanes) so correlating time based metrics from the components involved in the production of blocks (block execution, finalized block creation, event queue latency for the related messages etc) would provide insights of if and how we reach the required deadline.
Deliverable is finding which metrics are currently useful to engineers and which are not, and if any are missing.
@alsrdn needs an estimate
where are we NOT using histograms but should be? What other kind of trend analysis should we do? ' The goal is to have a minimum number of metrics that are the most impactful for drawing conclusions.
Possible Teonite ticket? @devendran-m
List of all metrics + type:
Click to expand
| No. | Metric Name | Type |
|---|---|---|
| 1 | accumulated_incoming_limiter_delay | counter |
| 2 | net_in_bytes_protocol | counter |
| 3 | net_in_bytes_trie_transfer | counter |
| 4 | net_in_count_address_gossip | counter |
| 5 | net_in_count_block_gossip | counter |
| 6 | net_in_count_block_transfer | counter |
| 7 | net_in_count_consensus | counter |
| 8 | net_in_bytes_other | counter |
| 9 | net_in_count_deploy_gossip | counter |
| 10 | net_in_count_finality_signature_gossip | counter |
| 11 | net_in_count_other | counter |
| 12 | net_in_count_protocol | counter |
| 13 | net_in_count_trie_transfer | counter |
| 14 | net_out_bytes_address_gossip | counter |
| 15 | net_out_bytes_block_gossip | counter |
| 16 | net_in_count_deploy_transfer | counter |
| 17 | net_out_bytes_block_transfer | counter |
| 18 | net_in_bytes_finality_signature_gossip | counter |
| 19 | net_in_bytes_deploy_gossip | counter |
| 20 | finality_signature_gossiper_items_received | counter |
| 21 | finality_signature_gossiper_times_gossiped | counter |
| 22 | finality_signature_gossiper_times_ran_out_of_peers | counter |
| 23 | legacy_deploy_fetch_total | counter |
| 24 | legacy_deploy_found_in_storage | counter |
| 25 | legacy_deploy_found_on_peer | counter |
| 26 | net_in_bytes_deploy_transfer | counter |
| 27 | legacy_deploy_timeouts | counter |
| 28 | net_broadcast_requests | counter |
| 29 | net_direct_message_requests | counter |
| 30 | net_in_bytes_address_gossip | counter |
| 31 | net_in_bytes_block_gossip | counter |
| 32 | net_in_bytes_block_transfer | counter |
| 33 | net_in_bytes_consensus | counter |
| 34 | trie_or_chunk_found_on_peer | counter |
| 35 | net_out_bytes_consensus | counter |
| 36 | net_out_bytes_deploy_gossip | counter |
| 37 | net_out_bytes_deploy_transfer | counter |
| 38 | sync_leap_fetcher_fetch_total | counter |
| 39 | sync_leap_fetcher_found_in_storage | counter |
| 40 | sync_leap_fetcher_found_on_peer | counter |
| 41 | sync_leap_fetcher_timeouts | counter |
| 42 | sync_leap_rejected_by_peer_total | counter |
| 43 | transaction_fetch_total | counter |
| 44 | sync_leap_fetched_from_peer_total | counter |
| 45 | transaction_found_in_storage | counter |
| 46 | transaction_gossiper_items_received | counter |
| 47 | transaction_gossiper_times_gossiped | counter |
| 48 | transaction_gossiper_times_ran_out_of_peers | counter |
| 49 | transaction_timeouts | counter |
| 50 | trie_or_chunk_fetch_total | counter |
| 51 | trie_or_chunk_found_in_storage | counter |
| 52 | transaction_found_on_peer | counter |
| 53 | sync_leap_cant_fetch_total | counter |
| 54 | runner_events | counter |
| 55 | requests_for_trie_finished | counter |
| 56 | net_out_bytes_finality_signature_gossip | counter |
| 57 | net_out_bytes_other | counter |
| 58 | net_out_bytes_protocol | counter |
| 59 | net_out_bytes_trie_transfer | counter |
| 60 | net_out_count_address_gossip | counter |
| 61 | net_out_count_block_gossip | counter |
| 62 | net_out_count_block_transfer | counter |
| 63 | net_out_count_consensus | counter |
| 64 | net_out_count_deploy_gossip | counter |
| 65 | net_out_count_deploy_transfer | counter |
| 66 | net_out_count_finality_signature_gossip | counter |
| 67 | net_out_count_other | counter |
| 68 | net_out_count_protocol | counter |
| 69 | net_out_count_trie_transfer | counter |
| 70 | requests_for_trie_accepted | counter |
| 71 | finality_signature_fetcher_timeouts | counter |
| 72 | finality_signature_fetcher_found_on_peer | counter |
| 73 | trie_or_chunk_timeouts | counter |
| 74 | finality_signature_fetcher_fetch_total | counter |
| 75 | block_header_found_on_peer | counter |
| 76 | block_header_found_in_storage | counter |
| 77 | block_header_fetch_total | counter |
| 78 | finality_signature_fetcher_found_in_storage | counter |
| 79 | block_gossiper_times_gossiped | counter |
| 80 | block_gossiper_items_received | counter |
| 81 | block_found_on_peer | counter |
| 82 | block_found_in_storage | counter |
| 83 | block_fetch_total | counter |
| 84 | block_execution_results_or_chunk_fetcher_timeouts | counter |
| 85 | block_execution_results_or_chunk_fetcher_found_on_peer | counter |
| 86 | block_execution_results_or_chunk_fetcher_found_in_storage | counter |
| 87 | block_execution_results_or_chunk_fetcher_fetch_total | counter |
| 88 | binary_port_try_speculative_exec_count | counter |
| 89 | binary_port_try_accept_transaction_count | counter |
| 90 | binary_port_get_state_count | counter |
| 91 | binary_port_get_record_count | counter |
| 92 | binary_port_get_info_count | counter |
| 93 | binary_port_connections_count | counter |
| 94 | approvals_hashes_timeouts | counter |
| 95 | approvals_hashes_found_on_peer | counter |
| 96 | approvals_hashes_found_in_storage | counter |
| 97 | approvals_hashes_fetch_total | counter |
| 98 | address_gossiper_times_ran_out_of_peers | counter |
| 99 | address_gossiper_times_gossiped | counter |
| 100 | address_gossiper_items_received | counter |
| 101 | accumulated_outgoing_limiter_delay | counter |
| 102 | block_header_timeouts | counter |
| 103 | block_timeouts | counter |
| 104 | block_gossiper_times_ran_out_of_peers | counter |
| 105 | mem_rest_server | gauge |
| 106 | scheduler_queue_network_demands_count | gauge |
| 107 | scheduler_queue_network_count | gauge |
| 108 | scheduler_queue_gossip_count | gauge |
| 109 | scheduler_queue_from_storage_count | gauge |
| 110 | scheduler_queue_finality_signature_count | gauge |
| 111 | scheduler_queue_fetch_count | gauge |
| 112 | scheduler_queue_control_count | gauge |
| 113 | scheduler_queue_contract_runtime_count | gauge |
| 114 | scheduler_queue_consensus_count | gauge |
| 115 | scheduler_queue_network_incoming_count | gauge |
| 116 | scheduler_queue_api_count | gauge |
| 117 | block_accumulator_block_acceptors | gauge |
| 118 | peers | gauge |
| 119 | out_state_waiting | gauge |
| 120 | out_state_loopback | gauge |
| 121 | out_state_connecting | gauge |
| 122 | mem_net | gauge |
| 123 | out_state_blocked | gauge |
| 124 | net_queued_direct_messages | gauge |
| 125 | block_accumulator_known_child_blocks | gauge |
| 126 | contract_runtime_latest_commit_step | gauge |
| 127 | scheduler_queue_network_info_count | gauge |
| 128 | scheduler_queue_network_low_priority_count | gauge |
| 129 | scheduler_queue_regular_count | gauge |
| 130 | execution_queue_size | gauge |
| 131 | address_gossiper_table_items_current | gauge |
| 132 | address_gossiper_table_items_finished | gauge |
| 133 | finality_signature_gossiper_table_items_current | gauge |
| 134 | transaction_gossiper_table_items_finished | gauge |
| 135 | transaction_gossiper_table_items_current | gauge |
| 136 | finality_signature_gossiper_table_items_finished | gauge |
| 137 | allocated_ram_bytes | gauge |
| 138 | amount_of_blocks | gauge |
| 139 | transaction_buffer_total_transactions | gauge |
| 140 | transaction_buffer_held_transactions | gauge |
| 141 | transaction_buffer_dead_transactions | gauge |
| 142 | total_ram_bytes | gauge |
| 143 | time_of_last_finalized_block | gauge |
| 144 | time_of_last_block_payload | gauge |
| 145 | finalization_time | gauge |
| 146 | highest_available_block_height | gauge |
| 147 | scheduler_queue_validation_count | gauge |
| 148 | scheduler_queue_total_count | gauge |
| 149 | scheduler_queue_to_storage_count | gauge |
| 150 | scheduler_queue_sync_global_state_count | gauge |
| 151 | lowest_available_block_height | gauge |
| 152 | mem_address_gossiper | gauge |
| 153 | out_state_connected | gauge |
| 154 | mem_block_accumulator | gauge |
| 155 | mem_storage | gauge |
| 156 | mem_sync_leaper | gauge |
| 157 | mem_total | gauge |
| 158 | mem_transaction_acceptor | gauge |
| 159 | mem_transaction_buffer | gauge |
| 160 | mem_transaction_gossiper | gauge |
| 161 | mem_upgrade_watcher | gauge |
| 162 | mem_binary_port | gauge |
| 163 | consumed_ram_bytes | gauge |
| 164 | consensus_current_era | gauge |
| 165 | chain_height | gauge |
| 166 | mem_metrics | gauge |
| 167 | mem_rpc_server | gauge |
| 168 | mem_fetchers | gauge |
| 169 | mem_diagnostics_port | gauge |
| 170 | mem_contract_runtime | gauge |
| 171 | block_gossiper_table_items_finished | gauge |
| 172 | block_gossiper_table_items_current | gauge |
| 173 | mem_consensus | gauge |
| 174 | mem_finality_signature_gossiper | gauge |
| 175 | mem_block_synchronizer | gauge |
| 176 | mem_block_gossiper | gauge |
| 177 | mem_event_stream_server | gauge |
| 178 | mem_block_validator | gauge |
| 179 | event_dispatch_duration | histogram |
| 180 | contract_runtime_execution_results_checksum | histogram |
| 181 | contract_runtime_txn_approvals_hashes_calculation | histogram |
| 182 | contract_runtime_get_seigniorage_recipients | histogram |
| 183 | contract_runtime_get_round_seigniorage_rate | histogram |
| 184 | contract_runtime_get_trie | histogram |
| 185 | contract_runtime_get_era_validators | histogram |
| 186 | transaction_acceptor_rejected_transaction | histogram |
| 187 | transaction_acceptor_accepted_transaction | histogram |
| 188 | contract_runtime_get_balance | histogram |
| 189 | contract_runtime_get_all_values | histogram |
| 190 | contract_runtime_get_total_supply | histogram |
| 191 | contract_runtime_execute_block | histogram |
| 192 | forward_block_sync_duration_seconds | histogram |
| 193 | contract_runtime_put_trie | histogram |
| 194 | mem_estimator_runtime_s | histogram |
| 195 | contract_runtime_addressable_entity | histogram |
| 196 | contract_runtime_batch_pruning_time | histogram |
| 197 | contract_runtime_block_rewards_payout | histogram |
| 198 | contract_runtime_commit_genesis | histogram |
| 199 | contract_runtime_commit_step | histogram |
| 200 | contract_runtime_commit_upgrade | histogram |
| 201 | contract_runtime_scratch_lmdb_write_time | histogram |
| 202 | contract_runtime_db_flush_time | histogram |
| 203 | contract_runtime_exec_block_post_proc | histogram |
| 204 | contract_runtime_exec_block_pre_proc | histogram |
| 205 | contract_runtime_exec_block_step_proc | histogram |
| 206 | contract_runtime_run_query | histogram |
| 207 | sync_leap_duration_seconds | histogram |
| 208 | contract_runtime_run_query_by_prefix | histogram |
| 209 | historical_block_sync_duration_seconds | histogram |
| 210 | contract_runtime_exec_wasm_v1 | histogram |
| 211 | contract_runtime_entry_point | histogram |
| 212 | contract_runtime_exec_block_total_proc | histogram |
Ed gave the following direction on the L1 Daily on 17th Sep;
- Determine if any of the counters and gauges should not be turned into histograms, and then turn all the remaining ones into histograms.
- Validate and adjust bucket sizing for metrics.
Could be changed to histograms:
-
accumulated_incoming_limiter_delay (counter) Track the distribution of delay times experienced in the incoming traffic to the node [0.1, 0.5, 1, 2, 5, 10] seconds.
-
accumulated_outgoing_limiter_delay (counter) Distribution of delay times in the outgoing traffic. [0.1, 0.5, 1, 2, 5, 10] seconds.
Hi @sacherjj ,
Ed gave the following direction on the L1 Daily on 17th Sep. Could you help look into this, please?
- Determine if any of the counters and gauges should not be turned into histograms, and then turn all the remaining ones into histograms.
- Validate and adjust bucket sizing for metrics.
To low of buckets (takes MUCH longer on network with tons of delegations):
# HELP contract_runtime_commit_step time in seconds to commit the step at era end
# TYPE contract_runtime_commit_step histogram
contract_runtime_commit_step_bucket{le="0.01"} 0
contract_runtime_commit_step_bucket{le="0.02"} 0
contract_runtime_commit_step_bucket{le="0.04"} 0
contract_runtime_commit_step_bucket{le="0.08"} 0
contract_runtime_commit_step_bucket{le="0.16"} 0
contract_runtime_commit_step_bucket{le="0.32"} 383
contract_runtime_commit_step_bucket{le="0.64"} 383
contract_runtime_commit_step_bucket{le="1.28"} 383
contract_runtime_commit_step_bucket{le="2.56"} 383
contract_runtime_commit_step_bucket{le="5.12"} 383
contract_runtime_commit_step_bucket{le="+Inf"} 383
contract_runtime_commit_step_sum 73.34488397599996
contract_runtime_commit_step_count 383
Convert to:
# HELP contract_runtime_commit_step time in seconds to commit the step at era end
# TYPE contract_runtime_commit_step histogram
contract_runtime_commit_step_bucket{le="0.2"} 0
contract_runtime_commit_step_bucket{le="0.4"} 0
contract_runtime_commit_step_bucket{le="0.8"} 0
contract_runtime_commit_step_bucket{le="1.6"} 0
contract_runtime_commit_step_bucket{le="3.2"} 383
contract_runtime_commit_step_bucket{le="6.4"} 383
contract_runtime_commit_step_bucket{le="12.8"} 383
contract_runtime_commit_step_bucket{le="25.6"} 383
contract_runtime_commit_step_bucket{le="51.2"} 383
contract_runtime_commit_step_bucket{le="+Inf"} 383
contract_runtime_commit_step_sum 73.34488397599996
contract_runtime_commit_step_count 383
Not sure what this metric is, but if in seconds do we expect to have a transaction that takes an hour and a half to get accepted? Seems like we need smaller numbers, but I don't understand the metric fully.
# HELP transaction_acceptor_accepted_transaction time in seconds to accept a transaction in the transaction acceptor
# TYPE transaction_acceptor_accepted_transaction histogram
transaction_acceptor_accepted_transaction_bucket{le="10"} 562
transaction_acceptor_accepted_transaction_bucket{le="20"} 575
transaction_acceptor_accepted_transaction_bucket{le="40"} 575
transaction_acceptor_accepted_transaction_bucket{le="80"} 578
transaction_acceptor_accepted_transaction_bucket{le="160"} 579
transaction_acceptor_accepted_transaction_bucket{le="320"} 579
transaction_acceptor_accepted_transaction_bucket{le="640"} 579
transaction_acceptor_accepted_transaction_bucket{le="1280"} 579
transaction_acceptor_accepted_transaction_bucket{le="2560"} 579
transaction_acceptor_accepted_transaction_bucket{le="5120"} 579
transaction_acceptor_accepted_transaction_bucket{le="+Inf"} 579
transaction_acceptor_accepted_transaction_sum 3076
transaction_acceptor_accepted_transaction_count 579
Reduce:
# HELP mem_estimator_runtime_s time in seconds to estimate memory usage
# TYPE mem_estimator_runtime_s histogram
mem_estimator_runtime_s_bucket{le="0.000000004"} 0
mem_estimator_runtime_s_bucket{le="0.000000008"} 0
mem_estimator_runtime_s_bucket{le="0.000000016"} 0
mem_estimator_runtime_s_bucket{le="0.000000032"} 0
mem_estimator_runtime_s_bucket{le="0.000000064"} 0
mem_estimator_runtime_s_bucket{le="0.000000128"} 0
mem_estimator_runtime_s_bucket{le="0.000000256"} 0
mem_estimator_runtime_s_bucket{le="0.000000512"} 0
mem_estimator_runtime_s_bucket{le="0.000001024"} 0
mem_estimator_runtime_s_bucket{le="0.000002048"} 0
mem_estimator_runtime_s_bucket{le="0.000004096"} 0
mem_estimator_runtime_s_bucket{le="0.000008192"} 0
mem_estimator_runtime_s_bucket{le="0.000016384"} 0
mem_estimator_runtime_s_bucket{le="0.000032768"} 0
mem_estimator_runtime_s_bucket{le="0.000065536"} 0
mem_estimator_runtime_s_bucket{le="0.000131072"} 0
mem_estimator_runtime_s_bucket{le="0.000262144"} 0
mem_estimator_runtime_s_bucket{le="0.000524288"} 0
mem_estimator_runtime_s_bucket{le="0.001048576"} 0
mem_estimator_runtime_s_bucket{le="0.002097152"} 3145
mem_estimator_runtime_s_bucket{le="0.004194304"} 122148
mem_estimator_runtime_s_bucket{le="0.008388608"} 122240
mem_estimator_runtime_s_bucket{le="0.016777216"} 122242
mem_estimator_runtime_s_bucket{le="0.033554432"} 122242
mem_estimator_runtime_s_bucket{le="0.067108864"} 122242
mem_estimator_runtime_s_bucket{le="0.134217728"} 122242
mem_estimator_runtime_s_bucket{le="0.268435456"} 122242
mem_estimator_runtime_s_bucket{le="0.536870912"} 122242
mem_estimator_runtime_s_bucket{le="1.073741824"} 122242
mem_estimator_runtime_s_bucket{le="2.147483648"} 122242
mem_estimator_runtime_s_bucket{le="4.294967296"} 122242
mem_estimator_runtime_s_bucket{le="8.589934592"} 122242
To:
# HELP mem_estimator_runtime_s time in seconds to estimate memory usage
# TYPE mem_estimator_runtime_s histogram
mem_estimator_runtime_s_bucket{le="0.0004"} 0
mem_estimator_runtime_s_bucket{le="0.0008"} 0
mem_estimator_runtime_s_bucket{le="0.0016"} 0
mem_estimator_runtime_s_bucket{le="0.0032"} 0
mem_estimator_runtime_s_bucket{le="0.0064"} 0
mem_estimator_runtime_s_bucket{le="0.0128"} 0
mem_estimator_runtime_s_bucket{le="0.0256"} 0
mem_estimator_runtime_s_bucket{le="0.0512"} 0
mem_estimator_runtime_s_bucket{le="0.1024"} 0
mem_estimator_runtime_s_bucket{le="0.2048"} 0
mem_estimator_runtime_s_bucket{le="0.4096"} 0
mem_estimator_runtime_s_bucket{le="0.8192"} 0
mem_estimator_runtime_s_bucket{le="1.6384"} 0
Do we programmatically calculate histogram? This seems like it could be much simpler:
# HELP historical_block_sync_duration_seconds duration (in sec) to synchronize a historical block
# TYPE historical_block_sync_duration_seconds histogram
historical_block_sync_duration_seconds_bucket{le="0.05"} 0
historical_block_sync_duration_seconds_bucket{le="0.08750000000000001"} 0
historical_block_sync_duration_seconds_bucket{le="0.153125"} 0
historical_block_sync_duration_seconds_bucket{le="0.26796875000000003"} 0
historical_block_sync_duration_seconds_bucket{le="0.46894531250000004"} 0
historical_block_sync_duration_seconds_bucket{le="0.8206542968750001"} 0
historical_block_sync_duration_seconds_bucket{le="1.4361450195312502"} 28
historical_block_sync_duration_seconds_bucket{le="2.513253784179688"} 28
historical_block_sync_duration_seconds_bucket{le="4.398194122314454"} 28
historical_block_sync_duration_seconds_bucket{le="7.696839714050294"} 28
historical_block_sync_duration_seconds_bucket{le="+Inf"} 28
historical_block_sync_duration_seconds_sum 29.006609401
historical_block_sync_duration_seconds_count 28
Change to:
# HELP historical_block_sync_duration_seconds duration (in sec) to synchronize a historical block
# TYPE historical_block_sync_duration_seconds histogram
historical_block_sync_duration_seconds_bucket{le="0.2"} 0
historical_block_sync_duration_seconds_bucket{le="0.4"} 0
historical_block_sync_duration_seconds_bucket{le="0.8"} 0
historical_block_sync_duration_seconds_bucket{le="1.2"} 0
historical_block_sync_duration_seconds_bucket{le="1.6"} 0
historical_block_sync_duration_seconds_bucket{le="2"} 0
historical_block_sync_duration_seconds_bucket{le="4"} 0
historical_block_sync_duration_seconds_bucket{le="8"} 0
historical_block_sync_duration_seconds_bucket{le="+Inf"} 28
historical_block_sync_duration_seconds_sum 29.006609401
historical_block_sync_duration_seconds_count 28
Another to simplify, but values are not bad.
# HELP forward_block_sync_duration_seconds duration (in sec) to synchronize a forward block
# TYPE forward_block_sync_duration_seconds histogram
forward_block_sync_duration_seconds_bucket{le="0.05"} 0
forward_block_sync_duration_seconds_bucket{le="0.08750000000000001"} 3
forward_block_sync_duration_seconds_bucket{le="0.153125"} 1621
forward_block_sync_duration_seconds_bucket{le="0.26796875000000003"} 87641
forward_block_sync_duration_seconds_bucket{le="0.46894531250000004"} 297390
forward_block_sync_duration_seconds_bucket{le="0.8206542968750001"} 399370
forward_block_sync_duration_seconds_bucket{le="1.4361450195312502"} 558702
forward_block_sync_duration_seconds_bucket{le="2.513253784179688"} 619611
forward_block_sync_duration_seconds_bucket{le="4.398194122314454"} 625065
forward_block_sync_duration_seconds_bucket{le="7.696839714050294"} 625210
forward_block_sync_duration_seconds_bucket{le="+Inf"} 625762
forward_block_sync_duration_seconds_sum 465421.15746638214
forward_block_sync_duration_seconds_count 625762
# HELP forward_block_sync_duration_seconds duration (in sec) to synchronize a forward block
# TYPE forward_block_sync_duration_seconds histogram
forward_block_sync_duration_seconds_bucket{le="0.05"} 0
forward_block_sync_duration_seconds_bucket{le="0.1"} 3
forward_block_sync_duration_seconds_bucket{le="0.2"} 1621
forward_block_sync_duration_seconds_bucket{le="0.4"} 87641
forward_block_sync_duration_seconds_bucket{le="0.8"} 297390
forward_block_sync_duration_seconds_bucket{le="1.6"} 558702
forward_block_sync_duration_seconds_bucket{le="3.2"} 619611
forward_block_sync_duration_seconds_bucket{le="6.4"} 625065
forward_block_sync_duration_seconds_bucket{le="12.8"} 625210
forward_block_sync_duration_seconds_bucket{le="+Inf"} 625762
forward_block_sync_duration_seconds_sum 465421.15746638214
forward_block_sync_duration_seconds_count 625762
Thank you @sacherjj for adding your analysis findings. Reassigning the ticket to @darthsiroftardis as discussed in the sprint entry.