casper-node icon indicating copy to clipboard operation
casper-node copied to clipboard

Spike: Review non-histogram metrics to deteremine if they should be histograms

Open RitaMAllenCA opened this issue 2 years ago • 7 comments
trafficstars

Correlating different time-related metrics can provide insights into how to tune certain parameters or indicate issues. We need to have a review off all time-related metrics and make sure we use histograms for them. Have a pass through all these time related metrics that are related and decide if we should normalize their parameters (number of buckets, bucket witdths). Look for correlation between this metrics and add some determinations or other compound metrics that would show anomalies given system invariants. For example block time should be constant (or at least within a certain range of lanes) so correlating time based metrics from the components involved in the production of blocks (block execution, finalized block creation, event queue latency for the related messages etc) would provide insights of if and how we reach the required deadline.

Deliverable is finding which metrics are currently useful to engineers and which are not, and if any are missing.

RitaMAllenCA avatar Mar 17 '23 14:03 RitaMAllenCA

@alsrdn needs an estimate

RitaMAllenCA avatar Mar 17 '23 14:03 RitaMAllenCA

where are we NOT using histograms but should be? What other kind of trend analysis should we do? ' The goal is to have a minimum number of metrics that are the most impactful for drawing conclusions.

RitaMAllenCA avatar Mar 24 '23 16:03 RitaMAllenCA

Possible Teonite ticket? @devendran-m

devendran-m avatar Sep 13 '24 17:09 devendran-m

List of all metrics + type:

Click to expand
No. Metric Name Type
1 accumulated_incoming_limiter_delay counter
2 net_in_bytes_protocol counter
3 net_in_bytes_trie_transfer counter
4 net_in_count_address_gossip counter
5 net_in_count_block_gossip counter
6 net_in_count_block_transfer counter
7 net_in_count_consensus counter
8 net_in_bytes_other counter
9 net_in_count_deploy_gossip counter
10 net_in_count_finality_signature_gossip counter
11 net_in_count_other counter
12 net_in_count_protocol counter
13 net_in_count_trie_transfer counter
14 net_out_bytes_address_gossip counter
15 net_out_bytes_block_gossip counter
16 net_in_count_deploy_transfer counter
17 net_out_bytes_block_transfer counter
18 net_in_bytes_finality_signature_gossip counter
19 net_in_bytes_deploy_gossip counter
20 finality_signature_gossiper_items_received counter
21 finality_signature_gossiper_times_gossiped counter
22 finality_signature_gossiper_times_ran_out_of_peers counter
23 legacy_deploy_fetch_total counter
24 legacy_deploy_found_in_storage counter
25 legacy_deploy_found_on_peer counter
26 net_in_bytes_deploy_transfer counter
27 legacy_deploy_timeouts counter
28 net_broadcast_requests counter
29 net_direct_message_requests counter
30 net_in_bytes_address_gossip counter
31 net_in_bytes_block_gossip counter
32 net_in_bytes_block_transfer counter
33 net_in_bytes_consensus counter
34 trie_or_chunk_found_on_peer counter
35 net_out_bytes_consensus counter
36 net_out_bytes_deploy_gossip counter
37 net_out_bytes_deploy_transfer counter
38 sync_leap_fetcher_fetch_total counter
39 sync_leap_fetcher_found_in_storage counter
40 sync_leap_fetcher_found_on_peer counter
41 sync_leap_fetcher_timeouts counter
42 sync_leap_rejected_by_peer_total counter
43 transaction_fetch_total counter
44 sync_leap_fetched_from_peer_total counter
45 transaction_found_in_storage counter
46 transaction_gossiper_items_received counter
47 transaction_gossiper_times_gossiped counter
48 transaction_gossiper_times_ran_out_of_peers counter
49 transaction_timeouts counter
50 trie_or_chunk_fetch_total counter
51 trie_or_chunk_found_in_storage counter
52 transaction_found_on_peer counter
53 sync_leap_cant_fetch_total counter
54 runner_events counter
55 requests_for_trie_finished counter
56 net_out_bytes_finality_signature_gossip counter
57 net_out_bytes_other counter
58 net_out_bytes_protocol counter
59 net_out_bytes_trie_transfer counter
60 net_out_count_address_gossip counter
61 net_out_count_block_gossip counter
62 net_out_count_block_transfer counter
63 net_out_count_consensus counter
64 net_out_count_deploy_gossip counter
65 net_out_count_deploy_transfer counter
66 net_out_count_finality_signature_gossip counter
67 net_out_count_other counter
68 net_out_count_protocol counter
69 net_out_count_trie_transfer counter
70 requests_for_trie_accepted counter
71 finality_signature_fetcher_timeouts counter
72 finality_signature_fetcher_found_on_peer counter
73 trie_or_chunk_timeouts counter
74 finality_signature_fetcher_fetch_total counter
75 block_header_found_on_peer counter
76 block_header_found_in_storage counter
77 block_header_fetch_total counter
78 finality_signature_fetcher_found_in_storage counter
79 block_gossiper_times_gossiped counter
80 block_gossiper_items_received counter
81 block_found_on_peer counter
82 block_found_in_storage counter
83 block_fetch_total counter
84 block_execution_results_or_chunk_fetcher_timeouts counter
85 block_execution_results_or_chunk_fetcher_found_on_peer counter
86 block_execution_results_or_chunk_fetcher_found_in_storage counter
87 block_execution_results_or_chunk_fetcher_fetch_total counter
88 binary_port_try_speculative_exec_count counter
89 binary_port_try_accept_transaction_count counter
90 binary_port_get_state_count counter
91 binary_port_get_record_count counter
92 binary_port_get_info_count counter
93 binary_port_connections_count counter
94 approvals_hashes_timeouts counter
95 approvals_hashes_found_on_peer counter
96 approvals_hashes_found_in_storage counter
97 approvals_hashes_fetch_total counter
98 address_gossiper_times_ran_out_of_peers counter
99 address_gossiper_times_gossiped counter
100 address_gossiper_items_received counter
101 accumulated_outgoing_limiter_delay counter
102 block_header_timeouts counter
103 block_timeouts counter
104 block_gossiper_times_ran_out_of_peers counter
105 mem_rest_server gauge
106 scheduler_queue_network_demands_count gauge
107 scheduler_queue_network_count gauge
108 scheduler_queue_gossip_count gauge
109 scheduler_queue_from_storage_count gauge
110 scheduler_queue_finality_signature_count gauge
111 scheduler_queue_fetch_count gauge
112 scheduler_queue_control_count gauge
113 scheduler_queue_contract_runtime_count gauge
114 scheduler_queue_consensus_count gauge
115 scheduler_queue_network_incoming_count gauge
116 scheduler_queue_api_count gauge
117 block_accumulator_block_acceptors gauge
118 peers gauge
119 out_state_waiting gauge
120 out_state_loopback gauge
121 out_state_connecting gauge
122 mem_net gauge
123 out_state_blocked gauge
124 net_queued_direct_messages gauge
125 block_accumulator_known_child_blocks gauge
126 contract_runtime_latest_commit_step gauge
127 scheduler_queue_network_info_count gauge
128 scheduler_queue_network_low_priority_count gauge
129 scheduler_queue_regular_count gauge
130 execution_queue_size gauge
131 address_gossiper_table_items_current gauge
132 address_gossiper_table_items_finished gauge
133 finality_signature_gossiper_table_items_current gauge
134 transaction_gossiper_table_items_finished gauge
135 transaction_gossiper_table_items_current gauge
136 finality_signature_gossiper_table_items_finished gauge
137 allocated_ram_bytes gauge
138 amount_of_blocks gauge
139 transaction_buffer_total_transactions gauge
140 transaction_buffer_held_transactions gauge
141 transaction_buffer_dead_transactions gauge
142 total_ram_bytes gauge
143 time_of_last_finalized_block gauge
144 time_of_last_block_payload gauge
145 finalization_time gauge
146 highest_available_block_height gauge
147 scheduler_queue_validation_count gauge
148 scheduler_queue_total_count gauge
149 scheduler_queue_to_storage_count gauge
150 scheduler_queue_sync_global_state_count gauge
151 lowest_available_block_height gauge
152 mem_address_gossiper gauge
153 out_state_connected gauge
154 mem_block_accumulator gauge
155 mem_storage gauge
156 mem_sync_leaper gauge
157 mem_total gauge
158 mem_transaction_acceptor gauge
159 mem_transaction_buffer gauge
160 mem_transaction_gossiper gauge
161 mem_upgrade_watcher gauge
162 mem_binary_port gauge
163 consumed_ram_bytes gauge
164 consensus_current_era gauge
165 chain_height gauge
166 mem_metrics gauge
167 mem_rpc_server gauge
168 mem_fetchers gauge
169 mem_diagnostics_port gauge
170 mem_contract_runtime gauge
171 block_gossiper_table_items_finished gauge
172 block_gossiper_table_items_current gauge
173 mem_consensus gauge
174 mem_finality_signature_gossiper gauge
175 mem_block_synchronizer gauge
176 mem_block_gossiper gauge
177 mem_event_stream_server gauge
178 mem_block_validator gauge
179 event_dispatch_duration histogram
180 contract_runtime_execution_results_checksum histogram
181 contract_runtime_txn_approvals_hashes_calculation histogram
182 contract_runtime_get_seigniorage_recipients histogram
183 contract_runtime_get_round_seigniorage_rate histogram
184 contract_runtime_get_trie histogram
185 contract_runtime_get_era_validators histogram
186 transaction_acceptor_rejected_transaction histogram
187 transaction_acceptor_accepted_transaction histogram
188 contract_runtime_get_balance histogram
189 contract_runtime_get_all_values histogram
190 contract_runtime_get_total_supply histogram
191 contract_runtime_execute_block histogram
192 forward_block_sync_duration_seconds histogram
193 contract_runtime_put_trie histogram
194 mem_estimator_runtime_s histogram
195 contract_runtime_addressable_entity histogram
196 contract_runtime_batch_pruning_time histogram
197 contract_runtime_block_rewards_payout histogram
198 contract_runtime_commit_genesis histogram
199 contract_runtime_commit_step histogram
200 contract_runtime_commit_upgrade histogram
201 contract_runtime_scratch_lmdb_write_time histogram
202 contract_runtime_db_flush_time histogram
203 contract_runtime_exec_block_post_proc histogram
204 contract_runtime_exec_block_pre_proc histogram
205 contract_runtime_exec_block_step_proc histogram
206 contract_runtime_run_query histogram
207 sync_leap_duration_seconds histogram
208 contract_runtime_run_query_by_prefix histogram
209 historical_block_sync_duration_seconds histogram
210 contract_runtime_exec_wasm_v1 histogram
211 contract_runtime_entry_point histogram
212 contract_runtime_exec_block_total_proc histogram

przemyslaw avatar Sep 17 '24 14:09 przemyslaw

Ed gave the following direction on the L1 Daily on 17th Sep;

  • Determine if any of the counters and gauges should not be turned into histograms, and then turn all the remaining ones into histograms.
  • Validate and adjust bucket sizing for metrics.

devendran-m avatar Sep 18 '24 12:09 devendran-m

Could be changed to histograms:

  1. accumulated_incoming_limiter_delay (counter) Track the distribution of delay times experienced in the incoming traffic to the node [0.1, 0.5, 1, 2, 5, 10] seconds.

  2. accumulated_outgoing_limiter_delay (counter) Distribution of delay times in the outgoing traffic. [0.1, 0.5, 1, 2, 5, 10] seconds.

przemyslaw avatar Sep 25 '24 11:09 przemyslaw

Hi @sacherjj ,

Ed gave the following direction on the L1 Daily on 17th Sep. Could you help look into this, please?

  • Determine if any of the counters and gauges should not be turned into histograms, and then turn all the remaining ones into histograms.
  • Validate and adjust bucket sizing for metrics.

devendran-m avatar Sep 25 '24 14:09 devendran-m

To low of buckets (takes MUCH longer on network with tons of delegations):

# HELP contract_runtime_commit_step time in seconds to commit the step at era end
# TYPE contract_runtime_commit_step histogram
contract_runtime_commit_step_bucket{le="0.01"} 0
contract_runtime_commit_step_bucket{le="0.02"} 0
contract_runtime_commit_step_bucket{le="0.04"} 0
contract_runtime_commit_step_bucket{le="0.08"} 0
contract_runtime_commit_step_bucket{le="0.16"} 0
contract_runtime_commit_step_bucket{le="0.32"} 383
contract_runtime_commit_step_bucket{le="0.64"} 383
contract_runtime_commit_step_bucket{le="1.28"} 383
contract_runtime_commit_step_bucket{le="2.56"} 383
contract_runtime_commit_step_bucket{le="5.12"} 383
contract_runtime_commit_step_bucket{le="+Inf"} 383
contract_runtime_commit_step_sum 73.34488397599996
contract_runtime_commit_step_count 383

Convert to:

# HELP contract_runtime_commit_step time in seconds to commit the step at era end
# TYPE contract_runtime_commit_step histogram
contract_runtime_commit_step_bucket{le="0.2"} 0
contract_runtime_commit_step_bucket{le="0.4"} 0
contract_runtime_commit_step_bucket{le="0.8"} 0
contract_runtime_commit_step_bucket{le="1.6"} 0
contract_runtime_commit_step_bucket{le="3.2"} 383
contract_runtime_commit_step_bucket{le="6.4"} 383
contract_runtime_commit_step_bucket{le="12.8"} 383
contract_runtime_commit_step_bucket{le="25.6"} 383
contract_runtime_commit_step_bucket{le="51.2"} 383
contract_runtime_commit_step_bucket{le="+Inf"} 383
contract_runtime_commit_step_sum 73.34488397599996
contract_runtime_commit_step_count 383

sacherjj avatar Oct 21 '24 15:10 sacherjj

Not sure what this metric is, but if in seconds do we expect to have a transaction that takes an hour and a half to get accepted? Seems like we need smaller numbers, but I don't understand the metric fully.

# HELP transaction_acceptor_accepted_transaction time in seconds to accept a transaction in the transaction acceptor
# TYPE transaction_acceptor_accepted_transaction histogram
transaction_acceptor_accepted_transaction_bucket{le="10"} 562
transaction_acceptor_accepted_transaction_bucket{le="20"} 575
transaction_acceptor_accepted_transaction_bucket{le="40"} 575
transaction_acceptor_accepted_transaction_bucket{le="80"} 578
transaction_acceptor_accepted_transaction_bucket{le="160"} 579
transaction_acceptor_accepted_transaction_bucket{le="320"} 579
transaction_acceptor_accepted_transaction_bucket{le="640"} 579
transaction_acceptor_accepted_transaction_bucket{le="1280"} 579
transaction_acceptor_accepted_transaction_bucket{le="2560"} 579
transaction_acceptor_accepted_transaction_bucket{le="5120"} 579
transaction_acceptor_accepted_transaction_bucket{le="+Inf"} 579
transaction_acceptor_accepted_transaction_sum 3076
transaction_acceptor_accepted_transaction_count 579

sacherjj avatar Oct 21 '24 15:10 sacherjj

Reduce:

# HELP mem_estimator_runtime_s time in seconds to estimate memory usage
# TYPE mem_estimator_runtime_s histogram
mem_estimator_runtime_s_bucket{le="0.000000004"} 0
mem_estimator_runtime_s_bucket{le="0.000000008"} 0
mem_estimator_runtime_s_bucket{le="0.000000016"} 0
mem_estimator_runtime_s_bucket{le="0.000000032"} 0
mem_estimator_runtime_s_bucket{le="0.000000064"} 0
mem_estimator_runtime_s_bucket{le="0.000000128"} 0
mem_estimator_runtime_s_bucket{le="0.000000256"} 0
mem_estimator_runtime_s_bucket{le="0.000000512"} 0
mem_estimator_runtime_s_bucket{le="0.000001024"} 0
mem_estimator_runtime_s_bucket{le="0.000002048"} 0
mem_estimator_runtime_s_bucket{le="0.000004096"} 0
mem_estimator_runtime_s_bucket{le="0.000008192"} 0
mem_estimator_runtime_s_bucket{le="0.000016384"} 0
mem_estimator_runtime_s_bucket{le="0.000032768"} 0
mem_estimator_runtime_s_bucket{le="0.000065536"} 0
mem_estimator_runtime_s_bucket{le="0.000131072"} 0
mem_estimator_runtime_s_bucket{le="0.000262144"} 0
mem_estimator_runtime_s_bucket{le="0.000524288"} 0
mem_estimator_runtime_s_bucket{le="0.001048576"} 0
mem_estimator_runtime_s_bucket{le="0.002097152"} 3145
mem_estimator_runtime_s_bucket{le="0.004194304"} 122148
mem_estimator_runtime_s_bucket{le="0.008388608"} 122240
mem_estimator_runtime_s_bucket{le="0.016777216"} 122242
mem_estimator_runtime_s_bucket{le="0.033554432"} 122242
mem_estimator_runtime_s_bucket{le="0.067108864"} 122242
mem_estimator_runtime_s_bucket{le="0.134217728"} 122242
mem_estimator_runtime_s_bucket{le="0.268435456"} 122242
mem_estimator_runtime_s_bucket{le="0.536870912"} 122242
mem_estimator_runtime_s_bucket{le="1.073741824"} 122242
mem_estimator_runtime_s_bucket{le="2.147483648"} 122242
mem_estimator_runtime_s_bucket{le="4.294967296"} 122242
mem_estimator_runtime_s_bucket{le="8.589934592"} 122242

To:

# HELP mem_estimator_runtime_s time in seconds to estimate memory usage
# TYPE mem_estimator_runtime_s histogram
mem_estimator_runtime_s_bucket{le="0.0004"} 0
mem_estimator_runtime_s_bucket{le="0.0008"} 0
mem_estimator_runtime_s_bucket{le="0.0016"} 0
mem_estimator_runtime_s_bucket{le="0.0032"} 0
mem_estimator_runtime_s_bucket{le="0.0064"} 0
mem_estimator_runtime_s_bucket{le="0.0128"} 0
mem_estimator_runtime_s_bucket{le="0.0256"} 0
mem_estimator_runtime_s_bucket{le="0.0512"} 0
mem_estimator_runtime_s_bucket{le="0.1024"} 0
mem_estimator_runtime_s_bucket{le="0.2048"} 0
mem_estimator_runtime_s_bucket{le="0.4096"} 0
mem_estimator_runtime_s_bucket{le="0.8192"} 0
mem_estimator_runtime_s_bucket{le="1.6384"} 0

sacherjj avatar Oct 21 '24 15:10 sacherjj

Do we programmatically calculate histogram? This seems like it could be much simpler:

# HELP historical_block_sync_duration_seconds duration (in sec) to synchronize a historical block
# TYPE historical_block_sync_duration_seconds histogram
historical_block_sync_duration_seconds_bucket{le="0.05"} 0
historical_block_sync_duration_seconds_bucket{le="0.08750000000000001"} 0
historical_block_sync_duration_seconds_bucket{le="0.153125"} 0
historical_block_sync_duration_seconds_bucket{le="0.26796875000000003"} 0
historical_block_sync_duration_seconds_bucket{le="0.46894531250000004"} 0
historical_block_sync_duration_seconds_bucket{le="0.8206542968750001"} 0
historical_block_sync_duration_seconds_bucket{le="1.4361450195312502"} 28
historical_block_sync_duration_seconds_bucket{le="2.513253784179688"} 28
historical_block_sync_duration_seconds_bucket{le="4.398194122314454"} 28
historical_block_sync_duration_seconds_bucket{le="7.696839714050294"} 28
historical_block_sync_duration_seconds_bucket{le="+Inf"} 28
historical_block_sync_duration_seconds_sum 29.006609401
historical_block_sync_duration_seconds_count 28

Change to:

# HELP historical_block_sync_duration_seconds duration (in sec) to synchronize a historical block
# TYPE historical_block_sync_duration_seconds histogram
historical_block_sync_duration_seconds_bucket{le="0.2"} 0
historical_block_sync_duration_seconds_bucket{le="0.4"} 0
historical_block_sync_duration_seconds_bucket{le="0.8"} 0
historical_block_sync_duration_seconds_bucket{le="1.2"} 0
historical_block_sync_duration_seconds_bucket{le="1.6"} 0
historical_block_sync_duration_seconds_bucket{le="2"} 0
historical_block_sync_duration_seconds_bucket{le="4"} 0
historical_block_sync_duration_seconds_bucket{le="8"} 0
historical_block_sync_duration_seconds_bucket{le="+Inf"} 28
historical_block_sync_duration_seconds_sum 29.006609401
historical_block_sync_duration_seconds_count 28

sacherjj avatar Oct 21 '24 15:10 sacherjj

Another to simplify, but values are not bad.

# HELP forward_block_sync_duration_seconds duration (in sec) to synchronize a forward block
# TYPE forward_block_sync_duration_seconds histogram
forward_block_sync_duration_seconds_bucket{le="0.05"} 0
forward_block_sync_duration_seconds_bucket{le="0.08750000000000001"} 3
forward_block_sync_duration_seconds_bucket{le="0.153125"} 1621
forward_block_sync_duration_seconds_bucket{le="0.26796875000000003"} 87641
forward_block_sync_duration_seconds_bucket{le="0.46894531250000004"} 297390
forward_block_sync_duration_seconds_bucket{le="0.8206542968750001"} 399370
forward_block_sync_duration_seconds_bucket{le="1.4361450195312502"} 558702
forward_block_sync_duration_seconds_bucket{le="2.513253784179688"} 619611
forward_block_sync_duration_seconds_bucket{le="4.398194122314454"} 625065
forward_block_sync_duration_seconds_bucket{le="7.696839714050294"} 625210
forward_block_sync_duration_seconds_bucket{le="+Inf"} 625762
forward_block_sync_duration_seconds_sum 465421.15746638214
forward_block_sync_duration_seconds_count 625762
# HELP forward_block_sync_duration_seconds duration (in sec) to synchronize a forward block
# TYPE forward_block_sync_duration_seconds histogram
forward_block_sync_duration_seconds_bucket{le="0.05"} 0
forward_block_sync_duration_seconds_bucket{le="0.1"} 3
forward_block_sync_duration_seconds_bucket{le="0.2"} 1621
forward_block_sync_duration_seconds_bucket{le="0.4"} 87641
forward_block_sync_duration_seconds_bucket{le="0.8"} 297390
forward_block_sync_duration_seconds_bucket{le="1.6"} 558702
forward_block_sync_duration_seconds_bucket{le="3.2"} 619611
forward_block_sync_duration_seconds_bucket{le="6.4"} 625065
forward_block_sync_duration_seconds_bucket{le="12.8"} 625210
forward_block_sync_duration_seconds_bucket{le="+Inf"} 625762
forward_block_sync_duration_seconds_sum 465421.15746638214
forward_block_sync_duration_seconds_count 625762

sacherjj avatar Oct 21 '24 15:10 sacherjj

Thank you @sacherjj for adding your analysis findings. Reassigning the ticket to @darthsiroftardis as discussed in the sprint entry.

devendran-m avatar Oct 22 '24 10:10 devendran-m