harvest icon indicating copy to clipboard operation
harvest copied to clipboard

Harvest should improve CI to detect missing metrics

Open cgrinds opened this issue 1 year ago • 2 comments

Recent example #2819

  • [ ] Run the harvest REST/ZAPI diff CLI and analyze result

cgrinds avatar Apr 09 '24 15:04 cgrinds

Consider catching examples like #2816

rahulguptajss avatar Apr 10 '24 08:04 rahulguptajss

Below is the summary for Rest/Zapi gaps

Most of these are known.

Missing Metrics by Object in Prometheus

  • flexcache: [flexcache_reconciled_data_entries, flexcache_evict_skipped_reason_offline, flexcache_blocks_retrieved_from_origin, flexcache_reconciled_lock_entries, flexcache_nix_skipped_reason_in_progress, flexcache_nix_skipped_reason_disconnected, flexcache_nix_retry_skipped_reason_initiator_retrieve, flexcache_evict_skipped_reason_disconnected, flexcache_nix_skipped_reason_offline, flexcache_nix_skipped_reason_config_noent, flexcache_evict_rw_cache_skipped_reason_disconnected, flexcache_invalidate_skipped_reason_config_noent, flexcache_invalidate_skipped_reason_disconnected, flexcache_evict_skipped_reason_config_noent, flexcache_invalidate_skipped_reason_offline, flexcache_blocks_requested_from_client, flexcache_miss_percent]

Reason: Missing flexcache template in Rest

  • iscsi: [iscsi_lif_cmd_transfered]

Reason: Rest publishes with correct spelling iscsi_lif_cmd_transferred

  • rw: [rw_ctx_nfs_rewinds, rw_ctx_qos_flowcontrol, rw_ctx_nfs_giveups, rw_ctx_qos_rewinds, rw_ctx_cifs_giveups, rw_ctx_cifs_rewinds]

Reason: Missing template in Rest

  • qtree: [qtree_threshold]

Reason: Not available in Rest. The REST API provides space.soft_limit which has similar functionality.

  • wafl: [wafl_reads_from_pmem]

Reason: Not available in Rest

  • security: [security_audit_destination_port]

Reason: Not available in Rest

  • aggr: [aggr_hybrid_cache_size_total, aggr_snapshot_inode_used_percent, aggr_space_reserved]

Reason: Not available in Rest

  • node: [node_nvme_fc_data_sent, node_nvme_fc_ops, node_nvme_fc_data_recv]

Reason We are publishing these in Zapi and Rest as node_nvmf_data_sent,node_nvmf_ops and node_nvmf_data_recv. But these are old names. Should we use new nvme names?

  • quota: [quota_threshold]

Reason: Not available in Rest. The REST API provides space.soft_limit which has similar functionality.

Missing Metrics from Dashboard

  • aggr: [aggr_snapshot_inode_used_percent]
  • flexcache: [flexcache_reconciled_data_entries, flexcache_evict_skipped_reason_offline, flexcache_blocks_retrieved_from_origin, flexcache_reconciled_lock_entries, flexcache_nix_skipped_reason_in_progress, flexcache_nix_skipped_reason_disconnected, flexcache_nix_retry_skipped_reason_initiator_retrieve, flexcache_evict_skipped_reason_disconnected, flexcache_nix_skipped_reason_offline, flexcache_nix_skipped_reason_config_noent, flexcache_evict_rw_cache_skipped_reason_disconnected, flexcache_invalidate_skipped_reason_config_noent, flexcache_invalidate_skipped_reason_disconnected, flexcache_evict_skipped_reason_config_noent, flexcache_blocks_requested_from_client, flexcache_miss_percent]
  • rw: [rw_ctx_nfs_rewinds, rw_ctx_qos_flowcontrol, rw_ctx_nfs_giveups, rw_ctx_qos_rewinds, rw_ctx_cifs_giveups, rw_ctx_cifs_rewinds]

rahulguptajss avatar May 09 '24 07:05 rahulguptajss

When ZAPI and REST have different spellings or metric names. Pick one:

  • change REST templates to match ZAPI

  • change ZAPI to match REST, after consulting with community

  • Nvme name changes happen because of counter deprecation. For REST, choose the new names

  • We should add space.soft_limit to REST. Change plugin to publish same as ZAPI

  • Consider consolidating the two tools: bin/harvest generate metrics, bin/harvest doctor diffzapirest and address the issue identify in this issue that neither of the tools handle well today. If needed, encode the known issues into the tool(s) so we track these and don't have to revisit before release

cgrinds avatar May 23 '24 12:05 cgrinds

Here are some additional improvements for CI that we discussed:

Currently, we run all smoke tests for RPM, native, and containers. We could limit the RPM to only the upgrade test, and restrict containers to test Harvest start and stop commands only. We could run the full smoke suite for only native.

rahulguptajss avatar May 30 '24 10:05 rahulguptajss