Harvest should improve CI to detect missing metrics
Recent example #2819
- [ ] Run the harvest REST/ZAPI diff CLI and analyze result
Consider catching examples like #2816
Below is the summary for Rest/Zapi gaps
Most of these are known.
Missing Metrics by Object in Prometheus
- flexcache: [flexcache_reconciled_data_entries, flexcache_evict_skipped_reason_offline, flexcache_blocks_retrieved_from_origin, flexcache_reconciled_lock_entries, flexcache_nix_skipped_reason_in_progress, flexcache_nix_skipped_reason_disconnected, flexcache_nix_retry_skipped_reason_initiator_retrieve, flexcache_evict_skipped_reason_disconnected, flexcache_nix_skipped_reason_offline, flexcache_nix_skipped_reason_config_noent, flexcache_evict_rw_cache_skipped_reason_disconnected, flexcache_invalidate_skipped_reason_config_noent, flexcache_invalidate_skipped_reason_disconnected, flexcache_evict_skipped_reason_config_noent, flexcache_invalidate_skipped_reason_offline, flexcache_blocks_requested_from_client, flexcache_miss_percent]
Reason: Missing flexcache template in Rest
- iscsi: [iscsi_lif_cmd_transfered]
Reason: Rest publishes with correct spelling iscsi_lif_cmd_transferred
- rw: [rw_ctx_nfs_rewinds, rw_ctx_qos_flowcontrol, rw_ctx_nfs_giveups, rw_ctx_qos_rewinds, rw_ctx_cifs_giveups, rw_ctx_cifs_rewinds]
Reason: Missing template in Rest
- qtree: [qtree_threshold]
Reason: Not available in Rest. The REST API provides space.soft_limit which has similar functionality.
- wafl: [wafl_reads_from_pmem]
Reason: Not available in Rest
- security: [security_audit_destination_port]
Reason: Not available in Rest
- aggr: [aggr_hybrid_cache_size_total, aggr_snapshot_inode_used_percent, aggr_space_reserved]
Reason: Not available in Rest
- node: [node_nvme_fc_data_sent, node_nvme_fc_ops, node_nvme_fc_data_recv]
Reason We are publishing these in Zapi and Rest as node_nvmf_data_sent,node_nvmf_ops and node_nvmf_data_recv. But these are old names. Should we use new nvme names?
- quota: [quota_threshold]
Reason: Not available in Rest. The REST API provides space.soft_limit which has similar functionality.
Missing Metrics from Dashboard
- aggr: [aggr_snapshot_inode_used_percent]
- flexcache: [flexcache_reconciled_data_entries, flexcache_evict_skipped_reason_offline, flexcache_blocks_retrieved_from_origin, flexcache_reconciled_lock_entries, flexcache_nix_skipped_reason_in_progress, flexcache_nix_skipped_reason_disconnected, flexcache_nix_retry_skipped_reason_initiator_retrieve, flexcache_evict_skipped_reason_disconnected, flexcache_nix_skipped_reason_offline, flexcache_nix_skipped_reason_config_noent, flexcache_evict_rw_cache_skipped_reason_disconnected, flexcache_invalidate_skipped_reason_config_noent, flexcache_invalidate_skipped_reason_disconnected, flexcache_evict_skipped_reason_config_noent, flexcache_blocks_requested_from_client, flexcache_miss_percent]
- rw: [rw_ctx_nfs_rewinds, rw_ctx_qos_flowcontrol, rw_ctx_nfs_giveups, rw_ctx_qos_rewinds, rw_ctx_cifs_giveups, rw_ctx_cifs_rewinds]
When ZAPI and REST have different spellings or metric names. Pick one:
-
change REST templates to match ZAPI
-
change ZAPI to match REST, after consulting with community
-
Nvme name changes happen because of counter deprecation. For REST, choose the new names
-
We should add space.soft_limit to REST. Change plugin to publish same as ZAPI
-
Consider consolidating the two tools:
bin/harvest generate metrics,bin/harvest doctor diffzapirestand address the issue identify in this issue that neither of the tools handle well today. If needed, encode the known issues into the tool(s) so we track these and don't have to revisit before release
Here are some additional improvements for CI that we discussed:
Currently, we run all smoke tests for RPM, native, and containers. We could limit the RPM to only the upgrade test, and restrict containers to test Harvest start and stop commands only. We could run the full smoke suite for only native.