
Member Node Status Dashboard

Open vchendrix opened this issue 6 years ago • 3 comments

This is low priority and needs discussion.

Develop audit dashboard (low priority): design and develop a dashboard UI that displays the audit status as of the last run, ideally with extra detail for objects in an invalid state.

This could also include the status of object synchronization, replication, indexing, etc.

vchendrix avatar Jul 11 '19 18:07 vchendrix

This needs some terminology changes, but it gives an idea of what we are going for:

ESS-DIVE Data Audit Report

Includes published data objects created between 2018-01-01 00:00:00 and 2019-08-16 00:00:00. 

This audit report is produced on a monthly basis, or as-needed, to report the overall health of the ESS-DIVE repository. The source of data for this report is the output of continuous processes that replicate and independently verify the intactness of data objects at each ESS-DIVE repository instance. 

Legend:
HEALTHY: Verified intact on primary and all replica instances.
DEGRADED: Verified intact on primary, but verified corrupt (or unable to verify) on at least one replica instance; expected to self-heal under normal conditions.
AT RISK: Verified corrupt (or unable to verify) on primary and/or all replica instances; intervention needed to restore data objects from offline (backup) copies.

********************************************************

Summary:

Data objects with status HEALTHY:    5851  (96.39%)
Data objects with status DEGRADED:     14   (0.23%)
Data objects with status AT RISK:     205   (3.38%)
----------------------------------------------------
Total data objects:                  6070 (100.00%)
********************************************************

Detail on DEGRADED data objects:

Replication partially complete (queued) (14)
Replication partially complete (failed) (0)
Checksum validation partially complete (queued) (0)
Checksum validation partially complete (failed) (0)
********************************************************

Detail on AT RISK data objects:

Replication incomplete; NO replicas exist (queued) (21)
Replication incomplete; NO replicas exist (failed) (0)
Checksum validation incomplete; NO replicas verified (queued) (184)
Checksum validation incomplete; NO replicas verified (failed) (0)
Checksum validation incomplete; source corrupt (0)
Auditing data incomplete; NO replicas verified (0)
********************************************************
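
To make the legend concrete, here is a minimal Python sketch of how the three statuses and the summary percentages could be derived. It is not Hesham's report code; `Verification`, `classify`, and `summarize` are hypothetical names, and the sketch assumes each object has one verification result for the primary and one per replica instance.

```python
from collections import Counter
from enum import Enum


class Verification(Enum):
    """Hypothetical per-instance verification result."""
    INTACT = "intact"
    CORRUPT = "corrupt"
    UNKNOWN = "unknown"  # unable to verify


def classify(primary, replicas):
    """Map verification results to a legend status.

    `primary` is the result for the primary instance; `replicas` is a
    list with one result per replica instance (empty if none exist).
    """
    primary_ok = primary is Verification.INTACT
    intact_replicas = [r for r in replicas if r is Verification.INTACT]
    # AT RISK: primary not verified intact, and/or no replica verified intact
    # (this also covers the "NO replicas exist" case).
    if not primary_ok or not intact_replicas:
        return "AT RISK"
    # DEGRADED: primary intact, but at least one replica is not.
    if len(intact_replicas) < len(replicas):
        return "DEGRADED"
    return "HEALTHY"


def summarize(statuses):
    """Return {status: (count, percent of total)} for the summary block."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return {
        name: (counts[name], round(100 * counts[name] / total, 2) if total else 0.0)
        for name in ("HEALTHY", "DEGRADED", "AT RISK")
    }
```

Fed the counts from the sample report above (5851, 14, and 205 objects), `summarize` reproduces the 96.39 / 0.23 / 3.38 percentages.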

vchendrix avatar Sep 05 '19 18:09 vchendrix

Val and Cory and I discussed this a bit, and decided we need to consider a few components:

  • A server-side API to provide aggregated statistics; the metrics-service may be an appropriate home for it. Consider folding Hesham's report code into the metrics-service as another type of metric to be queried.
  • Wireframes of a MetacatUI view that would probably be an addition to the MN profile page, available to MN operator subjects.
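
As a starting point for discussion, the aggregated statistics might be shaped roughly like the Python sketch below. This is not the metrics-service API; `build_audit_payload` and every field name (`period`, `total`, `byStatus`, `detail`) are assumptions made for illustration only.

```python
import json


def build_audit_payload(period_start, period_end, healthy_count,
                        degraded_detail, at_risk_detail):
    """Assemble a JSON-serializable audit summary (hypothetical shape).

    `degraded_detail` and `at_risk_detail` map a reason string to the
    number of objects in that state, as in the report's detail sections.
    """
    by_status = {
        "HEALTHY": healthy_count,
        "DEGRADED": sum(degraded_detail.values()),
        "AT RISK": sum(at_risk_detail.values()),
    }
    return {
        "period": {"start": period_start, "end": period_end},
        "total": sum(by_status.values()),
        "byStatus": by_status,
        "detail": {"DEGRADED": degraded_detail, "AT RISK": at_risk_detail},
    }


# Example using the non-zero rows from the sample report earlier in this thread.
payload = build_audit_payload(
    "2018-01-01T00:00:00",
    "2019-08-16T00:00:00",
    healthy_count=5851,
    degraded_detail={"Replication partially complete (queued)": 14},
    at_risk_detail={
        "Replication incomplete; NO replicas exist (queued)": 21,
        "Checksum validation incomplete; NO replicas verified (queued)": 184,
    },
)
print(json.dumps(payload, indent=2))
```

Keeping the per-reason counts alongside the status totals would let a MetacatUI view render both the summary and the detail sections from a single response.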

csjx avatar Sep 05 '19 18:09 csjx

For reference, some related info that can be grabbed with existing monitoring:

https://monitor.dataone.org/status/
https://cn.dataone.org/processing_metrics.txt

The second is really a proof of concept but shows some valuable queue information.

mbjones avatar Sep 05 '19 20:09 mbjones