
JetStream isbsvc healthiness metric

Open KeranYang opened this issue 5 months ago • 0 comments

Summary

As an on-call engineer, I want to be alerted immediately when my JetStream ISB service becomes unhealthy. The goal is to detect JetStream issues early and prevent larger impact, such as a vertex, pipeline, or isbsvc going down, or data loss.

How

We need to define the criteria for a platform-side JetStream issue. The data can come from existing numaflow metrics, K8s metrics, logs, etc. We can define a list of metrics that indicate JetStream is in a bad state, then define a single metric in our dashboard that aggregates all of them and set an alert on that metric.
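
To make the aggregation idea concrete, here is a minimal Python sketch, not an existing numaflow feature. The signal names and thresholds are hypothetical placeholders; the real list would come from whichever numaflow metrics, K8s metrics, or log-derived counters we agree on. It folds several per-signal checks into one 0/1 health value that an alert could be set on.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Signal:
    """One input to the aggregated health metric."""
    name: str
    is_unhealthy: Callable[[float], bool]


# Hypothetical signal names and thresholds, for illustration only.
SIGNALS: List[Signal] = [
    Signal("jetstream_unavailable_errors_per_min", lambda v: v > 0),
    Signal("isbsvc_pods_not_ready", lambda v: v > 0),
    Signal("wal_store_errors_per_min", lambda v: v > 0),
]


def unhealthy_signals(samples: Dict[str, float]) -> List[str]:
    """Return the names of all signals whose latest sample looks unhealthy.

    Signals missing from `samples` are treated as not firing.
    """
    return [
        s.name
        for s in SIGNALS
        if s.name in samples and s.is_unhealthy(samples[s.name])
    ]


def isbsvc_health(samples: Dict[str, float]) -> int:
    """Aggregate metric: 1 means healthy, 0 means at least one signal fired."""
    return 0 if unhealthy_signals(samples) else 1
```

With this shape, the alert rule only needs to watch the single aggregated value, while `unhealthy_signals` tells the on-call engineer which inputs caused it to flip.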

Context

The numalogic team got the following error:

{
   bufferWriter: ***-***-0
   caller: jetstream/writer.go:99
   error: nats: JetStream system temporarily unavailable
   level: error
   logger: numaflow.ReduceUDF-processor
   msg: Failed to get consumer info in the writer
   partitionIdx: 0
   pipeline: ***-***-pl
   stacktrace: github.com/numaproj/numaflow/pkg/isb/stores/jetstream.(*jetStreamWriter).runStatusChecker.func1
	/home/runner/work/numaflow/numaflow/pkg/isb/stores/jetstream/writer.go:99
github.com/numaproj/numaflow/pkg/isb/stores/jetstream.(*jetStreamWriter).runStatusChecker
	/home/runner/work/numaflow/numaflow/pkg/isb/stores/jetstream/writer.go:134
   stream: ***-0
   subject: ***-0
   ts: 2024-08-30T20:57:29.815935961Z
   vertex: window
}

The error lasted for 7 hours and left isbsvc in a bad state, which caused an issue during a node rotation. After the node rotation, we started seeing the error "Error storing entry to WAL: raft: could not store entry to WAL", which eventually stopped the pipeline from making progress. Had our on-call been alerted on this error, we could have taken action to resolve it and prevent the issue.

"JetStream system temporarily unavailable" is just one example of unhealthiness. There can be others, such as the daemon server getting a nil pointer from the kv store.
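
If logs end up being one of the data sources, the known failure messages could be turned into countable signals. The Python sketch below is only an illustration under that assumption: the two patterns are the error strings quoted in this issue, while the label names and the `error`/`msg` field layout (taken from the record above) are choices made for the example.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Patterns are the two error strings quoted in this issue.
# The label names on the left are hypothetical.
PATTERNS = {
    "jetstream_unavailable": re.compile(r"JetStream system temporarily unavailable"),
    "raft_wal_store_failure": re.compile(r"could not store entry to WAL"),
}


def classify(record: Dict[str, object]) -> List[str]:
    """Return the labels of every known pattern found in a log record."""
    text = " ".join(str(record.get(key, "")) for key in ("error", "msg"))
    return [label for label, pattern in PATTERNS.items() if pattern.search(text)]


def count_signals(records: Iterable[Dict[str, object]]) -> Counter:
    """Count how many records matched each label."""
    counts: Counter = Counter()
    for record in records:
        counts.update(classify(record))
    return counts
```

New kinds of unhealthiness, such as the nil pointer case, would be added as further entries in `PATTERNS` once their exact log text is known.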


Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

KeranYang, Sep 05 '24 20:09