numaflow
JetStream isbsvc healthiness metric
Summary
As an on-call engineer, I want to be alerted immediately when my JetStream ISB service becomes unhealthy. The goal is to detect JetStream issues early, before they escalate into a larger impact such as a vertex, pipeline, or isbsvc outage, or data loss.
How
We need to define the criteria for a platform-side JetStream issue. The data can come from existing numaflow metrics, K8s metrics, logs, etc. We can define a list of metrics that indicate JetStream is in a bad state, then define a single dashboard metric that aggregates all of them and set an alert on that metric.
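To make the "single aggregated metric" idea concrete, here is a minimal sketch in Go. It is only an illustration under stated assumptions: the `Signal` type, the `AggregateUnhealthy` function, and the signal names are hypothetical and are not existing numaflow APIs or metrics. Each signal carries a value and a threshold, and the aggregate is 1 if any signal is firing, otherwise 0.

```go
package main

import "fmt"

// Signal is a hypothetical representation of one unhealthiness indicator,
// e.g. an error count taken from metrics or logs over an evaluation window.
type Signal struct {
	Name      string
	Value     float64 // observed value for the window
	Threshold float64 // the signal fires when Value > Threshold
}

// Firing reports whether the signal exceeds its threshold.
func (s Signal) Firing() bool { return s.Value > s.Threshold }

// AggregateUnhealthy collapses all signals into one value suitable for a
// single alert: 1 if at least one signal is firing, otherwise 0. It also
// returns the names of the firing signals to help the on-call triage.
func AggregateUnhealthy(signals []Signal) (float64, []string) {
	var firing []string
	for _, s := range signals {
		if s.Firing() {
			firing = append(firing, s.Name)
		}
	}
	if len(firing) > 0 {
		return 1, firing
	}
	return 0, firing
}

func main() {
	// Signal names below are made up for illustration only.
	signals := []Signal{
		{Name: "jetstream_unavailable_errors", Value: 3, Threshold: 0},
		{Name: "wal_store_errors", Value: 0, Threshold: 0},
	}
	unhealthy, firing := AggregateUnhealthy(signals)
	fmt.Printf("isbsvc_unhealthy=%v firing=%v\n", unhealthy, firing)
}
```

An "any signal firing" aggregate keeps the alert rule simple. A weighted score would be an alternative if some signals turn out to be noisier than others.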
Context
The numalogic team got the following error:
{
bufferWriter: ***-***-0
caller: jetstream/writer.go:99
error: nats: JetStream system temporarily unavailable
level: error
logger: numaflow.ReduceUDF-processor
msg: Failed to get consumer info in the writer
partitionIdx: 0
pipeline: ***-***-pl
stacktrace: github.com/numaproj/numaflow/pkg/isb/stores/jetstream.(*jetStreamWriter).runStatusChecker.func1
/home/runner/work/numaflow/numaflow/pkg/isb/stores/jetstream/writer.go:99
github.com/numaproj/numaflow/pkg/isb/stores/jetstream.(*jetStreamWriter).runStatusChecker
/home/runner/work/numaflow/numaflow/pkg/isb/stores/jetstream/writer.go:134
stream: ***-0
subject: ***-0
ts: 2024-08-30T20:57:29.815935961Z
vertex: window
}
The error lasted for 7 hours and left the isbsvc in a bad state, which caused an issue during a node rotation. After the node rotation, we started seeing the error "Error storing entry to WAL: raft: could not store entry to WAL", which eventually caused the pipeline to stop progressing. If our on-call had been alerted on the original error, we could have taken action to resolve it and prevent the issue.
The "JetStream system temporarily unavailable" error is just one example of unhealthiness. There can be others, such as the daemon server getting a nil pointer from the KV store.
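As a rough sketch of how such errors could be turned into a signal, the Go snippet below matches error messages against a list of known unhealthiness patterns. The two patterns are taken from the errors quoted in this issue. The `unhealthyPatterns` list and the `MatchUnhealthy` function are hypothetical names, and the list would need to grow as other failure modes are identified.

```go
package main

import (
	"fmt"
	"strings"
)

// unhealthyPatterns lists substrings of error messages that are treated as
// platform-side JetStream unhealthiness. Both entries come from the errors
// reported in this issue; the list itself is an assumption, not numaflow code.
var unhealthyPatterns = []string{
	"JetStream system temporarily unavailable",
	"could not store entry to WAL",
}

// MatchUnhealthy returns the first known pattern contained in errMsg and
// true, or an empty string and false if no pattern matches.
func MatchUnhealthy(errMsg string) (string, bool) {
	for _, p := range unhealthyPatterns {
		if strings.Contains(errMsg, p) {
			return p, true
		}
	}
	return "", false
}

func main() {
	msgs := []string{
		"nats: JetStream system temporarily unavailable",
		"Error storing entry to WAL: raft: could not store entry to WAL",
		"context deadline exceeded",
	}
	for _, m := range msgs {
		pattern, matched := MatchUnhealthy(m)
		fmt.Printf("msg=%q unhealthy=%v pattern=%q\n", m, matched, pattern)
	}
}
```

A counter incremented on each match could then feed the aggregated health metric described under "How".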
Message from the maintainers:
If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.