JetStream is disabled on all NATS cluster nodes when the storage is full
Description
JetStream is disabled on all NATS cluster nodes when the storage is full.
Steps to reproduce
- Provision Kyma with the evaluation or production profile.
- Start a high-pace load test with more than 1.5K events per second (see the sketch after this list).
- Wait until the storage of the NATS servers is full.
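For reference, a minimal load-generator sketch in Go. The publisher proxy URL is a hypothetical in-cluster address and the event type is only a placeholder taken from the logs below; the actual reproduction used the load tester mentioned in the comments further down, so this is an illustration, not that tool.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"time"
)

const (
	// Hypothetical in-cluster URL of the Eventing publisher proxy; adjust to your setup.
	publishURL = "http://eventing-event-publisher-proxy.kyma-system/publish"
	targetRate = 1500 // events per second, as in the reproduction step
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	ticker := time.NewTicker(time.Second / targetRate)
	defer ticker.Stop()

	for i := 0; ; i++ {
		<-ticker.C
		go func(n int) {
			// Structured CloudEvent; the type mirrors the one visible in the logs below.
			body := fmt.Sprintf(`{"specversion":"1.0","type":"sap.kyma.custom.commerce.order.updated.v095","source":"kyma","id":"load-test-%d","data":{"orderCode":"%d"}}`, n, n)
			resp, err := client.Post(publishURL, "application/cloudevents+json", bytes.NewBufferString(body))
			if err != nil {
				log.Printf("publish failed: %v", err)
				return
			}
			resp.Body.Close()
			if resp.StatusCode >= 300 {
				log.Printf("unexpected status: %d", resp.StatusCode)
			}
		}(i)
	}
}
```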
Side effects
- The Eventing publisher proxy fails to publish events:
{ "caller":"nats/handler.go:195", "context":{ "after":"sap.kyma.custom.commerce.order.updated.v095", "before":"sap.kyma.custom.commerce.order.updated.v095", "duration":5.001188431, "id":"3b213ed6-918f-4229-af71-1b7862e2419e", "responseBody":"context deadline exceeded", "source":"kyma", "statusCode":500 }, "logger":"nats-handler", "message":"Event dispatched", "timestamp":"2022-08-18T14:28:18Z" }
- The Eventing controller fails to reconcile Kyma subscriptions:
{ "caller":"controller/controller.go:326", "context":{ "controller":"jetstream-subscription-reconciler", "error":"context deadline exceeded", "name":"subscription-1", "namespace":"tunas", "object":{ "name":"subscription-1", "namespace":"tunas" }, "reconcileID":"13f1611a-dbf1-4bc7-8ce3-f21fe6d0bea9" }, "message":"Reconciler error", "timestamp":"2022-08-18T19:09:12Z" }
- The Eventing dispatcher fails to deliver messages:
{ "log":"nats: consumer not active on connection [10] for subscription on \"kyma.sap.kyma.custom.commerce.order.created.v100\"", "time":"2022-08-18T19:09:44.970147465Z" }
@marcobebway Do you have any script for high-pace load testing? Just to make it as easy as possible to reproduce for whoever will be working on the ticket :)
I can confirm that, using the load tester from @marcobebway, we can reproduce this issue.
Pre-requisite:
- https://github.com/kyma-project/kyma/issues/15751 - if using the discard/maxbytes stream configuration solves this problem, we can close this issue (a sketch of such a configuration follows below).
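As an illustration of the discard/maxbytes idea, here is a minimal sketch using the nats.go client. The connection URL, stream name, subjects, and size limit are assumptions for the example, not the configuration Kyma's Eventing controller actually applies.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Hypothetical in-cluster NATS URL; adjust to your environment.
	nc, err := nats.Connect("nats://eventing-nats.kyma-system:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// With Discard: DiscardOld and a MaxBytes limit, JetStream evicts the oldest
	// messages once the limit is reached instead of running the file storage full.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "sap",              // assumed stream name
		Subjects: []string{"kyma.>"}, // assumed subject filter
		Storage:  nats.FileStorage,
		MaxBytes: 700 * 1024 * 1024, // e.g. ~700 MiB, kept below the PVC size
		Discard:  nats.DiscardOld,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```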
Otherwise, we came up with the following tasks:
- [ ] Introduce storage alerts after a certain threshold (reference); a rough sketch follows after this list
- [ ] Actively communicate this bug to NATS (by opening a bug report or using the official communication channels)
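For the storage-alert task, a rough sketch of a polling check against the JetStream account storage usage. In a real setup this would more likely be a Prometheus alert on NATS exporter metrics; the URL, assumed capacity, and threshold below are placeholders.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

const (
	natsURL        = "nats://eventing-nats.kyma-system:4222" // hypothetical in-cluster URL
	storageBytes   = int64(1 << 30)                          // assumed storage capacity (1 GiB)
	alertThreshold = 0.8                                     // warn at 80% usage
)

func main() {
	nc, err := nats.Connect(natsURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	for range time.Tick(30 * time.Second) {
		// AccountInfo reports the bytes currently used by JetStream storage.
		info, err := js.AccountInfo()
		if err != nil {
			log.Printf("account info failed: %v", err)
			continue
		}
		usage := float64(info.Store) / float64(storageBytes)
		if usage >= alertThreshold {
			log.Printf("ALERT: JetStream storage at %.0f%% of the assumed capacity (%d bytes used)", usage*100, info.Store)
		}
	}
}
```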
Closing due to #15998.