
JetStream is disabled on all NATS cluster nodes when the storage is full

Open marcobebway opened this issue 2 years ago • 3 comments

Description

JetStream is disabled on all NATS cluster nodes when the storage is full.

Steps to reproduce

  • Provision Kyma with evaluation or production profiles.
  • Start a high-rate load test with more than 1.5K events per second.
  • Wait until the storage for NATS servers is full.
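
For anyone reproducing this without the original load tester, the steps above can be approximated with a small paced publisher. This is a minimal sketch, not the tool used in this issue: `runLoad` and the `publish` callback are hypothetical names, and the callback stands in for whatever actually sends an event to the Eventing publisher proxy.

```go
package main

import "time"

// runLoad calls publish up to total times, paced by a ticker at roughly
// ratePerSec calls per second. publish is a hypothetical hook that would
// send one event; closing stop aborts the run early.
// If publish is slower than the tick interval, the effective rate drops,
// because time.Ticker discards ticks for slow receivers.
func runLoad(ratePerSec, total int, stop <-chan struct{}, publish func(seq int) error) (sent, failed int) {
	if ratePerSec <= 0 || total <= 0 || publish == nil {
		return 0, 0
	}
	interval := time.Second / time.Duration(ratePerSec)
	if interval <= 0 {
		// Guard against rates above one call per nanosecond.
		interval = time.Nanosecond
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for seq := 0; seq < total; seq++ {
		select {
		case <-stop:
			return sent, failed
		case <-ticker.C:
		}
		if err := publish(seq); err != nil {
			failed++
			continue
		}
		sent++
	}
	return sent, failed
}
```

Calling `runLoad(1500, 1_000_000, nil, publish)` would aim for about 1.5K calls per second. A single goroutine may not sustain that rate against a real endpoint, so several such loops might be needed in practice.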

Side effects

  • The Eventing publisher proxy fails to publish events:

    {
        "caller":"nats/handler.go:195",
        "context":{
           "after":"sap.kyma.custom.commerce.order.updated.v095",
           "before":"sap.kyma.custom.commerce.order.updated.v095",
           "duration":5.001188431,
           "id":"3b213ed6-918f-4229-af71-1b7862e2419e",
           "responseBody":"context deadline exceeded",
           "source":"kyma",
           "statusCode":500
        },
        "logger":"nats-handler",
        "message":"Event dispatched",
        "timestamp":"2022-08-18T14:28:18Z"
     }
    
  • The Eventing controller fails to reconcile Kyma subscriptions:

    {
        "caller":"controller/controller.go:326",
        "context":{
           "controller":"jetstream-subscription-reconciler",
           "error":"context deadline exceeded",
           "name":"subscription-1",
           "namespace":"tunas",
           "object":{
              "name":"subscription-1",
              "namespace":"tunas"
           },
           "reconcileID":"13f1611a-dbf1-4bc7-8ce3-f21fe6d0bea9"
        },
        "message":"Reconciler error",
        "timestamp":"2022-08-18T19:09:12Z"
     }
    
  • The Eventing dispatcher fails to deliver messages:

    {
        "log":"nats: consumer not active on connection [10] for subscription on \"kyma.sap.kyma.custom.commerce.order.created.v100\"",
        "time":"2022-08-18T19:09:44.970147465Z"
    }
    

marcobebway avatar Aug 22 '22 16:08 marcobebway

@marcobebway Do you have a script for the high-rate load test? That would make the issue as easy as possible to reproduce for whoever works on the ticket :)

nachtmaar avatar Aug 23 '22 10:08 nachtmaar

I can confirm that we can reproduce this issue using the load tester from @marcobebway.

nachtmaar avatar Sep 30 '22 11:09 nachtmaar

Prerequisite:

  • https://github.com/kyma-project/kyma/issues/15751 - if the discard/maxbytes stream configuration solves this problem, we can close this issue.
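
To make the discard/maxbytes idea concrete, here is a rough sketch of a size-bounded stream configuration, written as a local Go type that marshals to JSON. The type `streamLimits`, the stream name, the subject filter, and the 1 GiB limit are all assumptions for illustration. The JSON field names are meant to mirror the JetStream stream configuration and should be checked against the NATS documentation. The choice of `"new"` (reject new messages once the limit is hit) over `"old"` (drop the oldest messages) is also an assumption, not something decided in this thread.

```go
package main

import "encoding/json"

// streamLimits is a hypothetical local type holding only the stream
// settings relevant to this issue. It is not the nats.go StreamConfig.
type streamLimits struct {
	Name     string   `json:"name"`
	Subjects []string `json:"subjects"`
	Storage  string   `json:"storage"`   // "file" or "memory"
	MaxBytes int64    `json:"max_bytes"` // upper bound on stored message bytes
	Discard  string   `json:"discard"`   // "old" or "new"
}

// boundedStream returns a file-backed configuration that caps the stream
// at maxBytes and rejects new messages once the cap is reached.
func boundedStream(name string, subjects []string, maxBytes int64) streamLimits {
	return streamLimits{
		Name:     name,
		Subjects: subjects,
		Storage:  "file",
		MaxBytes: maxBytes,
		Discard:  "new",
	}
}

// JSON renders the configuration as indented JSON.
func (s streamLimits) JSON() (string, error) {
	b, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

For example, `boundedStream("example-stream", []string{"kyma.>"}, 1<<30)` describes a stream capped at 1 GiB. The intent would be to keep the cap below the size of the volume so that the stream reaches its own limit before the disk fills up.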

Otherwise, we came up with the following tasks:

  • [ ] Introduce storage alerts after a certain threshold, reference
  • [ ] Actively communicate this bug to NATS (by opening a bug report or using official communication channels)
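
The storage alert task above boils down to comparing a usage ratio against a threshold. A minimal sketch, assuming the used and total bytes of the NATS storage volume are already available from some metrics source. `storageAlert` and the 0.8 threshold in the usage note are hypothetical, not values agreed on in this issue.

```go
package main

import "errors"

// storageAlert reports whether usedBytes has reached the given fraction
// (threshold, in the range (0, 1]) of capacityBytes.
func storageAlert(usedBytes, capacityBytes uint64, threshold float64) (bool, error) {
	if capacityBytes == 0 {
		return false, errors.New("capacity must be greater than zero")
	}
	if threshold <= 0 || threshold > 1 {
		return false, errors.New("threshold must be in the range (0, 1]")
	}
	ratio := float64(usedBytes) / float64(capacityBytes)
	return ratio >= threshold, nil
}
```

With a threshold of 0.8, `storageAlert(850, 1000, 0.8)` would fire, while `storageAlert(100, 1000, 0.8)` would not. In a real setup this check would more likely live in an alerting rule than in application code.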

vpaskar avatar Oct 14 '22 13:10 vpaskar

Closing due to #15998.

k15r avatar Nov 03 '22 09:11 k15r