Verify dispatching behaviour to JetStream
Description
Verify that EPP rejects messages in the following scenario:
- the NATS server works AND
- the JetStream part of NATS does not work
Acceptance
- [ ] reproduce the scenario
- [ ] run tests
Investigate the `context deadline exceeded` message that occurred during the outage.
```
$ nats stream info
nats: error: could not pick a Stream to operate on: context deadline exceeded
```
Potential options to reproduce:
- delete the data folder in the NATS cluster container
TB: 3d
Test infrastructure
- nats version 0.0.30
- nats-server version 2.8.4
- production profile, 3 replicas with file storage (1 GB), InterestPolicy (see the sketch below)
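For reference, a stream with roughly these settings can be created via nats.go as sketched below. This is illustrative only: the connection URL, stream name, subjects, and MaxBytes value are assumptions, not necessarily the exact Kyma defaults.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connection URL, stream name, and subjects are illustrative assumptions.
	nc, err := nats.Connect("nats://eventing-nats.kyma-system.svc.cluster.local:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Roughly mirrors the test setup: 3 replicas, file storage capped at 1 GiB,
	// interest-based retention.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:      "sap",
		Subjects:  []string{"sap.>"},
		Replicas:  3,
		Storage:   nats.FileStorage,
		MaxBytes:  1 * 1024 * 1024 * 1024, // 1 GiB
		Retention: nats.InterestPolicy,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```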
Scenario:
- 2 Sinks with the following values:
  ```yaml
  maxReplicas: 1
  minReplicas: 1
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
    requests:
      cpu: 50m
      memory: 64Mi
  ```
- 2 Subscriptions with 10 filters each
- Sending events in parallel, so that the overall publishing rate was 1100 events per second (faster than the consumers can deliver) → JetStream storage grows (see the publisher sketch below)
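For illustration, a parallel publisher producing this kind of load could look roughly like the sketch below. The EPP endpoint URL, payload, worker count, and duration are assumptions, not the actual test setup.

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"sync"
	"time"
)

func main() {
	// Hypothetical EPP publish endpoint and payload; adjust to the real cluster.
	const url = "http://eventing-event-publisher-proxy.kyma-system/publish"
	payload := []byte(`{"specversion":"1.0","type":"sap.kyma.custom.noapp.order.created.v1","source":"noapp","id":"A2111-1111-1111","data":{"orderCode":"123"}}`)

	client := &http.Client{Timeout: 5 * time.Second}
	deadline := time.Now().Add(40 * time.Minute)

	var wg sync.WaitGroup
	for w := 0; w < 50; w++ { // 50 parallel publishers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				resp, err := client.Post(url, "application/cloudevents+json", bytes.NewReader(payload))
				if err != nil {
					log.Println("publish failed:", err)
					continue
				}
				if resp.StatusCode >= 300 {
					log.Println("publish rejected:", resp.Status)
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
}
```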
Results
After around 30-40 minutes (939 MiB of storage consumed), EPP stopped accepting new events and returned:
{"level":"INFO","timestamp":"2022-08-17T07:23:21Z","logger":"nats-handler","caller":"nats/handler.go:195","message":"Event dispatched","context":{"id":"A2111-1111-1111","source":"noapp","before":"sap.kyma.custom.noapp.order.created.v1","after":"sap.kyma.custom.noapp.order.created.v1","statusCode":500,"duration":0.504081429,"responseBody":"nats: no response from stream"}}
or in short:
```
statusCode: 500
nats: no response from stream
```
The NATS CLI command `nats stream ls` returned:
```
nats: error: could not list streams: JetStream system temporarily unavailable (10008), try --help
```
The last time the CLI was responding, that was the output:
I checked all three replicas:
`eventing-nats-0`: used 100% of storage:
```
Filesystem Size Used Available Use% Mounted on
overlay 48.1G 7.7G 38.4G 17% /
tmpfs 64.0M 0 64.0M 0% /dev
tmpfs 14.7G 0 14.7G 0% /sys/fs/cgroup
/dev/sdg 973.4M 957.4M 0 100% /data
```
logs:
```
[32] 2022/08/16 06:07:41.318092 [INF] 100.64.0.55:54452 - rid:49 - Route connection created
[32] 2022/08/16 06:07:47.393696 [INF] JetStream cluster new metadata leader: eventing-nats-1/eventing-nats
[32] 2022/08/16 06:07:55.759515 [INF] 100.64.0.55:6222 - rid:71 - Route connection created
[32] 2022/08/16 06:07:55.759789 [INF] 100.64.0.55:6222 - rid:71 - Router connection closed: Duplicate Route
[32] 2022/08/16 06:07:58.934800 [INF] 100.64.0.56:33360 - rid:72 - Route connection created
[32] 2022/08/16 06:08:09.207150 [WRN] JetStream cluster consumer '$G > sap > 7dc03259c404acadba157faf767ecf83' has NO quorum, stalled.
[32] 2022/08/16 06:08:40.976160 [INF] 100.64.0.56:6222 - rid:74 - Route connection created
[32] 2022/08/16 06:08:40.976390 [INF] 100.64.0.56:6222 - rid:74 - Router connection closed: Duplicate Route
[32] 2022/08/16 07:58:31.069085 [ERR] JetStream failed to store a msg on stream '$G > sap': write /data/jetstream/$G/streams/sap/msgs/381.blk: no space left on device
[32] 2022/08/16 07:58:31.069109 [WRN] Error applying entries to '$G > sap': write /data/jetstream/$G/streams/sap/msgs/381.blk: no space left on device
[32] 2022/08/16 07:58:31.069120 [ERR] JetStream out of File resources, will be DISABLED
[32] 2022/08/16 07:58:31.070137 [INF] Initiating JetStream Shutdown...
[32] 2022/08/16 07:58:31.086371 [INF] JetStream Shutdown
[32] 2022/08/16 11:59:27.307609 [INF] 100.64.0.55:54452 - rid:49 - Router connection closed: Client Closed
```
`eventing-nats-1` (leader): used 96% of storage:
```
/ # df -h
Filesystem Size Used Available Use% Mounted on
overlay 48.1G 7.7G 38.4G 17% /
tmpfs 64.0M 0 64.0M 0% /dev
tmpfs 14.7G 0 14.7G 0% /sys/fs/cgroup
/dev/sdj 973.4M 914.7M 42.8M 96% /data
```
logs:
```
[32] 2022/08/16 13:23:47.331671 [INF] JetStream cluster no metadata leader
[32] 2022/08/16 13:23:48.351057 [WRN] JetStream cluster consumer '$G > sap > d4bc1887b4ec2f3950a463970e636a90' has NO quorum, stalled.
[32] 2022/08/16 13:23:48.435390 [WRN] JetStream cluster consumer '$G > sap > c74c20756af53b592f87edebff67bdf8' has NO quorum, stalled.
[32] 2022/08/16 13:23:49.609318 [WRN] JetStream cluster consumer '$G > sap > cd5f48da5ce5a301f47665227286d69c' has NO quorum, stalled.
[32] 2022/08/16 13:23:50.568794 [WRN] JetStream cluster consumer '$G > sap > a8abb8e8322b06ac9922a1222341de71' has NO quorum, stalled.
[32] 2022/08/16 13:23:51.513825 [WRN] JetStream cluster consumer '$G > sap > d0a08552e41db68d73d54f2e19c4c118' has NO quorum, stalled
```
`eventing-nats-2`: used 100% of storage:
```
Filesystem Size Used Available Use% Mounted on
overlay 48.1G 7.7G 38.4G 17% /
tmpfs 64.0M 0 64.0M 0% /dev
tmpfs 14.7G 0 14.7G 0% /sys/fs/cgroup
/dev/sdj 973.4M 914.7M 42.8M 96% /data
```
logs:
```
[32] 2022/08/16 13:23:47.331671 [INF] JetStream cluster no metadata leader
[32] 2022/08/16 13:23:48.351057 [WRN] JetStream cluster consumer '$G > sap > d4bc1887b4ec2f3950a463970e636a90' has NO quorum, stalled.
[32] 2022/08/16 13:23:48.435390 [WRN] JetStream cluster consumer '$G > sap > c74c20756af53b592f87edebff67bdf8' has NO quorum, stalled.
[32] 2022/08/16 13:23:49.609318 [WRN] JetStream cluster consumer '$G > sap > cd5f48da5ce5a301f47665227286d69c' has NO quorum, stalled.
[32] 2022/08/16 13:23:50.568794 [WRN] JetStream cluster consumer '$G > sap > a8abb8e8322b06ac9922a1222341de71' has NO quorum, stalled.
[32] 2022/08/16 13:23:51.513825 [WRN] JetStream cluster consumer '$G > sap > d0a08552e41db68d73d54f2e19c4c118' has NO quorum, stalled
```
So the problem was indeed the full storage, even though the NATS SDK reported `no response from stream` instead of `no space left on device`.
NATS Health checks were ok:
```
{
  "status": "ok"
}
```
but JetStream was disabled:
```
{
"server_id": "NAB3IMUKTXEIZ7QG3CRHYE3KXXZ3RFXHNWVZZ55DQ4P6ZIZTIIFRNAVB",
"now": "2022-08-16T13:26:58.286323634Z",
"disabled": **true**,
"config": {
"max_memory": 0,
"max_storage": 0
},
"memory": 0,
"storage": 0,
"reserved_memory": 0,
"reserved_storage": 0,
"accounts": 0,
"ha_assets": 0,
"api": {
"total": 0,
"errors": 0
},
"streams": 0,
"consumers": 0,
"messages": 0,
"bytes": 0
}
```
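This mismatch (server health check OK, JetStream disabled) can be detected programmatically. A minimal sketch, assuming the NATS monitoring port (8222) is reachable and using nats.go on the client side; the URLs are illustrative:

```go
package main

import (
	"encoding/json"
	"errors"
	"log"
	"net/http"

	"github.com/nats-io/nats.go"
)

func main() {
	// The NATS monitoring endpoint /jsz reports whether JetStream is disabled,
	// even while /healthz still returns {"status": "ok"}.
	resp, err := http.Get("http://eventing-nats.kyma-system:8222/jsz")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var jsz struct {
		Disabled bool `json:"disabled"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&jsz); err != nil {
		log.Fatal(err)
	}
	log.Println("JetStream disabled:", jsz.Disabled)

	// From the client side, AccountInfo fails when JetStream is unavailable.
	nc, err := nats.Connect("nats://eventing-nats.kyma-system:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := js.AccountInfo(); err != nil {
		if errors.Is(err, nats.ErrJetStreamNotEnabled) {
			log.Println("JetStream is not enabled on the connected server")
		} else {
			log.Println("JetStream account info error:", err)
		}
	}
}
```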
Both Subscriptions were in the false state because of:
```yaml
conditions:
- lastTransitionTime: "2022-08-16T09:10:19Z"
  message: 'nats: JetStream system temporarily unavailable'
  reason: NATS Subscription not active
  status: "False"
  type: Subscription active
```
and the stream reported the following cluster and state information:
```
Name: eventing-nats
Leader: #
Replica: eventing-nats-0, outdated, seen 7m5s ago, 1,851,676 operations behind
Replica: eventing-nats-1, outdated, seen 3h8m7s ago, 1,851,676 operations behind
Replica: eventing-nats-2, outdated, seen 5m44s ago, 1,851,676 operations behind
State:
Messages: 1,436,508
Bytes: 941 MiB
FirstSeq: 29,997 @ 2022-08-17T12:48:52 UTC
LastSeq: 1,851,674 @ 2022-08-17T14:59:25 UTC
Deleted Messages: 385170
Active Consumers: 20
```
We see that the replicas are outdated, no leader is elected, and there are over 1,400,000 undelivered messages.
Here is what the dashboard displayed:
Trying to revive the JetStream
One way to revive JetStream is to restart the nodes. I deleted all three pods sequentially; JetStream then became enabled again and started retrying to dispatch the messages from the stream. To test the redelivery, I turned on the sinks. One thing I observed is that sinks with their current resources
```yaml
maxReplicas: 1
minReplicas: 1
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 50m
    memory: 64Mi
```
cannot keep up with a redelivery rate of 1000 requests per second. The sinks kept crashing under the load and were constantly restarting, so they were the bottleneck during redelivery:
Eventing-controller logs:
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"ae34bd3a-b3b1-48d9-a1b7-38d59b80638b","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v030","sink":"http://sink-0.tunas.svc.cluster.local"}}
{"level":"INFO","timestamp":"2022-08-17T08:41:02Z","logger":"nats-handler","caller":"nats/handler.go:195","message":"Event dispatched","context":{"id":"1c9fe2ee-b070-4338-8a25-23cd95df121b","source":"kyma","before":"sap.kyma.custom.commerce.order.updated.v020","after":"sap.kyma.custom.commerce.order.updated.v020","statusCode":500,"duration":5.000078098,"responseBody":"context deadline exceeded"}}
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"36335f7f-2882-46f7-888a-f4dac2ce6cb9","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v050","sink":"http://sink-0.tunas.svc.cluster.local"}}
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"91cb4e9a-0a38-491a-914c-1b664d5118ba","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v050","sink":"http://sink-0.tunas.svc.cluster.local"}}
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"e56ec64b-bb4d-4413-97a0-4c5a8bcb1efc","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v050","sink":"http://sink-0.tunas.svc.cluster.local"}}
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"8eddd144-49b6-4077-b5dc-e7c5a33118ca","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v080","sink":"http://sink-0.tunas.svc.cluster.local"}}
Some events got dispatched, but at a very slow rate, because the sink started getting:
```
Warning Unhealthy 57m (x2 over 57m) kubelet Startup probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 56m (x4 over 56m) kubelet Liveness probe failed: Get "http://100.64.0.86:15020/app-health/function/livez": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 17m (x98 over 56m) kubelet Readiness probe failed: Get "http://100.64.0.86:15020/app-health/function/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```
Scaling up the sinks to:
```yaml
maxReplicas: 5
minReplicas: 1
resources:
  limits:
    cpu: 2000m
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 128Mi
```
did the trick, and the sinks were able to receive the events without any loss.
Conclusions
- When the PVC storage gets full, JetStream gets disabled. NATS doesn't expose the right error message: instead of propagating the `no space left on device` error, NATS returns `no response from stream`, which causes confusion. As per the NATS core maintainers, this is the expected behaviour (reference).
- When JetStream is disabled, no events can be published to the stream, be it:
{"level":"INFO","timestamp":"2022-08-17T07:23:21Z","logger":"nats-handler","caller":"nats/handler.go:195","message":"Event dispatched","context":{"id":"A2111-1111-1111","source":"noapp","before":"sap.kyma.custom.noapp.order.created.v1","after":"sap.kyma.custom.noapp.order.created.v1","statusCode":500,"duration":0.504081429,"responseBody":"nats: no response from stream"}}
or
{"level":"INFO","timestamp":"2022-08-17T08:41:03Z","logger":"nats-handler","caller":"nats/handler.go:195","message":"Event dispatched","context":{"id":"c7fa3968-2610-4a40-baf9-95e8c57f731a","source":"kyma","before":"sap.kyma.custom.commerce.order.updated.v030","after":"sap.kyma.custom.commerce.order.updated.v030","statusCode":500,"duration":5.00035585,"responseBody":"context deadline exceeded"}}
According to the current EPP code:
```go
_, err = jsCtx.PublishMsg(msg)
if err != nil {
	...
	return http.StatusInternalServerError, err
}
```
the error gets propagated, and no further messages are accepted into the stream.
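As a sketch of the direction the tasks below point to, the publish error could be inspected with the errors package and logged precisely, instead of emitting a misleading `Event dispatched` entry with a 500. This is illustrative only and not the actual EPP implementation; the package name, status-code mapping, and log messages are assumptions.

```go
package handler // illustrative package, not the actual EPP code

import (
	"context"
	"errors"
	"log"
	"net/http"

	"github.com/nats-io/nats.go"
)

// publish mirrors the snippet above, but inspects the error before answering.
func publish(jsCtx nats.JetStreamContext, msg *nats.Msg) (int, error) {
	_, err := jsCtx.PublishMsg(msg)
	if err == nil {
		return http.StatusNoContent, nil
	}

	switch {
	case errors.Is(err, nats.ErrNoStreamResponse):
		// Seen when the stream cannot accept messages, e.g. because JetStream
		// got disabled after the file storage filled up.
		log.Printf("event not accepted, no response from stream (JetStream possibly disabled): %v", err)
	case errors.Is(err, context.DeadlineExceeded):
		log.Printf("event not accepted, publish timed out: %v", err)
	default:
		log.Printf("event not accepted: %v", err)
	}
	return http.StatusInternalServerError, err
}
```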
Tasks:
- [ ] Change the EPP logs so that when the backend returns a 500, no `Event dispatched` message is displayed.
- [ ] NATS is currently revising (reference) their error handling in nats.go to make API errors inspectable via the errors package. Let's check with them about that and introduce better error handling/propagation here. https://github.com/kyma-project/kyma/issues/15150
- [ ] Create a troubleshooting guide on how to handle the case when JetStream gets disabled, be it due to full storage or `context deadline exceeded`. https://github.com/kyma-project/kyma/issues/15210
- [ ] Currently the Grafana dashboard doesn't show that storage usage reached 100%, even though JetStream gets disabled because no storage is left. Let's try to show the most accurate possible usage percentage in the dashboard. https://github.com/kyma-project/kyma/issues/15211
Follow-up questions:
- We have a `maxReconnects` value in our controller, which defines how many times we try to reconnect to the backend. What if the NATS backend gets disabled for some reason, the controller cannot connect to NATS after the configured retries and stops trying, and then the NATS backend becomes reachable again, but the controller does not reconnect? Should we restart the eventing-controller in that case? (See the sketch below.)
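To make the question concrete, here is a sketch of the relevant nats.go connection options (the URL and values are illustrative, not the eventing-controller's actual settings). With a finite MaxReconnects, the client closes the connection for good once the retries are exhausted and the ClosedHandler fires, so it will not reconnect on its own; nats.MaxReconnects(-1) retries forever.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(
		"nats://eventing-nats.kyma-system:4222", // illustrative URL
		nats.MaxReconnects(10),            // finite: the client gives up after 10 attempts
		nats.ReconnectWait(3*time.Second), // pause between attempts
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("disconnected from NATS: %v", err)
		}),
		nats.ReconnectHandler(func(c *nats.Conn) {
			log.Printf("reconnected to NATS at %s", c.ConnectedUrl())
		}),
		nats.ClosedHandler(func(_ *nats.Conn) {
			// Fires once MaxReconnects is exhausted; after this the client will
			// not reconnect on its own, so the controller has to rebuild the
			// connection (or be restarted).
			log.Print("NATS connection closed permanently")
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// ... controller logic would go here ...
}
```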