Verify dispatching behaviour to JetStream
Description
Verify that EPP rejects messages in the following scenario:
- the NATS server works AND
- the JetStream part of NATS does not work
Acceptance
- [ ] reproduce the scenario
- [ ] run tests
Investigate the `context deadline exceeded` message that occurred during the outage.
```
$ nats stream info
nats: error: could not pick a Stream to operate on: context deadline exceeded
```
Potential options to reproduce:
- delete the data folder in the NATS cluster container
TB: 3d
Test infrastructure
- nats version 0.0.30
- nats-server version 2.8.4
- production profile, 3 replicas with file storage (1 GB), InterestPolicy (see the sketch below)
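For reference, a stream with roughly these settings can be created via nats.go as sketched below. This is illustrative only: the connection URL, stream name, subjects, and MaxBytes value are assumptions, not necessarily the exact Kyma defaults.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connection URL, stream name, and subjects are illustrative assumptions.
	nc, err := nats.Connect("nats://eventing-nats.kyma-system.svc.cluster.local:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Roughly mirrors the test setup: 3 replicas, file storage capped at 1 GiB,
	// interest-based retention.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:      "sap",
		Subjects:  []string{"sap.>"},
		Replicas:  3,
		Storage:   nats.FileStorage,
		MaxBytes:  1 * 1024 * 1024 * 1024, // 1 GiB
		Retention: nats.InterestPolicy,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```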
Scenario:
- 2 Sinks with the following values:
  ```yaml
  maxReplicas: 1
  minReplicas: 1
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
    requests:
      cpu: 50m
      memory: 64Mi
  ```
- 2 Subscriptions with 10 filters each
- Sending events in parallel, so that the overall publishing rate was 1100 events per second (faster than the consumers can deliver) → JetStream storage grows (see the publisher sketch below)
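For illustration, a parallel publisher producing this kind of load could look roughly like the sketch below. The EPP endpoint URL, payload, worker count, and duration are assumptions, not the actual test setup.

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"sync"
	"time"
)

func main() {
	// Hypothetical EPP publish endpoint and payload; adjust to the real cluster.
	const url = "http://eventing-event-publisher-proxy.kyma-system/publish"
	payload := []byte(`{"specversion":"1.0","type":"sap.kyma.custom.noapp.order.created.v1","source":"noapp","id":"A2111-1111-1111","data":{"orderCode":"123"}}`)

	client := &http.Client{Timeout: 5 * time.Second}
	deadline := time.Now().Add(40 * time.Minute)

	var wg sync.WaitGroup
	for w := 0; w < 50; w++ { // 50 parallel publishers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				resp, err := client.Post(url, "application/cloudevents+json", bytes.NewReader(payload))
				if err != nil {
					log.Println("publish failed:", err)
					continue
				}
				if resp.StatusCode >= 300 {
					log.Println("publish rejected:", resp.Status)
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
}
```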
Results
After around 30-40 minutes (939 MiB of storage consumed), EPP stopped accepting new events and returned:
{"level":"INFO","timestamp":"2022-08-17T07:23:21Z","logger":"nats-handler","caller":"nats/handler.go:195","message":"Event dispatched","context":{"id":"A2111-1111-1111","source":"noapp","before":"sap.kyma.custom.noapp.order.created.v1","after":"sap.kyma.custom.noapp.order.created.v1","statusCode":500,"duration":0.504081429,"responseBody":"nats: no response from stream"}}
or in short:
```
statusCode: 500
nats: no response from stream
```
The NATS CLI command `nats stream ls` returned:
```
nats: error: could not list streams: JetStream system temporarily unavailable (10008), try --help
```
The last time the CLI was responding, that was the output:
I checked all three replicas:
`eventing-nats-0`: used 100% of storage:
```
Filesystem Size Used Available Use% Mounted on
overlay 48.1G 7.7G 38.4G 17% /
tmpfs 64.0M 0 64.0M 0% /dev
tmpfs 14.7G 0 14.7G 0% /sys/fs/cgroup
/dev/sdg 973.4M 957.4M 0 100% /data
```
logs:
```
[32] 2022/08/16 06:07:41.318092 [INF] 100.64.0.55:54452 - rid:49 - Route connection created
[32] 2022/08/16 06:07:47.393696 [INF] JetStream cluster new metadata leader: eventing-nats-1/eventing-nats
[32] 2022/08/16 06:07:55.759515 [INF] 100.64.0.55:6222 - rid:71 - Route connection created
[32] 2022/08/16 06:07:55.759789 [INF] 100.64.0.55:6222 - rid:71 - Router connection closed: Duplicate Route
[32] 2022/08/16 06:07:58.934800 [INF] 100.64.0.56:33360 - rid:72 - Route connection created
[32] 2022/08/16 06:08:09.207150 [WRN] JetStream cluster consumer '$G > sap > 7dc03259c404acadba157faf767ecf83' has NO quorum, stalled.
[32] 2022/08/16 06:08:40.976160 [INF] 100.64.0.56:6222 - rid:74 - Route connection created
[32] 2022/08/16 06:08:40.976390 [INF] 100.64.0.56:6222 - rid:74 - Router connection closed: Duplicate Route
[32] 2022/08/16 07:58:31.069085 [ERR] JetStream failed to store a msg on stream '$G > sap': write /data/jetstream/$G/streams/sap/msgs/381.blk: no space left on device
[32] 2022/08/16 07:58:31.069109 [WRN] Error applying entries to '$G > sap': write /data/jetstream/$G/streams/sap/msgs/381.blk: no space left on device
[32] 2022/08/16 07:58:31.069120 [ERR] JetStream out of File resources, will be DISABLED
[32] 2022/08/16 07:58:31.070137 [INF] Initiating JetStream Shutdown...
[32] 2022/08/16 07:58:31.086371 [INF] JetStream Shutdown
[32] 2022/08/16 11:59:27.307609 [INF] 100.64.0.55:54452 - rid:49 - Router connection closed: Client Closed
```
`eventing-nats-1` (leader): used 96% of storage:
```
/ # df -h
Filesystem Size Used Available Use% Mounted on
overlay 48.1G 7.7G 38.4G 17% /
tmpfs 64.0M 0 64.0M 0% /dev
tmpfs 14.7G 0 14.7G 0% /sys/fs/cgroup
/dev/sdj 973.4M 914.7M 42.8M 96% /data
```
logs:
```
[32] 2022/08/16 13:23:47.331671 [INF] JetStream cluster no metadata leader
[32] 2022/08/16 13:23:48.351057 [WRN] JetStream cluster consumer '$G > sap > d4bc1887b4ec2f3950a463970e636a90' has NO quorum, stalled.
[32] 2022/08/16 13:23:48.435390 [WRN] JetStream cluster consumer '$G > sap > c74c20756af53b592f87edebff67bdf8' has NO quorum, stalled.
[32] 2022/08/16 13:23:49.609318 [WRN] JetStream cluster consumer '$G > sap > cd5f48da5ce5a301f47665227286d69c' has NO quorum, stalled.
[32] 2022/08/16 13:23:50.568794 [WRN] JetStream cluster consumer '$G > sap > a8abb8e8322b06ac9922a1222341de71' has NO quorum, stalled.
[32] 2022/08/16 13:23:51.513825 [WRN] JetStream cluster consumer '$G > sap > d0a08552e41db68d73d54f2e19c4c118' has NO quorum, stalled
```
`eventing-nats-2`: used 100% of storage:
```
Filesystem Size Used Available Use% Mounted on
overlay 48.1G 7.7G 38.4G 17% /
tmpfs 64.0M 0 64.0M 0% /dev
tmpfs 14.7G 0 14.7G 0% /sys/fs/cgroup
/dev/sdj 973.4M 914.7M 42.8M 96% /data
```
logs:
```
[32] 2022/08/16 13:23:47.331671 [INF] JetStream cluster no metadata leader
[32] 2022/08/16 13:23:48.351057 [WRN] JetStream cluster consumer '$G > sap > d4bc1887b4ec2f3950a463970e636a90' has NO quorum, stalled.
[32] 2022/08/16 13:23:48.435390 [WRN] JetStream cluster consumer '$G > sap > c74c20756af53b592f87edebff67bdf8' has NO quorum, stalled.
[32] 2022/08/16 13:23:49.609318 [WRN] JetStream cluster consumer '$G > sap > cd5f48da5ce5a301f47665227286d69c' has NO quorum, stalled.
[32] 2022/08/16 13:23:50.568794 [WRN] JetStream cluster consumer '$G > sap > a8abb8e8322b06ac9922a1222341de71' has NO quorum, stalled.
[32] 2022/08/16 13:23:51.513825 [WRN] JetStream cluster consumer '$G > sap > d0a08552e41db68d73d54f2e19c4c118' has NO quorum, stalled
```
So the problem was indeed the full storage, even though the NATS SDK reported `no response from stream` instead of `no space left on device`.
NATS Health checks were ok:
```
{
  "status": "ok"
}
```
but JetStream was disabled:
```
{
"server_id": "NAB3IMUKTXEIZ7QG3CRHYE3KXXZ3RFXHNWVZZ55DQ4P6ZIZTIIFRNAVB",
"now": "2022-08-16T13:26:58.286323634Z",
"disabled": **true**,
"config": {
"max_memory": 0,
"max_storage": 0
},
"memory": 0,
"storage": 0,
"reserved_memory": 0,
"reserved_storage": 0,
"accounts": 0,
"ha_assets": 0,
"api": {
"total": 0,
"errors": 0
},
"streams": 0,
"consumers": 0,
"messages": 0,
"bytes": 0
}
```
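This mismatch (server health check OK, JetStream disabled) can be detected programmatically. A minimal sketch, assuming the NATS monitoring port (8222) is reachable and using nats.go on the client side; the URLs are illustrative:

```go
package main

import (
	"encoding/json"
	"errors"
	"log"
	"net/http"

	"github.com/nats-io/nats.go"
)

func main() {
	// The NATS monitoring endpoint /jsz reports whether JetStream is disabled,
	// even while /healthz still returns {"status": "ok"}.
	resp, err := http.Get("http://eventing-nats.kyma-system:8222/jsz")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var jsz struct {
		Disabled bool `json:"disabled"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&jsz); err != nil {
		log.Fatal(err)
	}
	log.Println("JetStream disabled:", jsz.Disabled)

	// From the client side, AccountInfo fails when JetStream is unavailable.
	nc, err := nats.Connect("nats://eventing-nats.kyma-system:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := js.AccountInfo(); err != nil {
		if errors.Is(err, nats.ErrJetStreamNotEnabled) {
			log.Println("JetStream is not enabled on the connected server")
		} else {
			log.Println("JetStream account info error:", err)
		}
	}
}
```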
Both Subscriptions were in the false state because of:
```yaml
conditions:
- lastTransitionTime: "2022-08-16T09:10:19Z"
  message: 'nats: JetStream system temporarily unavailable'
  reason: NATS Subscription not active
  status: "False"
  type: Subscription active
```
and the stream reported the following cluster and state information:
```
Name: eventing-nats
Leader: #
Replica: eventing-nats-0, outdated, seen 7m5s ago, 1,851,676 operations behind
Replica: eventing-nats-1, outdated, seen 3h8m7s ago, 1,851,676 operations behind
Replica: eventing-nats-2, outdated, seen 5m44s ago, 1,851,676 operations behind
State:
Messages: 1,436,508
Bytes: 941 MiB
FirstSeq: 29,997 @ 2022-08-17T12:48:52 UTC
LastSeq: 1,851,674 @ 2022-08-17T14:59:25 UTC
Deleted Messages: 385170
Active Consumers: 20
```
We see that the replicas are outdated, no leader is elected, and there are over 1,400,000 undelivered messages.
Here is what the dashboard displayed:
Trying to revive the JetStream
One way to revive JetStream is to restart the nodes. I deleted all three pods sequentially; JetStream then became enabled again and started retrying to dispatch the messages from the stream. To test the redelivery, I turned on the sinks. One thing I observed is that sinks with their current resources
```yaml
maxReplicas: 1
minReplicas: 1
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 50m
    memory: 64Mi
```
cannot keep up with a redelivery rate of 1000 requests per second. The sinks kept crashing under the load and were constantly restarting, so they were the bottleneck during redelivery:
Eventing-controller logs:
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"ae34bd3a-b3b1-48d9-a1b7-38d59b80638b","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v030","sink":"http://sink-0.tunas.svc.cluster.local"}}
{"level":"INFO","timestamp":"2022-08-17T08:41:02Z","logger":"nats-handler","caller":"nats/handler.go:195","message":"Event dispatched","context":{"id":"1c9fe2ee-b070-4338-8a25-23cd95df121b","source":"kyma","before":"sap.kyma.custom.commerce.order.updated.v020","after":"sap.kyma.custom.commerce.order.updated.v020","statusCode":500,"duration":5.000078098,"responseBody":"context deadline exceeded"}}
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"36335f7f-2882-46f7-888a-f4dac2ce6cb9","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v050","sink":"http://sink-0.tunas.svc.cluster.local"}}
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"91cb4e9a-0a38-491a-914c-1b664d5118ba","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v050","sink":"http://sink-0.tunas.svc.cluster.local"}}
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"e56ec64b-bb4d-4413-97a0-4c5a8bcb1efc","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v050","sink":"http://sink-0.tunas.svc.cluster.local"}}
{"level":"ERROR","timestamp":"2022-08-17T07:41:04Z","logger":"jetstream-handler","caller":"handlers/jetstream.go:473","message":"Failed to dispatch the CloudEvent","context":{"id":"8eddd144-49b6-4077-b5dc-e7c5a33118ca","source":"kyma","type":"sap.kyma.custom.commerce.order.created.v080","sink":"http://sink-0.tunas.svc.cluster.local"}}
Some events got dispatched, but at a very slow rate, because the sink started getting:
```
Warning Unhealthy 57m (x2 over 57m) kubelet Startup probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 56m (x4 over 56m) kubelet Liveness probe failed: Get "http://100.64.0.86:15020/app-health/function/livez": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 17m (x98 over 56m) kubelet Readiness probe failed: Get "http://100.64.0.86:15020/app-health/function/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```
Scaling up the sinks to:
```yaml
maxReplicas: 5
minReplicas: 1
resources:
  limits:
    cpu: 2000m
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 128Mi
```
did the trick, and the sinks were able to receive the events without any loss.
Conclusions
- When the PVC storage gets full, JetStream gets disabled. NATS doesn't expose the right error message: instead of propagating the `no space left on device` error, NATS returns `no response from stream`, which causes confusion. As per the NATS core maintainers, this is the expected behaviour (reference).
- When JetStream is disabled, no events can be published to the stream, be it:
{"level":"INFO","timestamp":"2022-08-17T07:23:21Z","logger":"nats-handler","caller":"nats/handler.go:195","message":"Event dispatched","context":{"id":"A2111-1111-1111","source":"noapp","before":"sap.kyma.custom.noapp.order.created.v1","after":"sap.kyma.custom.noapp.order.created.v1","statusCode":500,"duration":0.504081429,"responseBody":"nats: no response from stream"}}
or
{"level":"INFO","timestamp":"2022-08-17T08:41:03Z","logger":"nats-handler","caller":"nats/handler.go:195","message":"Event dispatched","context":{"id":"c7fa3968-2610-4a40-baf9-95e8c57f731a","source":"kyma","before":"sap.kyma.custom.commerce.order.updated.v030","after":"sap.kyma.custom.commerce.order.updated.v030","statusCode":500,"duration":5.00035585,"responseBody":"context deadline exceeded"}}
According to the current EPP code:
```go
_, err = jsCtx.PublishMsg(msg)
if err != nil {
	...
	return http.StatusInternalServerError, err
}
```
the error gets propagated, and no further messages are accepted into the stream.
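As a sketch of the direction the tasks below point to, the publish error could be inspected with the errors package and logged precisely, instead of emitting a misleading `Event dispatched` entry with a 500. This is illustrative only and not the actual EPP implementation; the package name, status-code mapping, and log messages are assumptions.

```go
package handler // illustrative package, not the actual EPP code

import (
	"context"
	"errors"
	"log"
	"net/http"

	"github.com/nats-io/nats.go"
)

// publish mirrors the snippet above, but inspects the error before answering.
func publish(jsCtx nats.JetStreamContext, msg *nats.Msg) (int, error) {
	_, err := jsCtx.PublishMsg(msg)
	if err == nil {
		return http.StatusNoContent, nil
	}

	switch {
	case errors.Is(err, nats.ErrNoStreamResponse):
		// Seen when the stream cannot accept messages, e.g. because JetStream
		// got disabled after the file storage filled up.
		log.Printf("event not accepted, no response from stream (JetStream possibly disabled): %v", err)
	case errors.Is(err, context.DeadlineExceeded):
		log.Printf("event not accepted, publish timed out: %v", err)
	default:
		log.Printf("event not accepted: %v", err)
	}
	return http.StatusInternalServerError, err
}
```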
Tasks:
- [ ] Change the EPP logs so that when the backend returns a 500, no `Event dispatched` message is displayed.
- [ ] NATS is currently revising (reference) their error handling in nats.go to make API errors inspectable via the errors package. Let's check with them about that and introduce better error handling/propagation here. https://github.com/kyma-project/kyma/issues/15150
- [ ] Create a troubleshooting guide on how to handle the case when JetStream gets disabled, be it due to full storage or `context deadline exceeded`. https://github.com/kyma-project/kyma/issues/15210
- [ ] Currently the Grafana dashboard doesn't show that storage usage reached 100%, even though JetStream gets disabled because no storage is left. Let's try to show the most accurate possible usage percentage in the dashboard. https://github.com/kyma-project/kyma/issues/15211
Follow-up questions:
- We have a `maxReconnects` value in our controller, which defines how many times we try to reconnect to the backend. What if the NATS backend gets disabled for some reason, the controller cannot connect to NATS after the configured retries and stops trying, and then the NATS backend becomes reachable again, but the controller does not reconnect? Should we restart the eventing-controller in that case? (See the sketch below.)
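To make the question concrete, here is a sketch of the relevant nats.go connection options (the URL and values are illustrative, not the eventing-controller's actual settings). With a finite MaxReconnects, the client closes the connection for good once the retries are exhausted and the ClosedHandler fires, so it will not reconnect on its own; nats.MaxReconnects(-1) retries forever.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(
		"nats://eventing-nats.kyma-system:4222", // illustrative URL
		nats.MaxReconnects(10),            // finite: the client gives up after 10 attempts
		nats.ReconnectWait(3*time.Second), // pause between attempts
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("disconnected from NATS: %v", err)
		}),
		nats.ReconnectHandler(func(c *nats.Conn) {
			log.Printf("reconnected to NATS at %s", c.ConnectedUrl())
		}),
		nats.ClosedHandler(func(_ *nats.Conn) {
			// Fires once MaxReconnects is exhausted; after this the client will
			// not reconnect on its own, so the controller has to rebuild the
			// connection (or be restarted).
			log.Print("NATS connection closed permanently")
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// ... controller logic would go here ...
}
```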