JetStream scaler query the stream consumer's leader pod when clustered

Open rayjanoka opened this issue 1 year ago • 3 comments

I test drove the new JetStream scaler on my project and ran into an issue. Eventually I figured out that the NATS monitoring endpoint doesn't report accurate stream consumer metrics from every pod in a cluster.

I found that when running a cluster of NATS pods, only the stream consumer's leader pod reports the accurate number of messages in the queue. The JetStream scaler worked fine for me as long as KEDA was connected to the NATS consumer leader pod. As soon as I bounced KEDA's connection over to a consumer replica pod, the num_pending value there counted up but never down, so my deployment just scaled up and up and up.

➜ nats consumer info test-stream durable

Cluster Information:
                Name: nats
              Leader: nats-1   <---- Leader Pod
             Replica: nats-0, current, seen 0.33s ago
             Replica: nats-2, current, seen 0.33s ago
# Consumer Leader (accurate)
➜ curl -s "http://nats-1.nats.nats.svc.cluster.local:8222/jsz?consumers=true" | grep num_pending
              "num_pending": 0,

# Replicas (counting up only)
➜ curl -s "http://nats-0.nats.nats.svc.cluster.local:8222/jsz?consumers=true" | grep num_pending
              "num_pending": 6,
➜ curl -s "http://nats-2.nats.nats.svc.cluster.local:8222/jsz?consumers=true" | grep num_pending
              "num_pending": 6,

We are able to discover the jetstream consumer's leader via the existing monitoring endpoint call.

$ curl -s "http://nats.nats.svc.cluster.local:8222/jsz?consumers=true" | jq '.account_details[0].stream_detail[0].consumer_detail[0].cluster.leader'
"nats-0"
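The jq path above picks the first account, stream, and consumer by index. As a rough sketch of the same lookup by name, the Python below walks a parsed /jsz response and returns the consumer's leader. The function name and the sample payload are hypothetical, and the "name" keys on the stream and consumer entries are an assumption about the response shape; only account_details, stream_detail, consumer_detail, cluster.leader, and num_pending appear in the output shown in this description.

```python
from typing import Optional


def find_consumer_leader(jsz: dict, stream: str, consumer: str) -> Optional[str]:
    """Return the leader server name for a consumer, or None if it is not reported.

    `jsz` is the parsed JSON body of a /jsz?consumers=true response.
    """
    for account in jsz.get("account_details") or []:
        for stream_detail in account.get("stream_detail") or []:
            # Assumption: each stream entry carries its name under "name".
            if stream_detail.get("name") != stream:
                continue
            for consumer_detail in stream_detail.get("consumer_detail") or []:
                # Assumption: each consumer entry carries its name under "name".
                if consumer_detail.get("name") != consumer:
                    continue
                return (consumer_detail.get("cluster") or {}).get("leader")
    return None


# Hypothetical, trimmed-down payload for illustration only.
sample = {
    "account_details": [
        {
            "stream_detail": [
                {
                    "name": "test-stream",
                    "consumer_detail": [
                        {
                            "name": "durable",
                            "num_pending": 6,
                            "cluster": {"leader": "nats-0"},
                        }
                    ],
                }
            ]
        }
    ]
}

print(find_consumer_leader(sample, "test-stream", "durable"))
```

Matching by name avoids depending on the order in which the server lists accounts, streams, and consumers.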

To get an accurate count to the scaler, I wrote a change that makes a second request directly to the consumer leader pod via the headless service, e.g. consumer-leader-pod.nats.nats.svc.cluster.local.
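A minimal sketch of how the address for that second request could be assembled, assuming the pod-name.service.namespace.svc.cluster.local pattern seen in the curl commands above. The helper name and its parameters are hypothetical and are not taken from the scaler code.

```python
def leader_monitoring_url(
    leader: str,
    headless_service: str,
    namespace: str,
    port: int = 8222,
    scheme: str = "http",
) -> str:
    """Build the /jsz URL that targets one specific NATS pod.

    Assumes pods are addressable through a headless service as
    <pod>.<service>.<namespace>.svc.cluster.local.
    """
    host = f"{leader}.{headless_service}.{namespace}.svc.cluster.local"
    return f"{scheme}://{host}:{port}/jsz?consumers=true"


# Leader "nats-1" behind the headless service "nats" in namespace "nats".
print(leader_monitoring_url("nats-1", "nats", "nats"))
```

The second request then reads num_pending from this pod only, so the value comes from the consumer leader rather than from whichever pod the regular service happens to pick.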

NOTE: This fix only works for clusters that have the same number of pods as stream replicas, e.g. 3 pods and 3 stream replicas, or 5 pods and 5 stream replicas.

➜ nats stream info test-stream
Information for Stream test-stream

Configuration:

             Subjects: sub
             Replicas: 3   <---- stream replica setting here

If there are fewer stream replicas than pods, e.g. 3 pods and only 1 stream replica, NATS only displays metrics for that consumer on the single pod the stream is assigned to. The other pods have no record of the consumer at all, so the scaler only finds the metric if the k8s svc happens to route the request to that particular pod. Without any other clue about which pod has the metric, I think we'd have to rotate through each NATS pod blindly until we found it, which is not great. (I can document this limitation in the keda-docs for now.)
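For illustration only, the blind rotation mentioned above could look roughly like the sketch below. It is not part of this change. The function name is hypothetical, `fetch` is an injected callable that returns the parsed /jsz JSON for a URL, and the "name" keys on stream and consumer entries are an assumption about the response shape. It only makes sense for the single-replica case, where exactly one pod knows the consumer; with several replicas a non-leader pod could return a stale num_pending.

```python
from typing import Callable, Iterable, Optional


def find_num_pending(
    pod_urls: Iterable[str],
    fetch: Callable[[str], dict],
    stream: str,
    consumer: str,
) -> Optional[int]:
    """Query each pod in turn and return num_pending from the first pod
    that reports the consumer, or None if no pod knows about it."""
    for url in pod_urls:
        try:
            jsz = fetch(url)
        except OSError:
            # Pod unreachable (urllib's URLError is an OSError); try the next one.
            continue
        for account in jsz.get("account_details") or []:
            for stream_detail in account.get("stream_detail") or []:
                if stream_detail.get("name") != stream:
                    continue
                for consumer_detail in stream_detail.get("consumer_detail") or []:
                    if consumer_detail.get("name") == consumer:
                        return consumer_detail.get("num_pending")
    return None
```

The cost is up to one monitoring request per pod on every poll, which is why documenting the limitation seems preferable for now.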

@goku321 for visibility - thanks for contributing this!

I'll send a ticket over to the NATS side as well and see if they can work to provide accurate metrics across all NATS servers in a cluster so no matter what server we connect to we see everything.

Checklist

  • [X] Commits are signed with Developer Certificate of Origin (DCO - learn more)
  • [??] Tests have been added
  • [NA] A PR is opened to update our Helm chart (repo) (if applicable, ie. when deployment manifests are modified)
  • [TBD] A PR is opened to update the documentation on (repo) (if applicable)
  • [X] Changelog has been updated and is aligned with our changelog requirements

Fixes #

Relates to #

rayjanoka avatar Aug 18 '22 18:08 rayjanoka

@rayjanoka Thank you so much for giving it a test drive! Really appreciate all your efforts :)

Now I'm wondering whether we really need the single-server JetStream tests, or whether we should add tests for clustered JetStream instead. What do you think?

goku321 avatar Aug 18 '22 19:08 goku321

@goku321 nice! I did notice we had separate tests like that for redis and I think it is probably a good idea here as well. I don't have much experience writing tests, and it might take me a while to circle back to this to give those a shot. If you or anyone else would like to take a crack at it that would be fine.

rayjanoka avatar Aug 18 '22 20:08 rayjanoka

@rayjanoka Sure. I can start with the tests in a week or so. I hope that's okay :)

goku321 avatar Aug 21 '22 18:08 goku321