Troubleshooting a performance issue
Hi!
Taking you up on your offer in 588#issuecomment-2622127368. We're scaling up our gNMI(c) usage and are starting to see performance issues. Before we throw more CPU or RAM at the problem, I was wondering what we could do to troubleshoot it further.
We use one gNMIc instance per DC; those instances are configured in "subscribe" mode and are scraped by Prometheus. You can see our configuration on GitHub.
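For readers without access to that repo, the setup is roughly of the following shape (a minimal sketch only; target names, paths, ports and credentials here are illustrative placeholders, not our actual values):

```yaml
# Minimal sketch of a gNMIc instance in "subscribe" mode with a Prometheus output.
# All names and values below are illustrative, not our production config.
targets:
  router1.example.net:57400:
    username: gnmi-user
    password: gnmi-pass
    skip-verify: true

subscriptions:
  interfaces:
    paths:
      - /interfaces/interface/state/counters
    mode: stream
    stream-mode: sample
    sample-interval: 60s

outputs:
  prom:
    type: prometheus
    listen: :9804          # Prometheus scrapes this endpoint
    event-processors: []   # our real config attaches processors here
```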
The symptom is that we're seeing gaps in the metrics for our largest DC, as mentioned in our "internal" ticket. Investigating further, we see high variability in the metrics scraped from our busiest gNMIc instance, but no such issue on our 2nd busiest.
However, we can't pinpoint the root cause of that variability. We tried to make use of all the server's, Go's, and gNMIc's health metrics:
- https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?orgId=1&from=now-1h&to=now&var-site=eqiad
- https://grafana.wikimedia.org/d/000000377/host-overview?var-server=netflow1002
- https://grafana.wikimedia.org/d/CgCw8jKZz/go-metrics?orgId=1&var-job=gnmic&var-instance=netflow1002:7890
But we can't figure out whether we're short on threads, CPU, RAM, or something else.
And, further, whether any processor can be optimized (if the issue comes from one of them, such as event-value-tag-v2).
Do you have any pointers or rules of thumb on what we should improve to be able to scale up our gNMIc instance?
Thanks
For a bit more background, we had similar gaps/performance issues before, and Karim was kind enough to help us out (in #588). He added event-value-tag-v2, which improved the situation there. I suppose this previous experience is what makes us think the issue may be due to the performance of the processors, but that's only a suspicion.
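For reference, that processor block is roughly of this shape (a sketch only: the processor name and the value-name/tag-name/consume options are assumptions based on the event-value-tag processor's documentation; our real chain is in the linked config):

```yaml
processors:
  # Hypothetical example of an event-value-tag-v2 processor; option names are
  # assumed to mirror event-value-tag and may not match our real configuration.
  interface-description-to-tag:
    event-value-tag-v2:
      value-name: description   # value to promote to a tag (assumption)
      tag-name: ifDescr         # name of the resulting tag (assumption)
      consume: false            # keep the original value in the event (assumption)
```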
Since moving to event-value-tag-v2 we did have a further issue, with gaps in metrics that seemed performance-related, which we solved by increasing the num-workers setting of the prometheus output (see our ticket here for more info).
Some progress thanks to this comment: https://github.com/openconfig/gnmic/issues/498#issuecomment-2263694440
Bumping num-workers to 12 (first blue line) didn't improve things, but bumping it to 16 (second blue line) seems to have done the job.
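In config terms that's just the num-workers knob on the prometheus output, along these lines (output name and listen port are illustrative, not our actual values):

```yaml
outputs:
  prom:
    type: prometheus
    listen: :9804     # illustrative port
    # Raised from 12 to 16 after still seeing gaps at 12 (the blue lines on our graphs).
    num-workers: 16
```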
I'd love to better understand how to fine-tune num-workers and how high we can increase it.
For example, would it be possible to have a Prometheus metric that tracks the number of workers in use at a given time?
Just for my understanding, @XioNoX: are you already running gnmic in a cluster, or is it a single instance?
We're running it as a single instance.
The number of workers is simply the number of incoming (gNMI) messages that the Prometheus output can process in parallel. "Process" here means converting to event format, applying the processors, calculating a unique hash key for the event, and storing it in an internal cache (that cache is read when a scrape request is received).
You were seeing a lot of goroutines being spawned because all the workers were busy: the queue filled up, so the writing routine stalled (as a goroutine) and eventually timed out, resulting in missing metrics.
If the goal is to have a worker available for each message received, and we know that routers send messages in sequence, you can start with a number of workers equal to the number of targets. From there, if there are no issues, reduce the number of workers until you hit some timeouts. If there are issues, the choke point is probably the processors, i.e. a single worker is not enough to handle a single target's stream(s). With too many workers you will hit diminishing returns, since writing to the internal cache is behind a common lock.
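To make that concrete, a rough sketch (the output name, port and the 40 are purely illustrative; use your own target count as the starting point):

```yaml
outputs:
  prom:
    type: prometheus
    listen: :9804     # illustrative port
    # Starting point: one worker per target (e.g. ~40 targets -> 40 workers).
    # If there are no gaps/timeouts, walk this number down; if there still are,
    # the processors are likely the bottleneck for a single target's stream.
    # Beyond a point, extra workers mostly contend on the shared cache lock.
    num-workers: 40
```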
But I think you figured it out; 16 sounds like it did the trick.
> For example, would it be possible to have a Prometheus metric that tracks the number of workers in use at a given time?
The number of workers is static: they are all spawned when the output starts and remain idle until there is a message in the queue.
Thank you for the clarification, that helps a lot!