nats-surveyor Surveyor returns only metrics for number of expected servers

When configured with the number of expected servers, the surveyor returns metrics only up to that configured number. Example: our surveyor is connected to a leafnode server, which has a connection to a cluster in Azure. When configured with the correct number of all servers, it returns the metrics for all of them. When configured with expected count = 1, it returns metrics only for the single server it is connected to. Furthermore, metrics of additionally detected servers are not provided either. We are running nats-surveyor on a Windows 2016 Server using the latest version.

Aug 28 '20 05:08 wutkemtt

That’s the intended behaviour, yes. You tell it to expect n servers and it stops when it reaches that expectation.

Aug 28 '20 05:08 ripienaar

Okay, then I don't understand why it is called surveyor. I expect it to survey the whole server and cluster structure and return information about all servers it finds. Possibly the configuration option is misleading. Would it be better to call it survey depth instead of expected count?

Aug 28 '20 07:08 wutkemtt

No, I don’t think so. When Prometheus asks it for data, we publish a request for data and respond to Prometheus as soon as we have heard back from your servers.

I agree having to set expected is annoying though and we are thinking of ways to streamline that, but for now when faced with an unknown number of servers we have to know when we reached a healthy response count so we can give the data to prom ASAP.
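
The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not surveyor's actual code: the function name `gather_responses`, the use of an in-memory queue in place of NATS replies, and the server names are all assumptions made for the example.

```python
import queue
import time


def gather_responses(inbox, expected, timeout):
    """Collect replies until `expected` have arrived or `timeout` seconds pass.

    `inbox` stands in for the stream of server replies (hypothetical).
    """
    deadline = time.monotonic() + timeout
    replies = []
    while len(replies) < expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # poll timeout reached before all expected servers answered
        try:
            replies.append(inbox.get(timeout=remaining))
        except queue.Empty:
            break  # nothing more arrived within the timeout
    return replies


if __name__ == "__main__":
    inbox = queue.Queue()
    for name in ("leaf-1", "azure-1", "azure-2"):  # hypothetical server names
        inbox.put({"server": name})
    # With expected=1 the loop stops after the first reply,
    # so the two remaining replies are never read.
    print(gather_responses(inbox, expected=1, timeout=0.25))
```

With `expected=1`, the function returns as soon as one reply is in, which matches the report that only the directly connected server shows up in the metrics.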

Aug 28 '20 07:08 ripienaar

How is "healthy" defined? A helpful monitoring tool has to return the real current state of the cluster, regardless of the number of expected servers. If a server does not respond within the configured time frame, it is down as far as monitoring is concerned, or it has at least a significant problem. In that case I want to know, so I can fire an alert to our team and someone can take a look at the problem. Currently I see the nats-surveyor as a helpful probe for additional information, but not as a replacement for the prometheus-exporter, as discussed in the other issue.

Aug 28 '20 07:08 wutkemtt

Surveyor is limited by its design to a single point of view. We do have a more traditional prom exporter that you run against every single server on the same host, and there you get a more per-server-centric view.

Surveyor is network wide - from its perspective - and supports a number of network-wide views like audits and latency metrics that wouldn’t make sense for a per-server view.

I agree it’s a bit of a weird thing, and I would say for sure surveyor does not remove the need for individual node monitoring. It’s an overall view of the world and an aggregator of certain data like audits and latencies.

Aug 28 '20 08:08 ripienaar

I agree, that's why it is so helpful in providing additional data we would never get when using the prometheus exporters only.

Aug 28 '20 08:08 wutkemtt

So anyway. There are only 2 options: either we poll and always wait for the poll timeout no matter how many machines respond, or we only wait for expected.

The timeout-based approach would work better in dynamically scaled environments and we should probably support that model.

Neither would yield the most amazing results though imo

Aug 28 '20 08:08 ripienaar

I think you can achieve the desired behavior today by setting a large expected count and tuning the timeout appropriately. The downside is that logs would grow with missing server error messages.

e.g. nats-surveyor -c 9999 -timeout 250ms -creds test/SYS.creds (with the proper credentials and timeout).

Perhaps we can define an expected server count of -1 to mean undefined, which would do what the command above does, sans log errors.
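
To illustrate the idea, here is a rough sketch of how a -1 sentinel could behave. It is only an assumption about one possible design, not surveyor's implementation: the names `UNDEFINED` and `survey`, the in-memory queue standing in for server replies, and the warning text are all hypothetical.

```python
import queue
import time

UNDEFINED = -1  # hypothetical sentinel: no expected server count configured


def survey(inbox, expected, timeout):
    """Return (replies, warnings) for one poll.

    With a positive `expected`, stop early once that many replies arrived
    and warn if fewer came back. With UNDEFINED, always wait out the
    timeout and never warn about missing servers.
    """
    deadline = time.monotonic() + timeout
    replies = []
    while expected == UNDEFINED or len(replies) < expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            replies.append(inbox.get(timeout=remaining))
        except queue.Empty:
            break
    warnings = []
    if expected != UNDEFINED and len(replies) < expected:
        # This is the log noise a large count such as 9999 would produce.
        warnings.append(f"expected {expected} servers, got {len(replies)}")
    return replies, warnings
```

In this sketch, a count of 9999 and a count of -1 collect the same replies over the full timeout; the only difference is that -1 produces no missing-server warning.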

Aug 28 '20 15:08 ColinSullivan1

Yeah, the -1 is what I was going to add, I think.

Aug 28 '20 15:08 ripienaar
