
[Feat] [Use-cases]: Consul Monitoring with Netdata

Open sashwathn opened this issue 3 years ago • 6 comments

Problem

Netdata does not currently support monitoring Consul applications/metrics, and we need to support this use case.

Description

KEY METRICS

CONSUL

  • [ ] Number of Consul Servers by DC
  • [ ] Consul Nodes by Service Toplist
  • [ ] Leader last contact with followers
  • [ ] Number of Consul nodes by service
  • [ ] Number of Consul nodes
  • [ ] Leader time to append entries
  • [ ] Number of Consul leaders (Leadership transition event overlay)
  • [ ] Consul cluster join and failure
  • [ ] Latency of leader commit to disk
  • [ ] Consul raft commit time
  • [ ] Time to reconcile members
  • [ ] Consul event queue
  • [ ] Consul Events

LEADERSHIP TRANSITIONS

  • [ ] Number of raft peers
  • [x] consul.raft.state.leader
  • [x] consul.raft.leader.lastContact
  • [ ] CPU idle

RAFT COMMITS

  • [x] consul.raft.apply
  • [x] consul.raft.commitTime
  • [ ] consul.raft.rpc.appendEntries
  • [ ] consul.leader.reconcile
  • [ ] consul.raft.leader.dispatchLog
  • [ ] consul.raft.snapshot.create

GOSSIP METRICS

  • [ ] consul.serf.member.flap

COMMUNICATION METRICS

  • [ ] consul.http.<VERB>.<ENDPOINT>
  • [ ] consul.rpc.request
  • [ ] consul.rpc.query
  • [ ] consul.rpc.request_error
  • [x] consul.client.rpc
  • [x] consul.client.rpc.failed
  • [x] consul.client.rpc.exceeded
  • [ ] consul.rpc.cross-dc
  • [ ] consul.dns.domain_query.<NODE>

STATE METRICS

  • [x] consul.runtime.alloc_bytes
  • [ ] Free System RAM
  • [x] consul.runtime.sys_bytes
  • [ ] consul.kvs.apply.avg
  • [x] consul.kvs.apply.count
  • [x] consul.kvs.apply

OTHER

  • [ ] consul.check
  • [ ] consul.up
  • [ ] consul.can_connect
  • [ ] consul.new_leader

ENVOY

  • [ ] Requests
  • [ ] Total connections per cluster
  • [ ] Bytes received and sent (B/s)
  • [ ] Total active clusters
  • [ ] Total warming clusters
  • [ ] Total pending requests per cluster
  • [ ] Total cluster membership update successes

Additionally, there are other areas for potential integration between Netdata and HashiCorp:

cc: @ktsaou @cakrit @shyamvalsan @amalkov @ralphm

Importance

must have

Value proposition

  1. Allows Consul users to monitor various metrics
  2. Provides value as a new use-case to monitor via Netdata

Proposed implementation

No response

sashwathn avatar Aug 29 '22 13:08 sashwathn

@ilyam8 @vlvkobal @thiagoftsm : Can you please take a look at this and see if any information is missing?

The Consul metrics are available through the APIs listed below, and it should be relatively easy to implement a collector for them:

  • https://www.consul.io/api-docs/agent.
  • https://www.consul.io/docs/agent/telemetry
  • https://www.consul.io/docs/k8s/connect/observability/metrics

This link explains how to enable telemetry; once we have our collector up and running, we should check with Consul about adding Netdata as a supported telemetry agent: https://learn.hashicorp.com/tutorials/consul/monitor-datacenter-health?in=consul/day-2-operations
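To make the discussion concrete, here is a minimal sketch (in Go, the language of Netdata's go.d collectors) of how a collector could query the agent telemetry endpoint from the API docs above. The endpoint path, the `format=prometheus` query parameter, and the `X-Consul-Token` header come from the Consul docs; the base URL, timeout, and function names are illustrative assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// metricsURL builds the agent telemetry endpoint URL. The
// "format=prometheus" parameter is only honoured when Consul is started
// with a telemetry { prometheus_retention_time = "..." } stanza.
func metricsURL(base string, prometheus bool) string {
	u := base + "/v1/agent/metrics"
	if prometheus {
		u += "?format=prometheus"
	}
	return u
}

// fetch performs the request; the token is only needed when ACLs are enabled.
func fetch(base, token string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, metricsURL(base, false), nil)
	if err != nil {
		return nil, err
	}
	if token != "" {
		req.Header.Set("X-Consul-Token", token)
	}
	client := &http.Client{Timeout: 2 * time.Second}
	return client.Do(req)
}

func main() {
	fmt.Println(metricsURL("http://127.0.0.1:8500", true))
	// prints: http://127.0.0.1:8500/v1/agent/metrics?format=prometheus
}
```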

sashwathn avatar Sep 01 '22 09:09 sashwathn

@sashwathn this request is not as easy as it may look:

  • The exposed set of telemetry metrics depends on the setup (I am not a specialist, but it likely depends on the number of instances in the cluster, the agent mode, etc.). The list I get when querying your instance:
            "Name": "consul.autopilot.failure_tolerance",
            "Name": "consul.autopilot.healthy",
            "Name": "consul.consul.members.clients",
            "Name": "consul.consul.members.servers",
            "Name": "consul.consul.state.config_entries",
            "Name": "consul.consul.state.config_entries",
            "Name": "consul.consul.state.config_entries",
            "Name": "consul.consul.state.config_entries",
            "Name": "consul.consul.state.config_entries",
            "Name": "consul.consul.state.config_entries",
            "Name": "consul.consul.state.config_entries",
            "Name": "consul.consul.state.config_entries",
            "Name": "consul.consul.state.config_entries",
            "Name": "consul.consul.state.config_entries",
            "Name": "consul.consul.state.connect_instances",
            "Name": "consul.consul.state.connect_instances",
            "Name": "consul.consul.state.connect_instances",
            "Name": "consul.consul.state.connect_instances",
            "Name": "consul.consul.state.connect_instances",
            "Name": "consul.consul.state.kv_entries",
            "Name": "consul.consul.state.nodes",
            "Name": "consul.consul.state.peerings",
            "Name": "consul.consul.state.service_instances",
            "Name": "consul.consul.state.services",
            "Name": "consul.raft.applied_index",
            "Name": "consul.raft.commitNumLogs",
            "Name": "consul.raft.last_index",
            "Name": "consul.raft.leader.dispatchNumLogs",
            "Name": "consul.raft.leader.oldestLogAge",
            "Name": "consul.runtime.alloc_bytes",
            "Name": "consul.runtime.free_count",
            "Name": "consul.runtime.heap_objects",
            "Name": "consul.runtime.malloc_count",
            "Name": "consul.runtime.num_goroutines",
            "Name": "consul.runtime.sys_bytes",
            "Name": "consul.runtime.total_gc_pause_ns",
            "Name": "consul.runtime.total_gc_runs",
            "Name": "consul.server.isLeader",
            "Name": "consul.session_ttl.active",
            "Name": "consul.client.rpc",
            "Name": "consul.raft.apply",
            "Name": "consul.rpc.request",
            "Name": "consul.api.http",
            "Name": "consul.api.http",
            "Name": "consul.fsm.coordinate.batch-update",
            "Name": "consul.memberlist.gossip",
            "Name": "consul.memberlist.gossip",
            "Name": "consul.raft.commitTime",
            "Name": "consul.raft.fsm.apply",
            "Name": "consul.raft.fsm.enqueue",
            "Name": "consul.raft.leader.dispatchLog",
            "Name": "consul.raft.thread.fsm.saturation",
            "Name": "consul.raft.thread.main.saturation"
  • There are 4 metric types: Gauge, Counter, Points, and Samples. My understanding is that a Sample does not carry a current value, but rather "Rate", "Sum", "Min", "Max", "Mean", and "Stddev".

  • All metrics have labels. On your instance the majority have an empty label set, but that is not always the case: if the label set can be non-empty under some condition, we need to take it into account (create a chart per label set).

  • Some metrics do not appear in every query (e.g. consul.client.rpc, see this chart).

TL;DR need time to understand Consul better.

ilyam8 avatar Oct 06 '22 07:10 ilyam8

@ilyam8 : Yes, you are right: I think some of the metrics depend on the deployment mode and on the existence of multiple agents in the cluster. Also, the metrics are indeed sampled every 60 seconds and sent as one of the aggregations mentioned above. Could we simply set the collector frequency to 60 seconds and display the value as received?

sashwathn avatar Oct 06 '22 13:10 sashwathn

@sashwathn my understanding is 10 seconds, not 60:

These metrics are aggregated on a ten second (10s) interval and are retained for one minute. An interval is the period of time between instances of data being collected and aggregated.

ilyam8 avatar Oct 06 '22 13:10 ilyam8

@ilyam8 : Yes, indeed. My bad. :) The Prometheus collector has a collection interval of 60 seconds by default, hence the confusion.

sashwathn avatar Oct 06 '22 13:10 sashwathn

The following is implemented:

Requirement:

  • Consul with Prometheus telemetry enabled.

See the list of metrics and labels for details.


cc @ralphm

Findings:

Summaries

All timing-related metrics are exposed as summaries. See server_v1-agent-metrics.txt for example.

Summaries have MaxAge hard coded to 10 seconds.

MaxAge defines the duration for which an observation stays relevant for the summary. This only applies to pre-calculated quantiles. This does not apply to _sum and _count.

It means that all quantile values:

  1. cover the last 10 seconds.
  2. are reset to NaN every 10 seconds.
  3. remain NaN until the next (metric) observation.
  4. go back to 1.

As a result, all quantiles are NaN for some percentage of the time, depending on the load on the Consul cluster. Netdata doesn't really support skipping data collection: it assumes that metrics are collected every "update_every", and if they are not, there are gaps on the charts and it looks like something is wrong (UI).

I don't think we have many options here:

  • don't collect when NaN (charts with gaps).
  • treat NaN as 0 (charts with no gaps, but the aggregation over time will likely be wrong).

Ephemeral metrics

Metrics created at run time are considered ephemeral (not 100% sure, but I think these are the ones not added in getPrometheusDefs).

This causes these metrics to be deleted if they haven't been updated for longer than the expiration time (controlled by prometheus_retention_time for Prometheus export).

The "Key Metrics" include some ephemeral metrics. The lower the retention time, the lower the load on the server, but the higher the chance of gaps.

Metrics that depend on the Consul Agent mode and leadership status

The Consul Agent has two modes:

  • server
  • client

Every Datacenter has one leader. This is a role, and it can move from one server instance to another at any given time.

  1. Some key metrics that (as I understand the docs; I may be wrong) belong to servers only or to the server leader only are exposed regardless (e.g. raft_leader_lastContact and raft_leader_oldestLogAge should be leader-only, but are not).
  2. This is not always the case, and there are some metrics that are only exported if the instance has a certain mode or leadership status.

A question about 1.: do we check the leadership status and collect leader-only metrics only when the node is the leader? If so, we need to obsolete and (re)create the leader-specific charts on every leadership transition. Local dashboard users will lose historical data (I believe this won't be the case for the Cloud UI, because it can query archived charts).
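If we do gate on leadership status, the chart lifecycle could look like this sketch. Only the state machine is shown; the actual chart create/obsolete calls into the go.d framework are omitted, and onLeaderStatus is a hypothetical name:

```go
package main

import "fmt"

// collector tracks whether leader-only charts currently exist.
type collector struct {
	leaderChartsUp bool
}

// onLeaderStatus returns the action to take after observing the node's
// leadership status (e.g. from the consul.server.isLeader gauge).
func (c *collector) onLeaderStatus(isLeader bool) string {
	switch {
	case isLeader && !c.leaderChartsUp:
		c.leaderChartsUp = true
		return "create-leader-charts"
	case !isLeader && c.leaderChartsUp:
		c.leaderChartsUp = false
		return "obsolete-leader-charts"
	}
	return "no-op"
}

func main() {
	c := &collector{}
	fmt.Println(c.onLeaderStatus(true))  // prints: create-leader-charts
	fmt.Println(c.onLeaderStatus(false)) // prints: obsolete-leader-charts
}
```

The downside noted above remains: every transition recreates the charts, so the local dashboard loses their history.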

'disable_hostname' config option seems buggy

There is the disable_hostname config option:

This controls whether or not to prepend runtime telemetry with the machine's hostname, defaults to false.

The default results in exporting duplicate metrics:

  • Only Gauges are affected.
  • Metrics without the hostname have the correct Help text, a wrong value, and no labels.
  • Metrics with the hostname have the metric name as the Help text and a valid value.

An example from an instance with disable_hostname: false:

# HELP consul_satya_vm_server_isLeader consul_satya_vm_server_isLeader
# TYPE consul_satya_vm_server_isLeader gauge
consul_satya_vm_server_isLeader 1
# HELP consul_server_isLeader Tracks if the server is a leader.
# TYPE consul_server_isLeader gauge
consul_server_isLeader 0

# HELP consul_satya_vm_consul_members_servers consul_satya_vm_consul_members_servers
# TYPE consul_satya_vm_consul_members_servers gauge
consul_satya_vm_consul_members_servers{datacenter="us-central"} 3
# HELP consul_consul_members_servers Measures the current number of server agents registered with Consul. It is only emitted by Consul servers. Added in v1.9.6.
# TYPE consul_consul_members_servers gauge
consul_consul_members_servers 0

Just sharing, that is undocumented behavior.
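One way a collector could cope with this, assuming it knows the agent's hostname (e.g. from /v1/agent/self), is to fold the hostname-prefixed series (which carry the valid values) back into the canonical metric name and drop the unprefixed duplicates. This is a sketch based only on the observed behavior above, not on documented semantics; the hostname-to-name-segment mapping (non-alphanumeric characters become underscores) is an assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// normalize strips the hostname segment from a gauge name produced with
// disable_hostname=false, e.g. consul_satya_vm_server_isLeader with
// hostname "satya-vm" becomes consul_server_isLeader. The second return
// value reports whether the name was hostname-prefixed (and therefore,
// per the observation above, carries the valid value).
func normalize(name, hostname string) (string, bool) {
	seg := strings.Map(func(r rune) rune {
		if r == '-' || r == '.' {
			return '_'
		}
		return r
	}, hostname)
	prefix := "consul_" + seg + "_"
	if strings.HasPrefix(name, prefix) {
		return "consul_" + strings.TrimPrefix(name, prefix), true
	}
	return name, false
}

func main() {
	name, ok := normalize("consul_satya_vm_server_isLeader", "satya-vm")
	fmt.Println(name, ok) // prints: consul_server_isLeader true
}
```

A collector using this would keep only the series where the second return value is true and report them under the normalized name.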

ilyam8 avatar Dec 19 '22 19:12 ilyam8