
Add function to expose /proc/softnet_stat data

Open ChimeraCoder opened this issue 7 years ago • 12 comments

Summary

On Linux systems, we have access to network performance data at /proc/softnet_stat. This allows us to do things like monitor the number of packets we're dropping.

This data is almost wholly undocumented, except in the source code. The packagecloud blog has a great post that goes into more detail.

/proc/softnet_stat tells us whether the NIC is out of memory. This contrasts with /proc/net/snmp, whose InErrors counter tells us when a socket's receive buffer is full.
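For reference, /proc/softnet_stat is one row of space-separated hex counters per CPU; the first three columns are packets processed, packets dropped (backlog full), and time-squeeze count. A minimal Go sketch of a parser — the `SoftnetLine` type and `parseSoftnet` helper are illustrative names, not Veneur's actual code:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// SoftnetLine holds the first three columns of one CPU's row in
// /proc/softnet_stat. The kernel file itself is unlabeled hex columns;
// these field names are our own.
type SoftnetLine struct {
	Processed   uint64 // packets handled by this CPU
	Dropped     uint64 // packets dropped because the backlog queue was full
	TimeSqueeze uint64 // times net_rx_action ran out of budget/time
}

// parseSoftnet parses the contents of /proc/softnet_stat, one row per CPU.
func parseSoftnet(contents string) ([]SoftnetLine, error) {
	var out []SoftnetLine
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 3 {
			continue
		}
		var vals [3]uint64
		for i := 0; i < 3; i++ {
			v, err := strconv.ParseUint(fields[i], 16, 64)
			if err != nil {
				return nil, fmt.Errorf("bad field %q: %v", fields[i], err)
			}
			vals[i] = v
		}
		out = append(out, SoftnetLine{vals[0], vals[1], vals[2]})
	}
	return out, sc.Err()
}

func main() {
	// One sample CPU row, as it appears in the kernel file.
	sample := "0000272d 00000003 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000\n"
	lines, _ := parseSoftnet(sample)
	fmt.Println(lines[0].Processed, lines[0].Dropped) // → 10029 3
}
```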

If needed, the latter can be tuned with sysctl (net.core.rmem_{max,default}). This is what we're doing when we call unix.SetsockoptInt from within Veneur.

Motivation

Verify that UDP network traffic is not a bottleneck

Test plan

Rollout/monitoring/revert plan

r? @cory-stripe cc @stripe/observability

ChimeraCoder avatar Oct 17 '17 21:10 ChimeraCoder

Gerald Rule: Copy Observability on Veneur and Unilog pull requests

cc @stripe/observability cc @stripe/observability-stripe

stripe-ci avatar Oct 17 '17 21:10 stripe-ci

Neat! Buuuut, this doesn't emit any metrics? (yet?)

cory-stripe avatar Oct 17 '17 21:10 cory-stripe

Ah yeah, I should have tagged this as [wip]. This doesn't actually expose the metrics yet; it just adds the function that does the hard part.

ChimeraCoder avatar Oct 17 '17 21:10 ChimeraCoder

Wouldn't this be better as an agent check rather than adding it in Veneur?

cory-stripe avatar Oct 23 '17 14:10 cory-stripe

I don't think so. This is important enough to Veneur itself that we should provide the means for this functionality whether or not Veneur is running alongside the Datadog agent.

ChimeraCoder avatar Dec 19 '17 17:12 ChimeraCoder

I'm unsure why this data merits being here but others don't. This feels like a very steep, slippery slope into being a system agent…

cory-stripe avatar Dec 19 '17 17:12 cory-stripe

Here's the way I look at the dividing line: Veneur should be able to diagnose (but not fix) its own health without external involvement. In other words, we should be able to run Veneur under nothing more than a process supervisor (which detects process crashes), with Veneur itself responsible for correctly answering questions like "am I unhealthy, overloaded, or otherwise not behaving properly?"

If the answer is "yes", we can then use external systems to mitigate automatically. The easiest way we do this today is to crash the process (and let daemontools/systemd pick up the rest), but obviously there are more advanced options as well. (Veneur is not responsible for its own controllability - it's only responsible for coming up with the answers to those questions.)

This is more or less the philosophy of the Operator principle in Kubernetes - while we aren't moving (everything) to Kubernetes (yet), we are adopting the Operator principle more or less across our infrastructure even where we're not running Kubernetes.

So, the reason we want /proc/softnet_stat is because it tells us whether Veneur itself is actually working properly. In an ideal world, this would allow us to have a more sophisticated response to /healthcheck - if we're dropping a lot of packets, Veneur could start failing its own healthcheck, causing us to replace it. I'd rather not stick that in the Datadog agent, because

(a) we don't want to tie Veneur needlessly to the Datadog agent; and

(b) that prevents us from being able to leverage this data in Veneur's /healthcheck - Kubernetes gives us a lot of stuff "for free" if we're able to access the data there.
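A hypothetical sketch of that more sophisticated healthcheck: compare per-CPU drop counters from two successive reads of /proc/softnet_stat and fail if too many new drops appeared. The `dropDelta` and `healthy` names and the threshold policy are illustrative, not a proposed API; counter wrap is ignored for simplicity.

```go
package main

import "fmt"

// dropDelta compares two readings of per-CPU softnet drop counters and
// returns the total number of new drops since the previous reading.
// Counter wrap is ignored for simplicity; this is a sketch, not
// Veneur's actual healthcheck.
func dropDelta(prev, curr []uint64) uint64 {
	var total uint64
	for i := range curr {
		if i < len(prev) && curr[i] >= prev[i] {
			total += curr[i] - prev[i]
		}
	}
	return total
}

// healthy is a hypothetical policy: report unhealthy if more than
// maxDrops packets were dropped since the last check.
func healthy(prev, curr []uint64, maxDrops uint64) bool {
	return dropDelta(prev, curr) <= maxDrops
}

func main() {
	prev := []uint64{10, 0, 3} // previous per-CPU drop counters
	curr := []uint64{12, 0, 7} // current per-CPU drop counters
	fmt.Println(dropDelta(prev, curr), healthy(prev, curr, 100)) // → 6 true
}
```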

ChimeraCoder avatar Dec 19 '17 17:12 ChimeraCoder

Ok, this is definitely convincing. In theory this gives us a semblance of utilization… but UDP is trending away from being our primary source of load, in favor of domain sockets and streams.

What data would we peek at to get this for TCP? Are there other metrics we consider influential to health in this regard?

I'm being pedantic here because this takes us into new territory and opens the door for other things. Should veneur monitor disk fullness or other types of things that emit to dmesg? This is really interesting to me, but it seems ripe for a lot of code duplication and half-implementations if replicated across various services.

cory-stripe avatar Dec 19 '17 18:12 cory-stripe

In theory this gives us a semblance of utilization… but UDP is trending away from being our primary source of load, in favor of domain sockets and streams.

Sure - if there's an equivalent for domain sockets (buffer usage?), we should add that too!

What data would we peek at to get this for TCP?

Offhand, I'm not really sure what the equivalent measure is for TCP.

Should veneur monitor disk fullness or other types of things that emit to dmesg?

No - veneur doesn't (directly) write to disk. Even if it did, we wouldn't want to track that independently of the errors it would presumably report when attempting to write.

The reason that these packet drops are important is because Veneur is a bit of a special case - it's consuming the observability primitives themselves, so it needs to be able to know if it's systematically missing any.

In general, I don't think services should be reading /proc/softnet_stat, but if Veneur doesn't, there's no way for us to know if the data it's reporting from other services is secretly incomplete.

In a pure-Kubernetes world, you're right that we would also probably want a DaemonSet that read this data on every node, to aid in scheduling pods on nodes (as opposed to scaling application pods). But we're not in a pure-Kubernetes world, and also that's still at most two places (so, some duplication, but not a ton). And I think that DaemonSet would be something like veneur-prometheus, so it could share the code anyway.

ChimeraCoder avatar Dec 19 '17 22:12 ChimeraCoder

Has anyone reported this as a problem though? We don't actively monitor this, but we do have some charts for it. I don't recall it ever being a thing in Veneur.

I agree with your reasons for this, but it feels weird to have one of our ingress mechanisms measured in this way but none of the others. For example, we know of active TCP use in the wild, and our passive monitoring shows we don't have this problem in our infrastructure. It seems that network-stack health is a fundamental health question for any network-based service, and that one would be better served using the system agent's mechanisms to monitor and emit this.

Separately, @asf-stripe and I have been discussing high-frequency, side-band evaluation of arbitrary metrics from a Veneur instance for health. For example, if a system agent emitted this data into a local Veneur, and there were a mechanism in Veneur to "peek" into that data locally — think of Prometheus' local endpoint for scraping — then a separate process on the host could react to failed SLOs for network-stack signals and mark health.

In other words, I guess I still feel like this is a very system agenty thing. :/

cory-stripe avatar Dec 20 '17 16:12 cory-stripe

(continuing from Slack)

Has anyone reported this as a problem though? We don't actively monitor this, but we do have some charts for it. I don't recall it ever being a thing in Veneur.

Yup - I originally wrote this up because of an issue we were investigating, involving hypothesized overloaded instances.

Separately @asf-stripe and I have been discussing high frequency side-band evaluation of arbitrary metrics from a Veneur instance for health.

We could do that, but it's orthogonal to the real problem at hand: how does an actual instance of Veneur know that it's overloaded? It can't rely on another service to feed it those metrics, because those metrics would be dropped if it's overloaded.

It seems that network stack health is a fundamental health question for any network based service and that one would be better served using the system agent's mechanisms to monitor and emit this.

Veneur's in a special place because it's intended to run on every pod, and, in addition, if it gets overloaded, then we lose all visibility into the pod.

This may make more sense once I get around to writing up the "Veneur + Kubernetes" post, but in short: Veneur needs to be able to evaluate the health of itself within a pod, without talking to any external services. We could (optionally) have system agents checking this information as well, but we can't rely on them for determining if Veneur itself is overloaded or not.

ChimeraCoder avatar Dec 20 '17 22:12 ChimeraCoder

Ok. This still makes my SRE brain itch, but I'm in FUDland now and that's not helpful.

cory-stripe avatar Dec 20 '17 22:12 cory-stripe