telegraf
Monitor application socket buffers
Feature Request
Opening a feature request kicks off a discussion.
Proposal:
Telegraf could monitor application socket send/recv buffer sizes.
Current behavior:
No such feature
Desired behavior:
Such a feature
Use case:
The idea is that if there is congestion somewhere, the buffers will start filling up. On a local application, if the application isn't processing incoming data fast enough, the receive buffer will start to fill up. If the remote application isn't receiving fast enough, or if there is network congestion, the send buffer will start filling up.
These numbers are visible in the `Recv-Q` and `Send-Q` columns of `netstat` output, and in the `tx_queue`/`rx_queue` fields of `/proc/net/tcp`.
The sticky part is how we want to monitor this without causing a cardinality explosion, since these buffers are tracked on a per-socket basis.

My original thought was to make this part of the procstat input, but it's not a one-to-one relationship. And I don't like the idea of aggregating: when there are multiple connections to various endpoints, only one of them may be an issue.

So then the next thought is a measurement for all network connections, with a field which contains the PID using the connection, plus fields for addrs/ports. But if PID, addrs & ports are fields rather than tags (preventing cardinality explosion), we don't have a tag which will let us perform grouping & aggregations in InfluxDB.
My only current thought is a connection index pool. Basically a pool of numbers, and every time a new connection is seen, we grab a number from the pool (if the pool is empty, create a new number as size of pool + 1), and that uniquely identifies the connection across polling intervals. Once the connection goes away, telegraf returns that number to the pool.
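That pool idea can be sketched in a few lines of Go (the type names and the string connection key are my own illustration, not an existing Telegraf type):

```go
package main

import (
	"fmt"
	"sort"
)

// indexPool hands out small, stable integers that identify connections
// across polling intervals, so the index (not addr/port) can serve as a
// low-cardinality tag.
type indexPool struct {
	free   []int          // indices returned by closed connections
	next   int            // next fresh index when the free list is empty
	byConn map[string]int // connection key -> assigned index
}

func newIndexPool() *indexPool {
	return &indexPool{next: 1, byConn: map[string]int{}}
}

// acquire returns the index for conn, assigning the lowest freed index
// (or pool size + 1) the first time the connection is seen.
func (p *indexPool) acquire(conn string) int {
	if idx, ok := p.byConn[conn]; ok {
		return idx // same connection, same index, every interval
	}
	var idx int
	if len(p.free) > 0 {
		sort.Ints(p.free)
		idx = p.free[0]
		p.free = p.free[1:]
	} else {
		idx = p.next
		p.next++
	}
	p.byConn[conn] = idx
	return idx
}

// release returns conn's index to the pool once the connection goes away.
func (p *indexPool) release(conn string) {
	if idx, ok := p.byConn[conn]; ok {
		delete(p.byConn, conn)
		p.free = append(p.free, idx)
	}
}

func main() {
	p := newIndexPool()
	a := p.acquire("127.0.0.1:8086->192.168.0.1:63012")
	b := p.acquire("127.0.0.1:8086->192.168.0.1:63013")
	p.release("127.0.0.1:8086->192.168.0.1:63012")
	c := p.acquire("127.0.0.1:8086->192.168.0.1:63014") // reuses the freed index
	fmt.Println(a, b, c)                                // prints: 1 2 1
}
```

Note the caveat built into the design: once an index is reused, a given index value no longer refers to the same connection over the whole retention period, only between reuse boundaries.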
Seems like this would also relate to #3039
If we can use `/proc/net/tcp`, it will probably be cheaper than calling `netstat`; otherwise I guess we should try using the `ss` and `iproute2` utilities.
The connection index pool might work okay in place of addr/ports. I imagine many users would rather give up per-connection metrics in favor of per-process ones to reduce cardinality; maybe we start with this?
For dealing with PIDs, I feel like we need to do something fundamentally similar to what we have in procstat, but much better: you define a query and the name to map to it.
> If we can use `/proc/net/tcp`, it will probably be cheaper than calling `netstat`; otherwise I guess we should try using the `ss` and `iproute2` utilities.
I was just showing where you could see the numbers. I personally would detest telegraf shelling out to external utilities to gather this information.
> The connection index pool might work okay in place of addr/ports. I imagine many would rather give up per-connection metrics for per-process, in order to reduce cardinality; maybe we start with this?
For my use case I would not be able to use this. The objective is to know when there is congestion somewhere. If I have 999 clients with a 0-length buffer, and 1 client with a non-0-length buffer, any sort of average, percentile, etc, isn't going to indicate an issue.
@phemmer I'm planning to implement this and wanted to confirm my planned metric format... When enabling this feature, I would emit a new metric series of the form (line-protocol format)

```text
prostat_netstat,host=prash-laptop,pattern=influxd,process_name=influxd,user=root,proto=tcp,status=listen local_addr="127.0.0.1",local_port=8086u,remote_addr="192.168.0.1",remote_port=63012u,tx_queue=0u,rx_queue=0u,timeout=0u <timestamp>
```

Would that work for you? I plan to allow config filter settings for the protocol type and the state...
The problem with that format is going to be the key. If the application has 2 open sockets, they're going to overwrite each other.
That's what all this was about in the original report:

> The sticky part is how we want to monitor this, especially without causing cardinality explosion, since these buffers are tracked on a per-socket basis.
>
> My original thought was to make this part of the procstat input, but it's not a one-to-one relationship. And I don't like the idea of aggregating, as when there are multiple connections to various endpoints, only one of them may be an issue.
>
> So then the next thought is a measurement for all network connections, and then a field which contains the PID using the connection, plus fields for addrs/ports. But if PID, addrs & ports are fields not tags (preventing cardinality explosion), we don't have a tag which will let us perform grouping & aggregations in InfluxDB.
>
> My only current thought is a connection index pool. Basically a pool of numbers, and every time a new connection is seen, we grab a number from the pool (if the pool is empty, create a new number as size of pool + 1), and that uniquely identifies the connection across polling intervals. Once the connection goes away, telegraf returns that number to the pool.
@phemmer yeah I know, so you need to use the converter processor to choose the fields that should be tags. This is done to avoid the cardinality explosion. You still get multiple metrics, one per socket/connection, in Telegraf, but you need additional handling if you want to send that to e.g. InfluxDB. This can be aggregation, indexing as you suggest, or something else to make the metrics distinguishable...
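For reference, a converter processor snippet along those lines might look like this (measurement and field names taken from the proposed series above; note that promoting addrs/ports to tags reintroduces the per-socket cardinality concern, so this is a trade-off the user opts into):

```toml
[[processors.converter]]
  # only touch the proposed netstat series
  namepass = ["prostat_netstat"]
  [processors.converter.fields]
    # promote the connection endpoints from fields to tags
    tag = ["local_addr", "local_port", "remote_addr", "remote_port"]
```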
I don't know. I don't have a solution which makes me feel all warm and fuzzy. Even if the data sits in Telegraf without being de-duped, I don't know that there's much use to that. Telegraf doesn't have the advanced analysis and aggregation capabilities you get once the data is in a database of some sort. Yes, you could feed it through an external custom processor, but at that point I might question why Telegraf gathers the metrics at all: if you have to write a custom processor, why not have it gather the metrics itself and feed them to Telegraf?
Since I'm obviously not able to come to a decision, I'd say go for whatever you want.
Will try to implement that and then think about some kind of processor to do the indexing that you suggested earlier. I think that is a good idea anyway...
The fact that Telegraf does not squash the metrics is good, as your database might be capable of just inserting the raw data, e.g. into separate rows. The same goes for JSON output (and the like): you get the metrics as gathered, without dedup...
Thanks for your thoughts and comments! Very much appreciated!
@phemmer please test the binary in PR #15423 and let me know if this is what you intended!