Get network statistics per container
Some of our users would like to have some network statistics from containers, like:
- number of bytes
- number of packets
- errors/drops...
To get those metrics we can rely on /proc/
FYI, Oshi can already provide that information through NetworkIF: https://github.com/oshi/oshi/blob/master/oshi-core/src/main/java/oshi/hardware/NetworkIF.java#L207
so you could add them in: https://github.com/criteo/garmadon/blob/master/jvm-statistics/core/src/main/java/com/criteo/jvm/statistics/NetworkStatistics.java
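For reference, here is a minimal standalone sketch of what Oshi exposes, not tied to garmadon's NetworkStatistics class. It assumes a recent Oshi release where getNetworkIFs() returns a List and counters are refreshed with updateAttributes(); older 3.x/4.x releases return an array and use updateNetworkStats() instead.

```java
import java.util.List;
import oshi.SystemInfo;
import oshi.hardware.NetworkIF;

public class OshiNetworkSample {
    public static void main(String[] args) {
        SystemInfo si = new SystemInfo();
        // One NetworkIF per host interface: these counters are machine-wide,
        // not attributed to any particular process or container.
        List<NetworkIF> interfaces = si.getHardware().getNetworkIFs();
        for (NetworkIF net : interfaces) {
            net.updateAttributes(); // refresh counters (updateNetworkStats() on older Oshi versions)
            System.out.printf("%s rx_bytes=%d tx_bytes=%d rx_pkts=%d tx_pkts=%d in_err=%d out_err=%d%n",
                    net.getName(),
                    net.getBytesRecv(), net.getBytesSent(),
                    net.getPacketsRecv(), net.getPacketsSent(),
                    net.getInErrors(), net.getOutErrors());
        }
    }
}
```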
From my understanding this will provide metrics from the OS point of view, not per container, no?
@ashangit /proc/<pid>/net/dev
provides interface-level stats, NOT per-process ones!
So it doesn't change anything.
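To illustrate the point, here is a rough sketch (class name and output format are mine) of parsing that file: every row is keyed by interface name, and the counters are per network namespace, so reading it through /proc/<pid>/net/dev still gives no per-process attribution.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ProcNetDevSample {
    public static void main(String[] args) throws IOException {
        // /proc/net/dev (or /proc/<pid>/net/dev) has two header lines,
        // then one line per interface with receive columns followed by transmit columns.
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/dev"));
        for (String line : lines.subList(2, lines.size())) {
            String[] parts = line.trim().split("[:\\s]+");
            String iface = parts[0];
            long rxBytes = Long.parseLong(parts[1]);
            long rxPackets = Long.parseLong(parts[2]);
            long rxErrs = Long.parseLong(parts[3]);
            long rxDrop = Long.parseLong(parts[4]);
            long txBytes = Long.parseLong(parts[9]);
            long txPackets = Long.parseLong(parts[10]);
            System.out.printf("%s rx_bytes=%d rx_pkts=%d rx_errs=%d rx_drop=%d tx_bytes=%d tx_pkts=%d%n",
                    iface, rxBytes, rxPackets, rxErrs, rxDrop, txBytes, txPackets);
        }
    }
}
```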
AFAICT packets & errors are tracked at the interface level, so you cannot get them per process. You can get bytes received/sent in different ways, for example at the Hadoop level, but not packets or errors, since there is no information about which socket they are associated with.
Ok, my bad. So we need to find another way. We also can't get it from Hadoop, as we have more and more "non-JVM" containers (Python, TensorFlow...).
@ashangit It doesn't seem to be obvious. nethogs (https://github.com/raboof/nethogs) uses libpcap to decode packet headers and take the packet length, to work out the real-time bandwidth used by each process. But it cannot reconstruct history, so you would need to run it permanently to get the total quantity of bytes/packets received/sent. I don't think this is sustainable for our use case.
Honestly I don't see a solution to attribute network activity per process.
Well, after some research I may have found a solution in ss -tinp:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 10.0.2.15:22 10.0.2.2:57375 users:(("sshd",pid=9990,fd=3))
cubic rto:201 rtt:0.241/0.054 ato:40 mss:1460 cwnd:10 bytes_acked:94885 bytes_received:43212 segs_out:303 segs_in:459 send 484.6Mbps lastsnd:757714 lastrcv:756996 lastack:756996 pacing_rate 965.8Mbps rcv_rtt:311220 rcv_space:29532
ESTAB 0 0 10.0.2.15:22 10.0.2.2:56338 users:(("sshd",pid=7363,fd=3))
cubic rto:201 rtt:0.449/0.182 ato:47 mss:1460 cwnd:10 ssthresh:16 bytes_acked:899733 bytes_received:255916 segs_out:4888 segs_in:8044 send 260.1Mbps lastsnd:15 lastrcv:16 lastack:15 pacing_rate 519.8Mbps rcv_rtt:451431 rcv_space:78608
ESTAB 0 0 10.0.2.15:58400 89.30.125.167:25 users:(("telnet",pid=15543,fd=3))
cubic rto:205 rtt:4.11/2.055 ato:40 mss:1460 cwnd:10 bytes_acked:1 bytes_received:24 segs_out:3 segs_in:2 send 28.4Mbps lastsnd:531253 lastrcv:531235 lastack:531235 pacing_rate 56.4Mbps rcv_space:29200
ESTAB 0 0 10.0.2.15:22 10.0.2.2:50323 users:(("sshd",pid=2421,fd=3))
cubic rto:201 rtt:0.654/0.244 ato:40 mss:1460 cwnd:8 ssthresh:7 bytes_acked:3058309 bytes_received:1102444 segs_out:19837 segs_in:35236 send 142.9Mbps lastsnd:531234 lastrcv:531432 lastack:531233 pacing_rate 285.6Mbps retrans:0/4 rcv_rtt:240486 rcv_space:54912
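To make the idea concrete, here is a rough sketch (class name, regexes and output format are mine, not garmadon code) that shells out to ss -tinp and aggregates bytes_acked/bytes_received per PID, assuming the output layout shown above (socket line with users:(...,pid=...) followed by an indented TCP info line). It only sees sockets that are still open, so closed connections are lost, which feeds into the concern about having to poll it frequently on loaded servers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SsPerPidBytes {
    // Field names taken from the ss -tinp output pasted above; the exact
    // layout can vary between iproute2 versions, so treat this as a sketch.
    private static final Pattern PID = Pattern.compile("pid=(\\d+)");
    private static final Pattern ACKED = Pattern.compile("bytes_acked:(\\d+)");
    private static final Pattern RECEIVED = Pattern.compile("bytes_received:(\\d+)");

    public static void main(String[] args) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("ss", "-tinp").redirectErrorStream(true).start();
        Map<Long, long[]> perPid = new HashMap<>(); // pid -> {bytes acked (sent), bytes received}
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            long currentPid = -1;
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher pid = PID.matcher(line);
                if (pid.find()) {
                    // Socket line: remember which pid owns the next TCP info line.
                    // If several pids share the socket, only the first one is credited here.
                    currentPid = Long.parseLong(pid.group(1));
                    continue;
                }
                Matcher acked = ACKED.matcher(line);
                Matcher received = RECEIVED.matcher(line);
                boolean hasAcked = acked.find();
                boolean hasReceived = received.find();
                if (currentPid >= 0 && (hasAcked || hasReceived)) {
                    long[] totals = perPid.computeIfAbsent(currentPid, k -> new long[2]);
                    if (hasAcked) totals[0] += Long.parseLong(acked.group(1));
                    if (hasReceived) totals[1] += Long.parseLong(received.group(1));
                }
            }
        }
        process.waitFor();
        // Snapshot of currently open TCP sockets only; counters reset when sockets close.
        perPid.forEach((pid, totals) ->
                System.out.printf("pid=%d bytes_acked=%d bytes_received=%d%n", pid, totals[0], totals[1]));
    }
}
```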
Looks like a good starting point; I just have some concerns about the impact it could have on loaded servers. Let's discuss it IRL next week.