cf-networking-release
Apps network usage metrics missing
Issue
Follow-up to this Slack discussion.
Unlike CPU, memory and disk, there is no metric available with information about network (how much data is being sent/received by the container).
Context
Silk release lets operators limit the amount of network bandwidth an application can use from the host (diego-cell), but there is no way to monitor whether applications are reaching the limit and being throttled.
Regardless of the limit being in place or not, I think knowing how much bandwidth is used by each application is useful and should be part of the already existing container metrics.
Steps to Reproduce
Look at data available in ContainerMetric
Expected result / Possible Fix
I would expect to see information about the network. Either a counter of sent / received bytes or an average bandwidth usage for the last period.
There was an implementation of the first option in the Garden API, but it was abandoned due to the switch to cf-networking (see Slack).
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/164866818
The labels on this github issue will be updated when the story is started.
Hi @michaelgrifalconi,
We took a look at the metrics that Silk currently provides, which we found in poller.go of netmon. However, these are container-to-container traffic metrics, not external ones. In addition, they're only provided per diego cell and not on a per-application basis. Maybe this can help you for now, until this story is prioritized.
We've updated the tracker story and included some acceptance criteria. Would you mind reviewing them to let us know if we forgot anything?
Acceptance Criteria
As an operator, I want to be able to view network metrics so that I can monitor whether applications are reaching the limit and being throttled.
Given a deployment of Cloud Foundry
When I look at Explore DataDog metrics for tx_bytes
Then I see graphs for each running application
And when I send traffic out of the application
Then I see it increase in DataDog
Given a deployment of Cloud Foundry
When I look at Explore DataDog metrics for rx_bytes
Then I see graphs for each running application
And when I send traffic into the application
Then I see it increase in DataDog
Given a deployment of Cloud Foundry with ASGs that restrict outgoing traffic
When I look at Explore DataDog metrics for tx_dropped
Then I see graphs for each running application
And when I send traffic out of the application
Then I see it increase in DataDog
Given a deployment of Cloud Foundry with throttling
When I look at Explore DataDog metrics for rx_dropped
Then I see graphs for each running application
And when I send traffic into the application
Then I see it increase in DataDog
Thanks! @ameowlia and @barakyo, CF Networking Program Members
Thank you for creating the story! I think it looks fine and describes the topic well.
Best, Michael
We would be very interested in this sort of thing as well. We want to be able to know which apps/instances are showing high network usage across the whole environment, not only to limit them but also to see whether some apps are using more than they should.
The limits on bandwidth (rate, burst) are also very heavy-handed, as they apply to ALL instances system-wide. Perhaps it would be good to be able to limit things more finely, by org/space.
I'm also looking for this (https://github.com/cloudfoundry/silk-release/issues/16), and I would like to add that some IaaS providers may charge for traffic volume, so these statistics would be vital for sharing costs among app owners. Maybe it would be good to have this as (or inside) an app usage event as well, so that it can be directly integrated into existing billing mechanisms.
Hey folks, we're adding the help wanted label on this issue, as it is something we think makes sense but doesn't fit into our near-term priorities. If anyone is interested in picking this up and wants to chat about implementation ideas, please share your thoughts here or find us in the #networking channel on CloudFoundry Slack.
Silk provides metrics about data sent by containers. These are iptables metrics for the rules that configure security groups for containers, which control outgoing traffic and incoming traffic from other containers. The burst and rate limit configuration options you mentioned control outgoing traffic only. All other incoming traffic is not controlled by Silk. Please let us know whether these metrics are enough.
@mariash Are those container metrics like cpu/memory/disk/log-rate? I think the ones you're linking to are system component metrics.
@mkocher you are right, this is not for each container. We use this for the whole silk-vtep interface. I guess this package can be used to obtain data for each container interface.
@mariash Thanks for sharing all the information in the slack
I would like to bring up the following topic for discussion: please let me know your feedback on whether it makes sense.
Do you think it makes sense to expose network metrics via usage events [1]? Currently, only memory metrics are available there, and they can be read only when start, stop, etc. events occur. For network metrics, however, we need real-time usage data. Would it therefore be sensible to send events from the container at a configured interval to obtain the network's actual usage?
The reason I ask is that, at this point, we bill based on the memory metrics read from usage events.
However, I don’t see other metrics like CPU and disk being captured via usage events. Instead, these metrics end up in the loggregator stack. Considering this, shouldn’t network usage metrics also be included in the loggregator?
[1] http://v3-apidocs.cloudfoundry.org/version/3.138.0/index.html#app-usage-events
@JVecsei1 @geigerj0 - What levels of cross-release compatibility have been tested on these changes? Looks like we'd want to validate forwards/backwards compatibility between cf-networking, diego, and garden-runc releases.
@geofffranks Can this issue be closed now?
/cc @JVecsei1 @geigerj0
Ah, yes. Sure can.
This is available by enabling the garden.enable_container_network_metrics property in conjunction with cf-networking-release 3.33.0, diego-release 2.82.0, and garden-runc-release 1.38.0. cf-deployment v32.4.0+ contains all of these releases.