cf-networking-release

Apps network usage metrics missing

Open michaelgrifalconi opened this issue 5 years ago • 6 comments

Issue

Follow-up to this Slack discussion

Unlike CPU, memory and disk, there is no metric available for network usage (how much data the container sends and receives).

Context

Silk release lets operators limit the amount of network bandwidth an application can use from the host (diego-cell), but there is no way to monitor whether applications are reaching the limit and being throttled.

Regardless of the limit being in place or not, I think knowing how much bandwidth is used by each application is useful and should be part of the already existing container metrics.

Steps to Reproduce

Look at data available in ContainerMetric

Expected result / Possible Fix

I would expect to see information about the network. Either a counter of sent / received bytes or an average bandwidth usage for the last period.

There was an implementation of the first option on the Garden API, but it was abandoned due to the switch to cf-networking (see Slack).
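The two options above are related: an average bandwidth can be derived from two readings of a cumulative byte counter. The following is a minimal sketch of that arithmetic, not the Garden or cf-networking implementation. The `Sample` shape and the `average_bandwidth` function are hypothetical names, and treating a counter that goes backwards as a reset is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Sample:
    """One reading of cumulative byte counters (hypothetical shape)."""
    timestamp: float  # seconds since an arbitrary epoch
    rx_bytes: int     # total bytes received so far
    tx_bytes: int     # total bytes sent so far


def _delta(prev: int, curr: int) -> int:
    # Assumption: a counter that goes backwards was reset (for example,
    # after a container restart), so only the bytes counted since the
    # reset are attributed to this interval.
    return curr - prev if curr >= prev else curr


def average_bandwidth(prev: Sample, curr: Sample) -> Tuple[float, float]:
    """Return (rx, tx) average bytes per second between two samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    rx = _delta(prev.rx_bytes, curr.rx_bytes) / elapsed
    tx = _delta(prev.tx_bytes, curr.tx_bytes) / elapsed
    return rx, tx
```

Exposing the raw counter keeps the emitter simple and leaves the averaging window to whoever consumes the metric.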

michaelgrifalconi avatar Mar 25 '19 14:03 michaelgrifalconi

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/164866818

The labels on this github issue will be updated when the story is started.

cf-gitbot avatar Mar 25 '19 14:03 cf-gitbot

Hi @michaelgrifalconi,

We took a look at the metrics that Silk currently provides, which we found in poller.go of netmon. However, these are container-to-container traffic metrics and not external ones. In addition, they're only provided per diego cell and not on a per-application basis. Maybe this can help you for now, until this story is prioritized.

We've updated the tracker story and included some acceptance criteria. Would you mind reviewing them to let us know if we forgot anything?

Acceptance Criteria

As an operator
I want to be able to view network metrics
So that I can monitor whether applications are reaching the limit and being throttled.

Given a deployment of Cloud Foundry
When I look at Explore DataDog metrics for tx_bytes
Then I see graphs for each running application
And when I send traffic out of the application
Then I see it increase in DataDog

Given a deployment of Cloud Foundry
When I look at Explore DataDog metrics for rx_bytes
Then I see graphs for each running application
And when I send traffic into the application
Then I see it increase in DataDog

Given a deployment of Cloud Foundry with ASGs that restrict outgoing traffic
When I look at Explore DataDog metrics for tx_dropped
Then I see graphs for each running application
And when I send traffic out of the application
Then I see it increase in DataDog

Given a deployment of Cloud Foundry with throttling
When I look at Explore DataDog metrics for rx_dropped
Then I see graphs for each running application
And when I send traffic into the application
Then I see it increase in DataDog

Thanks! @ameowlia and @barakyo, CF Networking Program Members

barakyo avatar Apr 29 '19 18:04 barakyo

Thank you for creating the story! I think it looks fine and describes the topic well.

Best, Michael

michaelgrifalconi avatar Apr 30 '19 06:04 michaelgrifalconi

We would be very interested in this sort of thing as well. We want to know which apps/instances show high network usage across the whole environment, not only to limit them but also to see whether some apps are using more than they should.

The limits on bandwidth (rate, burst) are also very heavy-handed, as they apply to ALL instances system-wide. It would perhaps be good to be able to limit things by org/space.

andrew-edgar avatar May 02 '19 17:05 andrew-edgar

I'm also looking for this (https://github.com/cloudfoundry/silk-release/issues/16), and I would like to add that some IaaS providers may charge for traffic volume, so these statistics would be vital for sharing costs amongst app owners. Maybe it would be good to have this as (or inside) an app usage event as well, so that it can be directly integrated into existing billing mechanisms.
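To illustrate the cost-sharing idea, here is a minimal sketch that splits a traffic bill in proportion to each app's transferred bytes. It assumes per-app byte totals are already available for the billing period. The function name and input shape are hypothetical and do not correspond to any Cloud Foundry API.

```python
from typing import Dict


def share_traffic_cost(total_cost: float, bytes_by_app: Dict[str, int]) -> Dict[str, float]:
    """Split total_cost across apps in proportion to their byte counts."""
    total_bytes = sum(bytes_by_app.values())
    if total_bytes == 0:
        # No traffic recorded: nothing to attribute to any app.
        return {app: 0.0 for app in bytes_by_app}
    return {
        app: total_cost * app_bytes / total_bytes
        for app, app_bytes in bytes_by_app.items()
    }
```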

ionphractal avatar Jul 09 '19 07:07 ionphractal

Hey folks, we're adding the help wanted label to this issue, as it is something we think makes sense but doesn't fit into our near-term priorities. If anyone is interested in picking this up and wants to chat about implementation ideas, please share your thoughts here or find us in the #networking channel on CloudFoundry Slack.

mcwumbly avatar Apr 27 '20 18:04 mcwumbly

Silk provides metrics about data sent by containers. These are iptables metrics, taken from the rules that configure security groups for containers and control outgoing traffic and incoming traffic from other containers. The burst and rate limit configuration options you mentioned control outgoing traffic only. All other incoming traffic is not controlled by Silk. Please let us know whether these metrics are enough.

mariash avatar May 15 '23 19:05 mariash

@mariash Are those container metrics like cpu/memory/disk/log-rate? I think the ones you're linking to are system component metrics.

mkocher avatar May 15 '23 20:05 mkocher

@mkocher You are right, this is not for each container. We use this for the whole silk-vtep interface. I guess this package can be used to obtain data for each container interface.
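As a rough illustration of reading counters per interface, the sketch below assumes a Linux host where the kernel exposes them under /sys/class/net/<interface>/statistics. It is not the package referred to above, nor the eventual implementation. The function name is hypothetical, and the `sysfs_root` parameter exists only so the code can be exercised against a test directory.

```python
from pathlib import Path
from typing import Dict

# Counter files named after the metrics in the acceptance criteria.
COUNTERS = ("rx_bytes", "tx_bytes", "rx_dropped", "tx_dropped")


def read_interface_counters(interface: str, sysfs_root: str = "/sys/class/net") -> Dict[str, int]:
    """Read cumulative counters for one network interface from sysfs."""
    stats_dir = Path(sysfs_root) / interface / "statistics"
    # Each file holds a single decimal integer followed by a newline.
    return {name: int((stats_dir / name).read_text().strip()) for name in COUNTERS}
```

Note that on the host side of a veth pair the directions are mirrored: bytes the container sends show up as rx_bytes on the host-side interface.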

mariash avatar May 15 '23 23:05 mariash

@mariash Thanks for sharing all the information in Slack.

I would like to bring up the following topic for discussion: please let me know your feedback on whether it makes sense.

Do you think it makes sense to expose network metrics via usage events [1]? Currently, only memory metrics are available, and they can be read only when start, stop, etc. events occur. For network metrics, however, we need real-time usage data. Would it therefore be sensible to send events from the container at a configured interval to obtain the network's actual usage?

The reason I ask is that, at this point, we bill based on the memory metrics read from usage events.

However, I don’t see other metrics like CPU and disk being captured via usage events. Instead, these metrics end up in the loggregator stack. Considering this, shouldn’t network usage metrics also be included in the loggregator?

[1] http://v3-apidocs.cloudfoundry.org/version/3.138.0/index.html#app-usage-events

gowrisankar22 avatar May 16 '23 18:05 gowrisankar22

@JVecsei1 @geigerj0 - What levels of cross-release compatibility have been tested on these changes? Looks like we'd want to validate forwards/backwards compatibility between cf-networking, diego, and garden-runc releases.

geofffranks avatar Aug 02 '23 14:08 geofffranks

@geofffranks Can this issue be closed now?

/cc @JVecsei1 @geigerj0

gowrisankar22 avatar Aug 31 '23 12:08 gowrisankar22

Ah, yes. Sure can.

This is available by enabling the garden.enable_container_network_metrics property in conjunction with cf-networking-release 3.33.0, diego-release 2.82.0, and garden-runc-release 1.38.0. cf-deployment v32.4.0+ contains all of these releases.

geofffranks avatar Aug 31 '23 13:08 geofffranks