
ECS task metadata API `stats` endpoint undercounts bytes transmitted/received on EC2 with `awsvpc` network mode

Open · isker opened this issue 7 months ago · 6 comments

I used this CDK stack to launch two ECS tasks, one running on Fargate and one on EC2 (with awsvpc network mode, to best mirror the Fargate task). Each task runs an alpine container that I can ECS Exec into to get a shell.

In each task, I installed iperf3 and jq with apk add iperf3 jq. I then used iperf3 to send one gigabyte of TCP traffic from one task to the other, and vice versa: the receiving task runs iperf3 -s to start a server, and the sending task runs iperf3 -c $OTHER_TASK_IPV4_ADDR -n 1G to send 1GB to it.

In each task, I then printed the alpine container's network stats using the ECS task metadata API: wget -q -O- ${ECS_CONTAINER_METADATA_URI_V4}/stats | jq .networks.

The Fargate task reports reasonable numbers, namely a bit over 1GB transmitted and received (we have of course done other things in these tasks beyond just invoking iperf3):

# wget -q -O- ${ECS_CONTAINER_METADATA_URI_V4}/stats | jq .networks
{
  "eth1": {
    "rx_bytes": 1096752652,
    "rx_packets": 159456,
    "rx_errors": 0,
    "rx_dropped": 0,
    "tx_bytes": 1083080110,
    "tx_packets": 139570,
    "tx_errors": 0,
    "tx_dropped": 0
  }
}

The EC2 task does not report the expected numbers:

# wget -q -O- ${ECS_CONTAINER_METADATA_URI_V4}/stats | jq .networks
{
  "eth0": {
    "rx_bytes": 361389142,
    "rx_packets": 46892,
    "rx_errors": 0,
    "rx_dropped": 0,
    "tx_bytes": 361363480,
    "tx_packets": 51774,
    "tx_errors": 0,
    "tx_dropped": 0
  }
}

That is ~361MB transmitted and received, a substantial undercount.

The discrepancy in this synthetic test matches what we have observed from the same API endpoint for real services we run in ECS. We have been comparing Fargate and EC2 to determine which is the better place to run our workloads, and we rely on these container stats (as consumed by the Prometheus ECS exporter) to monitor our services, so the meaningfully incorrect EC2 data makes that comparison a challenge. Please fix the EC2 network stats.

isker · May 04 '25 03:05

I repeated the experiment with the EC2 task using the bridge network mode instead of awsvpc. This time its numbers look good:

# wget -q -O- ${ECS_CONTAINER_METADATA_URI_V4}/stats | jq .networks
{
  "eth0": {
    "rx_bytes": 1081398828,
    "rx_packets": 68893,
    "rx_errors": 0,
    "rx_dropped": 0,
    "tx_bytes": 1078211621,
    "tx_packets": 48198,
    "tx_errors": 0,
    "tx_dropped": 0
  }
}

So it seems that the bug is specifically with awsvpc on EC2. I will update the issue title to specify this.

isker · May 04 '25 03:05

Thanks for providing detailed repro steps and reporting this issue. One thing I notice is that we appear to have some special handling of network stats in our stats engine when using awsvpc network mode: https://github.com/aws/amazon-ecs-agent/blob/master/agent/stats/engine.go#L1013

This likely has something to do with the fact that we can't get our network stats directly from docker when using awsvpc network mode; instead, we get them via this getAWSVPCNetworkStats function. My guess is that something is wrong with the way we are counting metrics in this getNetworkStatistics function.
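(For anyone following along, here is a rough illustration of what reading interface counters straight from the kernel looks like. It is a generic /proc/net/dev parser I'm sketching for context, not the agent's actual getAWSVPCNetworkStats; the agent may gather these counters differently.)

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// ifaceStats holds the two counters this issue is about.
type ifaceStats struct {
	rxBytes, txBytes uint64
}

// readProcNetDev parses /proc/net/dev, one place the kernel exposes
// per-interface counters. Run inside a task's awsvpc network namespace,
// it yields the same kind of rx/tx byte counts the metadata endpoint
// reports (illustration only, not the agent's implementation).
func readProcNetDev(path string) (map[string]ifaceStats, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	stats := make(map[string]ifaceStats)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		colon := strings.Index(line, ":")
		if colon < 0 {
			continue // skip the two header lines
		}
		name := strings.TrimSpace(line[:colon])
		fields := strings.Fields(line[colon+1:])
		if len(fields) < 9 {
			continue
		}
		rx, _ := strconv.ParseUint(fields[0], 10, 64) // receive bytes
		tx, _ := strconv.ParseUint(fields[8], 10, 64) // transmit bytes
		stats[name] = ifaceStats{rxBytes: rx, txBytes: tx}
	}
	return stats, sc.Err()
}

func main() {
	stats, err := readProcNetDev("/proc/net/dev")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for name, s := range stats {
		fmt.Printf("%s: rx_bytes=%d tx_bytes=%d\n", name, s.rxBytes, s.txBytes)
	}
}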

sparrc · May 07 '25 23:05

I see you are using AL2023 ARM for the CDK stack. Have you tried testing on AL2? Do you see the same behavior there?

sparrc · May 07 '25 23:05

Thanks for looking into this. AL2023+ARM is what we'd be using in production, so it's what I reproduced with. I will try with AL2.

isker · May 08 '25 01:05

AL2+ARM demonstrates the same problem as AL2023+ARM on awsvpc.

isker · May 08 '25 04:05

Let me know if you need any more information.

isker · May 13 '25 01:05

@sparrc my intuition would be to just delete the division here:

https://github.com/aws/amazon-ecs-agent/blob/b3258c6f0a26c5b1b7d8a74ca191dca1a2e6fb55/agent/stats/task_linux.go#L107-L114

In the CDK stack I linked in the description, two containers are added, but there is also the magic extra ~pause container that gets added to every awsvpc task. If you multiply the incorrect EC2 numbers by 3, they suddenly look right.
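To make the arithmetic concrete: 361,363,480 × 3 = 1,084,090,440 bytes transmitted, which is right in line with the ~1.08 GB the Fargate task reports for the same workload. A minimal sketch of the suspected split, with hypothetical names (not the agent's actual code):

package main

import "fmt"

type netStats struct {
	rxBytes, txBytes uint64
}

// splitPerContainer is a stand-in for the suspected division in
// task_linux.go: the task-wide counters from the shared awsvpc network
// namespace are divided evenly across every container in the task,
// including the ~pause container.
func splitPerContainer(task netStats, numContainers uint64) netStats {
	return netStats{
		rxBytes: task.rxBytes / numContainers,
		txBytes: task.txBytes / numContainers,
	}
}

func main() {
	// Roughly what the task as a whole moved during the iperf3 run
	// (3x the reported per-container figures, by construction).
	taskTotal := netStats{rxBytes: 1084167426, txBytes: 1084090440}
	// Two containers from the task definition plus the ~pause container.
	perContainer := splitPerContainer(taskTotal, 3)
	fmt.Printf("per-container: rx=%d tx=%d\n", perContainer.rxBytes, perContainer.txBytes)
	// Prints rx=361389142 tx=361363480, matching the EC2 awsvpc numbers above.
}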

What do you think?

isker · Aug 22 '25 19:08