[Bug] - Performance much lower than AL2

Open dangee1705 opened this issue 1 month ago • 3 comments

Describe the bug

We are running a set of Apache + PHP webservers on an ECS cluster backed by an Auto Scaling group of t3.micro instances. For several months we have been running one of the ECS-optimised AMIs provided by AWS, specifically amzn2-ami-ecs-kernel-5.10-hvm-2.0.20250515-x86_64-ebs. Last week we upgraded the AMI to al2023-ami-ecs-hvm-2023.0.20251108-kernel-6.1-x86_64 and immediately saw a measurable drop in API performance, as measured by CPU usage and response time.

To Reproduce

Run an Apache server on each of the two AMIs and measure response time and CPU usage.
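A minimal sketch of the response-time half of that comparison, assuming the server is reachable over plain HTTP from wherever the script runs. The function names (`time_request`, `summarise`, `measure`) and the default request count are made up for illustration; nothing here comes from the reporter's setup, and CPU usage would still need to be read separately (for example from the instance metrics).

```python
import statistics
import time
import urllib.request


def time_request(url: str, timeout: float = 10.0) -> float:
    """Return wall-clock seconds for one GET, including reading the body."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def summarise(samples_s: list[float]) -> dict[str, float]:
    """Summarise latency samples (given in seconds) as milliseconds."""
    if not samples_s:
        raise ValueError("no samples")
    ordered = sorted(samples_s)
    # Nearest-rank 95th percentile; integer maths avoids float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean_ms": statistics.fmean(ordered) * 1000,
        "median_ms": statistics.median(ordered) * 1000,
        "p95_ms": ordered[rank - 1] * 1000,
        "max_ms": ordered[-1] * 1000,
    }


def measure(url: str, count: int = 200) -> dict[str, float]:
    """Issue `count` sequential requests (hypothetical default) and summarise."""
    return summarise([time_request(url) for _ in range(count)])
```

Calling `measure(...)` with the same URL path against a task on each AMI, from the same client, would give two summaries that can be compared directly. The requests are sequential on purpose, so the numbers reflect per-request latency rather than behaviour under concurrency.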

Expected behavior

Performance should not degrade when moving from the AL2 AMI to the AL2023 AMI.

Screenshots

[Image: upgrade applied on the 13th, reverted on the 16th]

[Image: from a target group of one of the API services running on the cluster; notice the huge increase in response time]

Additional context

Please let me know if any additional information is required. Thanks in advance.

dangee1705, Nov 18 '25 09:11

Are the requests over https? Does the PHP application use any specific modules/libraries or connect to remote resources? Do you have any resource constraints set on your ECS containers? Is the container also AL2023?

stewartsmith, Nov 18 '25 18:11

Requests are over HTTPS, but that is terminated at the Application Load Balancer, then it is just HTTP through to the tasks. PHP is just using the default PHP PDO library to connect to our DB (MySQL Aurora RDS), and other than that the AWS PHP SDK to access other AWS services, e.g. S3, SNS and Lambda, depending on the endpoint. As for resource constraints, we are running on t3.micro instances, one task per instance, with I believe 800 MiB RAM and the full 2 vCPUs. The container is running Ubuntu 24.04 LTS. Thanks

dangee1705, Nov 18 '25 20:11

Just to make sure we understand the context correctly... All of the application-level code is running inside the Ubuntu container, which is identical between the good and bad cases, the main difference being the "host" operating system, right? That would tend to point towards the 6.1 kernel (and associated container-related devices... is it using veth to bridge networking into the containers? Sorry, I'm not super familiar with how ECS works).
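One way to answer the veth question from the host is to look at sysfs. The sketch below assumes a Linux host where bridge devices expose `bridge` and `brif` directories under `/sys/class/net`; the `veth` name prefix is only the usual Docker naming convention, so treat that filter as a heuristic. The function names are hypothetical.

```python
from pathlib import Path


def bridge_ports(sys_net: Path = Path("/sys/class/net")) -> dict[str, list[str]]:
    """Map each bridge device to the interfaces attached to it."""
    topology: dict[str, list[str]] = {}
    if not sys_net.is_dir():
        return topology
    for iface in sorted(sys_net.iterdir()):
        # Bridge devices have a "bridge" directory; their ports are under "brif".
        if (iface / "bridge").is_dir():
            brif = iface / "brif"
            ports = sorted(p.name for p in brif.iterdir()) if brif.is_dir() else []
            topology[iface.name] = ports
    return topology


def veth_ports(topology: dict[str, list[str]]) -> list[str]:
    """Return attached ports whose names look like veth devices (heuristic)."""
    return sorted(
        port
        for ports in topology.values()
        for port in ports
        if port.startswith("veth")
    )
```

If `veth_ports(bridge_ports())` comes back non-empty on a host with a running task, container traffic is most likely crossing a veth pair and a bridge. An empty result would suggest a different network mode, which is worth knowing before comparing kernels.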

Do you have a way to set up some kind of repro environment (with a synthetic/test workload), i.e. non-production, to experiment with things a bit?

It would be useful to see whether the instance memory affects performance, and also to compare kernel 6.1 vs 6.12. I've asked our kernel engineers to have a look as well and see if they can suggest something, but if you could provide us with some kind of minimal repro case (maybe a dummy web server in the container) that we could use to diagnose, that would definitely speed things up.
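As a starting point for such a repro case, here is a rough sketch of a dummy web server using only the Python standard library. It is a stand-in, not the reporter's Apache + PHP stack: each request does a fixed amount of SHA-256 hashing and reports the host kernel release, so the same container can be run unchanged on the 5.10, 6.1 and 6.12 hosts. `WORK_ROUNDS`, `burn_cpu`, `ReproHandler` and `start_server` are illustrative names and values, not anything taken from this thread.

```python
import hashlib
import platform
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

WORK_ROUNDS = 20000  # hypothetical amount of CPU work per request


def burn_cpu(rounds: int = WORK_ROUNDS) -> str:
    """Do deterministic CPU-bound work: chain SHA-256 over a fixed seed."""
    digest = b"seed"
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


class ReproHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # platform.release() reports the host kernel, even inside a container.
        body = f"kernel={platform.release()} digest={burn_cpu()}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: object) -> None:
        pass  # keep per-request logging out of the measurement


def start_server(host: str = "0.0.0.0", port: int = 8080) -> ThreadingHTTPServer:
    """Start the server on a background thread and return it."""
    server = ThreadingHTTPServer((host, port), ReproHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the work per request is fixed and involves no database or AWS SDK calls, any latency difference between hosts would point at the host side (kernel, container networking, memory pressure) rather than at application dependencies. If this synthetic workload shows no difference, that is also useful: it would suggest the regression depends on something the real application does that this sketch leaves out.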

ozbenh, Nov 19 '25 22:11