elastic-ci-stack-for-aws
CreateLogGroup service limits
The elastic stack exports logs to CloudWatch Logs. The official AWS CloudWatch Logs exporter appears to call CreateLogGroup for each exported log group on each host as it boots, and for some customers this is leading to hitting service limits and failing elastic stack creation.
It looks like we're using the awslogs CloudWatch Logs agent: https://github.com/buildkite/elastic-ci-stack-for-aws/blob/v4-development/packer/scripts/install-awslogs.sh https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/UsePreviousCloudWatchLogsAgent.html and configuring it to push files up to log groups in a fairly conventional way: https://github.com/buildkite/elastic-ci-stack-for-aws/blob/v4-development/packer/conf/awslogs/awslogs.conf
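For context, an awslogs.conf stanza maps one file to one log group, so each configured group below implies its own CreateLogGroup on a fresh account. A typical stanza looks roughly like this (the exact file paths and formats in the repo's awslogs.conf may differ; this is an illustrative sketch of the agent's config format, not a copy of ours):

```ini
[/var/log/buildkite-agent.log]
file = /var/log/buildkite-agent.log
log_group_name = /buildkite/buildkite-agent
log_stream_name = {instance_id}
datetime_format = %Y-%m-%d %H:%M:%S
```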
We're exporting groups including:
- /buildkite/buildkite-agent
- /buildkite/cfn-init
- /buildkite/cloud-init
- /buildkite/cloud-init/output
- /buildkite/docker-daemon
- /buildkite/elastic-stack
- /buildkite/elastic-stack-init
- /buildkite/lifecycled
- /buildkite/system
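Back-of-the-envelope, the call volume scales as groups × booting hosts, which is why a scale-up event can burn through the quota even though each host only does this once. The host count here is an assumed example, not a measured figure:

```python
# Illustrative arithmetic: one CreateLogGroup call per configured group,
# per host, at boot time.
log_groups = 9       # the groups listed above
hosts_booting = 100  # assumed: a single large autoscaling event
calls = log_groups * hosts_booting
print(calls)  # 900 CreateLogGroup calls in one burst
```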
The reference docs suggest that this will only create the log group if it doesn't exist: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html
The implementation, which has helpfully been mirrored here, appears to always call create and swallow the error: https://github.com/jinty/awscli-cwlogs-debian/blob/c8e4a1d5a0d9ec771581967e4de63407b8d0e9ac/cwlogs/push.py#L1314-L1324
But the call is still made, so it counts against the service limits.
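To make the distinction concrete, here's a sketch (not the agent's actual code) contrasting the create-and-swallow pattern with a check-first pattern that skips the CreateLogGroup call when the group already exists. `FakeLogsClient` is a hypothetical stand-in for the CloudWatch Logs API client, counting calls against the quota:

```python
class ResourceAlreadyExistsException(Exception):
    pass

class FakeLogsClient:
    """Hypothetical stand-in for a CloudWatch Logs client."""
    def __init__(self, existing):
        self.existing = set(existing)
        self.create_calls = 0  # calls that count against the CreateLogGroup quota

    def create_log_group(self, logGroupName):
        self.create_calls += 1  # the API call happens even if it then errors
        if logGroupName in self.existing:
            raise ResourceAlreadyExistsException(logGroupName)
        self.existing.add(logGroupName)

    def describe_log_groups(self, logGroupNamePrefix):
        return {"logGroups": [{"logGroupName": g} for g in self.existing
                              if g.startswith(logGroupNamePrefix)]}

def ensure_group_swallow(client, name):
    # What push.py appears to do: always call CreateLogGroup, ignore the error.
    try:
        client.create_log_group(logGroupName=name)
    except ResourceAlreadyExistsException:
        pass

def ensure_group_check_first(client, name):
    # Check-first variant: only call CreateLogGroup when the group is missing.
    groups = client.describe_log_groups(logGroupNamePrefix=name)["logGroups"]
    if not any(g["logGroupName"] == name for g in groups):
        client.create_log_group(logGroupName=name)

client = FakeLogsClient(existing=["/buildkite/system"])
ensure_group_swallow(client, "/buildkite/system")      # consumes quota anyway
ensure_group_check_first(client, "/buildkite/system")  # no create call needed
print(client.create_calls)  # 1
```

Note the check-first variant trades a CreateLogGroup call for a DescribeLogGroups call, which has its own (separate) rate limit.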
Relevant?
https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-cloudwatch-agent-now-open-source-and-included-with-amazon-linux-2/
FWIW, I just ran into this, which may be the same issue?

@albertywu We've just merged #811 which we believe might help, though we haven't confirmed that directly.
Buildkite runs the latest master branch of the elastic stack, so we'll dogfood the change; however, we don't run enough agents to hit the CreateLogGroup quota. I wonder if you have a way to test the new agent prior to us releasing a new version of the stack (probably 5.3.0), so we can confirm the issue is resolved?