cdk-monitoring-constructs
[batch] Add Support for AWS Batch
Feature scope
AWS Batch
Describe your suggested feature
Feature request is for an AWS Batch Monitoring construct
Do you have particular alarms and dashboard widgets that you think would make sense for Batch users?
The most basic requirement would be widgets which show the number of Batch Job instances in any given status (SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING, SUCCEEDED, FAILED) for a given Job Queue or Job Definition.
However, I do understand this would likely be a large effort given that these metrics are currently not even sent to CloudWatch (i.e. there's no Batch CW namespace, so no native metrics or CW integration). I have seen this solved before via EventBridge rules which route Batch Job State Change events to an SNS Topic target; from there you can track the "NumberOfMessagesPublished" metric in the AWS/SNS namespace. This is somewhat of a heuristic, though, as it tells you how many jobs entered a given state during a period as opposed to how many jobs are in a given state. Regardless, it would be nice to have a construct that takes care of all this heavy lifting for you via .monitorBatchJob(..). It would also be nice to add a dimension of EC2 Instance Type, so you can see how workloads are spread across the instances configured on the Batch ComputeEnvironment.
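To make the EventBridge idea concrete, here is a minimal sketch of the event pattern such a rule might use. The helper name `batchStateChangePattern` and the ARN are placeholders invented for illustration, and filtering on `detail.jobQueue` assumes the event detail carries the job queue ARN.

```typescript
// Job statuses named in this issue.
type BatchJobStatus =
  | "SUBMITTED" | "PENDING" | "RUNNABLE" | "STARTING"
  | "RUNNING" | "SUCCEEDED" | "FAILED";

// Shape of an EventBridge event pattern for Batch state changes.
interface BatchStateChangePattern {
  source: string[];
  "detail-type": string[];
  detail: { status: BatchJobStatus[]; jobQueue?: string[] };
}

// Hypothetical helper: matches jobs entering `status`, optionally
// scoped to a single job queue ARN (assumed to be in `detail.jobQueue`).
function batchStateChangePattern(
  status: BatchJobStatus,
  jobQueueArn?: string,
): BatchStateChangePattern {
  const detail: BatchStateChangePattern["detail"] = { status: [status] };
  if (jobQueueArn !== undefined) {
    detail.jobQueue = [jobQueueArn];
  }
  return {
    source: ["aws.batch"],
    "detail-type": ["Batch Job State Change"],
    detail,
  };
}

// One rule per status gives one "jobs entered state X" count per status.
const failedPattern = batchStateChangePattern(
  "FAILED",
  "arn:aws:batch:us-east-1:111122223333:job-queue/example-queue", // placeholder ARN
);
console.log(JSON.stringify(failedPattern, null, 2));
```

A rule built from this pattern still only counts transitions into a state, so the heuristic caveat above applies unchanged.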
Beyond that, it would be nice to have basic CPU/GPU (mem/util) metric widgets from the nodes on the underlying ECS/EKS cluster powering the Batch ComputeEnvironment.
I've built something like this within my team, but unfortunately it's not clear to me how to contribute something that's backed by a custom Lambda to this repo, as everything seems to rely on AWS exposing the metrics.
You basically have 2 ways of getting metrics for AWS Batch:
- consuming the events they publish (basically just Batch State transition, which excludes SUBMITTED)
  - to get SUBMITTED I added a separate EventBridge rule that listens to CloudTrail SubmitJob API calls that were successful
  - lightweight and real-time, but as you say - you can't see aggregations like "Total number of jobs in state X at this time"
- running a scheduled Lambda, that does API calls listing the jobs in each queue, and publishing some stats
  - scheduled + the most frequent EventBridge can trigger at is once per minute
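Both approaches can be sketched roughly as follows. This is an illustration under assumptions, not the Lambda described above: the function names are made up, treating a missing `errorCode` field as "successful" is an assumption, and the polling half only shows the aggregation step, taking job summaries that were already fetched (for example via ListJobs, one call per status).

```typescript
// (1) Event-based: pattern for SubmitJob calls recorded by CloudTrail.
// Assumption: a call is "successful" when the event has no errorCode field.
function submitJobCloudTrailPattern() {
  return {
    source: ["aws.batch"],
    "detail-type": ["AWS API Call via CloudTrail"],
    detail: {
      eventSource: ["batch.amazonaws.com"],
      eventName: ["SubmitJob"],
      errorCode: [{ exists: false }],
    },
  };
}

// (2) Polling: turn already-fetched job summaries into per-status counts.
interface JobSummary {
  jobId: string;
  status: string;
}

const ALL_STATUSES = [
  "SUBMITTED", "PENDING", "RUNNABLE", "STARTING",
  "RUNNING", "SUCCEEDED", "FAILED",
] as const;

function countJobsByStatus(jobs: readonly JobSummary[]): Record<string, number> {
  const counts: Record<string, number> = {};
  // Start every known status at zero so an empty queue still yields data points.
  for (const status of ALL_STATUSES) {
    counts[status] = 0;
  }
  for (const job of jobs) {
    const current = counts[job.status];
    // Statuses outside ALL_STATUSES are ignored.
    if (current !== undefined) {
      counts[job.status] = current + 1;
    }
  }
  return counts;
}

const snapshot = countJobsByStatus([
  { jobId: "a", status: "RUNNING" },
  { jobId: "b", status: "RUNNING" },
  { jobId: "c", status: "RUNNABLE" },
]);
console.log(snapshot);
```

The counts from (2) are point-in-time values ("jobs in state X right now"), which is exactly what the event-based approach cannot provide.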
I also ran into some weirdness which would raise some eyebrows if I were to try to contribute this. For example, there is no way to limit the scope of ListJobs to a particular queue: Batch requires you to grant access to all job queues (resource: *) to list jobs in one queue.
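For reference, the permission described above would look roughly like this as an IAM policy statement. The `PolicyStatement` type and `listJobsStatement` helper are hypothetical names used only for this sketch.

```typescript
// Minimal shape of an IAM policy statement (illustrative, not a full schema).
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string[];
}

// Per the comment above, ListJobs cannot be scoped to a single queue,
// so the statement has to use a wildcard resource.
function listJobsStatement(): PolicyStatement {
  return {
    Effect: "Allow",
    Action: ["batch:ListJobs"],
    Resource: ["*"],
  };
}

console.log(JSON.stringify(listJobsStatement(), null, 2));
```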
For the event-based approach, I'm creating a Lambda and publishing metrics to a custom namespace, just to make them easier to discover in CloudWatch. You can also avoid this, and I reckon you don't need the SNS topic either, since you can alarm on the number of times a rule was triggered.
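Alarming directly on rule triggers could look something like the sketch below, which builds an input object in the shape of CloudWatch's PutMetricAlarm request using the `TriggeredRules` metric from the `AWS/Events` namespace. The helper name, alarm naming scheme, and rule name are assumptions for illustration.

```typescript
// Subset of PutMetricAlarm input fields used by this sketch.
interface RuleTriggerAlarmInput {
  AlarmName: string;
  Namespace: string;
  MetricName: string;
  Dimensions: { Name: string; Value: string }[];
  Statistic: "Sum";
  Period: number;
  EvaluationPeriods: number;
  Threshold: number;
  ComparisonOperator: "GreaterThanOrEqualToThreshold";
  TreatMissingData: "notBreaching";
}

// Hypothetical helper: alarm when `ruleName` fires at least `threshold`
// times within one period. No SNS topic sits between the rule and the metric.
function ruleTriggeredAlarm(
  ruleName: string,
  threshold: number,
  periodSeconds: number = 60,
): RuleTriggerAlarmInput {
  return {
    AlarmName: `${ruleName}-triggered`, // naming scheme is an assumption
    Namespace: "AWS/Events",
    MetricName: "TriggeredRules",
    Dimensions: [{ Name: "RuleName", Value: ruleName }],
    Statistic: "Sum",
    Period: periodSeconds,
    EvaluationPeriods: 1,
    Threshold: threshold,
    ComparisonOperator: "GreaterThanOrEqualToThreshold",
    // No matching events means no data points, which should not alarm.
    TreatMissingData: "notBreaching",
  };
}

// "batch-job-failed" is a placeholder rule name.
console.log(JSON.stringify(ruleTriggeredAlarm("batch-job-failed", 5), null, 2));
```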
WRT the resource utilization widgets, you can use Batch ContainerInsights (although these need to be flipped on manually or via a custom resource: https://github.com/aws/aws-cdk/issues/21698)
However, I do understand this would likely be a large effort given that these metrics are currently not even sent to CloudWatch
I'd encourage you to reach out to TAM/support contacts so that they can capture the datapoint about the customer request for the Batch team to help prioritize it.
unfortunately it's not clear to me how to contribute something that's backed by a custom Lambda to this repo
There's some very basic stuff in this folder that ultimately gets used elsewhere in the repo, but it's far from a robust setup. SecretsManagerMetricsPublisher is a somewhat similar idea that runs hourly to emit some custom metrics.
@echeung-amzn Thanks for the pointer Eugene. To be consistent, I'd need to adapt my solution a bit, but no worries. Currently my Lambda:
- Is written in Python
  - I can rewrite this to .js
- Writes metrics using EMF, so we can do analysis on the metadata fields in CW Insights
  - I guess we'd expect this module to call CW Metrics directly?
  - I recall the API had some pretty strict TPS limit (but it seems it's now 500 per second and each request can take up to 1000 metrics, which is probably more than enough. I'm sure there are some customers who run 10,000s of jobs in parallel, at which point we might run into issues)
- Has a dependency on AWS Lambda Powertools for the handy EMF abstraction
  - won't be needed if we call the CW API directly
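For context on what the EMF abstraction hides, here is a rough sketch of building a CloudWatch Embedded Metric Format record by hand, with no Powertools dependency. The `emfRecord` helper, the `Custom/Batch` namespace, and the metric and field names are all hypothetical; the unit is fixed to `Count` to keep the example short.

```typescript
interface EmfMetadata {
  Timestamp: number; // milliseconds since epoch
  CloudWatchMetrics: {
    Namespace: string;
    Dimensions: string[][];
    Metrics: { Name: string; Unit: string }[];
  }[];
}

interface EmfRecord {
  _aws: EmfMetadata;
  [key: string]: unknown;
}

// Hypothetical helper: one metric, one dimension set, plus free-form
// metadata fields that stay queryable in CW Logs Insights.
function emfRecord(
  namespace: string,
  dimensions: Record<string, string>,
  metricName: string,
  value: number,
  metadata: Record<string, string> = {},
  timestampMs: number = Date.now(),
): EmfRecord {
  const fields: Record<string, unknown> = { ...metadata, ...dimensions };
  fields[metricName] = value;
  return {
    ...fields,
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [Object.keys(dimensions)],
          Metrics: [{ Name: metricName, Unit: "Count" }],
        },
      ],
    },
  };
}

// In a Lambda, writing this JSON as a single log line is what publishes the metric.
console.log(
  JSON.stringify(
    emfRecord("Custom/Batch", { JobQueue: "example-queue" }, "JobsFailed", 1, {
      jobId: "example-job-id",
    }),
  ),
);
```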
Does that sound about right?
Writes metrics using EMF [...] I guess we'd expect this module to call CW Metrics directly?
I don't feel strongly about this; it'd be more of a question of cost-benefit. As you mention later, it's simpler with the current setup to just call AWS SDK APIs at least.
Has a dependency on AWS Lambda Powertools for the handy EMF abstraction
That's definitely a downside of the current repo setup since the handler code is just super basic with no build process involved.
Ok, I'll avoid EMF and powertools. (For reference - you can just attach the official powertools layer, no builds involved, but having 0 non-lambda runtime deps would be best for this repo I agree)
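A dependency-free handler that calls the CloudWatch API directly would mostly need to batch its data points. Below is a minimal sketch of that step, assuming the 1000-metrics-per-request figure mentioned earlier in the thread still holds; the function and type names are hypothetical, and the actual SDK call is left out.

```typescript
// Subset of a PutMetricData datum used by this sketch.
interface MetricDatum {
  MetricName: string;
  Value: number;
  Unit: "Count";
  Dimensions: { Name: string; Value: string }[];
}

interface PutMetricDataRequest {
  Namespace: string;
  MetricData: MetricDatum[];
}

// Figure quoted in the thread; check current CloudWatch quotas before relying on it.
const MAX_METRICS_PER_REQUEST = 1000;

// Split data points into request payloads no larger than `batchSize`.
function toPutMetricDataRequests(
  namespace: string,
  data: readonly MetricDatum[],
  batchSize: number = MAX_METRICS_PER_REQUEST,
): PutMetricDataRequest[] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const requests: PutMetricDataRequest[] = [];
  for (let i = 0; i < data.length; i += batchSize) {
    requests.push({ Namespace: namespace, MetricData: data.slice(i, i + batchSize) });
  }
  return requests;
}

// Example: 2500 data points for a placeholder namespace and queue name.
const sampleData: MetricDatum[] = Array.from({ length: 2500 }, (_, i) => ({
  MetricName: "JobsRunning",
  Value: i,
  Unit: "Count" as const,
  Dimensions: [{ Name: "JobQueue", Value: "example-queue" }],
}));
const sampleRequests = toPutMetricDataRequests("Custom/Batch", sampleData);
console.log(sampleRequests.map((r) => r.MetricData.length));
```

Each resulting payload could then be passed to the SDK's PutMetricData call, with the number of requests per invocation kept well under the TPS limit discussed above.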