add-agent-monitoring-alerting
Add Monitoring & Alerting for Agent Deployments
Overview
Pertains to #142. This PR adds monitoring and alerting infrastructure to the agent-starter-pack, giving users production-grade observability out of the box. The implementation is platform-aware, agent-aware, and fully configurable through user prompts during project creation.
What's New
User-Facing Features
1. Optional Email Alert Notifications
- Interactive prompt during `agent-starter-pack create` asks for an email address
- If provided, alerts are delivered via email + Cloud Console
- If skipped, alerts are console-only (no email noise for dev environments)
- User sees clear confirmation of their choice with visual feedback
2. Configurable Alert Thresholds
All thresholds are exposed as Terraform variables with sensible defaults:
- Latency alerts: P95 threshold (default: 3000ms)
- Error rate alerts: Error count per 5-min window (default: 10 errors)
- Retriever latency alerts (Agentic RAG only): P99 threshold (default: 10000ms)
- Agent error rate (Cloud Run): Errors per second (default: 0.5/sec)
Users can customise these in `deployment/terraform/dev/vars/env.tfvars` after project creation.
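For illustration, the generated defaults might look like the following in `env.tfvars` (variable names here are assumptions for the sketch; check the generated file for the exact names):

```hcl
# Illustrative monitoring thresholds (names assumed, defaults from this PR)
latency_threshold_ms           = 3000   # P95 request latency alert
error_count_threshold          = 10     # errors per 5-minute window
retriever_latency_threshold_ms = 10000  # P99 retriever latency (Agentic RAG only)
agent_error_rate_threshold     = 0.5    # errors/sec (Cloud Run only)
```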
Infrastructure Added
Universal Log-Based Metrics (All Agents, All Platforms)
- Agent Operation Count: Tracks all agent operations with operation type labels
- Agent Error Count by Category: Categorised errors (LLM_FAILURE, TOOL_FAILURE, RETRIEVER_FAILURE, etc.)
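As a rough sketch, a log-based metric of this shape can be defined in Terraform as follows (the filter and label names are assumptions for illustration, not the exact definitions in `monitoring.tf`):

```hcl
# Counts agent errors from structured logs, labelled by category
# (LLM_FAILURE, TOOL_FAILURE, RETRIEVER_FAILURE, ...).
resource "google_logging_metric" "agent_error_count" {
  name   = "agent_error_count_by_category"
  filter = "severity>=ERROR AND jsonPayload.error_category:*"

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
    labels {
      key        = "error_category"
      value_type = "STRING"
    }
  }

  # Pulls the label value out of each matching log entry.
  label_extractors = {
    error_category = "EXTRACT(jsonPayload.error_category)"
  }
}
```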
Agentic RAG-Specific Metrics
- Retriever Latency Distribution: P50/P95/P99 retrieval performance with histogram buckets
- Document Count Distribution: Number of documents retrieved per call
- Retriever Latency Alert: Fires when P99 > threshold (default 10s)
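The latency distribution could be captured with a `DISTRIBUTION`-valued log-based metric along these lines (the log field name and bucket layout are illustrative assumptions):

```hcl
# Histogram of retriever latency; feeds the P50/P95/P99 charts and the P99 alert.
resource "google_logging_metric" "retriever_latency" {
  name   = "retriever_latency_distribution"
  filter = "jsonPayload.retriever_latency_ms:*"

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "DISTRIBUTION"
    unit        = "ms"
  }

  # Assumes the retriever logs its latency as jsonPayload.retriever_latency_ms.
  value_extractor = "EXTRACT(jsonPayload.retriever_latency_ms)"

  bucket_options {
    exponential_buckets {
      num_finite_buckets = 32
      growth_factor      = 1.4
      scale              = 1
    }
  }
}
```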
Agent Engine (Reasoning Engine) Platform
- Latency Alert: P95 request latency monitoring using native platform metrics
- Error Rate Alert: Fires when log-based error count exceeds threshold in 5-min window
- Dashboard: a 5-7 chart dashboard including:
- Request count (requests/sec)
- Request latency (P50/P95/P99)
- CPU allocation
- Memory allocation
- Agent errors by category
- Retriever latency (Agentic RAG only)
- Documents retrieved per call (Agentic RAG only)
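Tying the pieces together, the error rate alert can be driven by the log-based metric sketched earlier, roughly like this (names and thresholds are illustrative; the real policy lives in `monitoring.tf`):

```hcl
resource "google_monitoring_alert_policy" "agent_error_rate" {
  display_name = "Agent errors (5-minute window)"
  combiner     = "OR"

  conditions {
    display_name = "Log-based error count above threshold"
    condition_threshold {
      # User-defined log-based metrics are exposed under this prefix.
      filter          = "metric.type=\"logging.googleapis.com/user/agent_error_count_by_category\""
      comparison      = "COMPARISON_GT"
      threshold_value = var.error_count_threshold
      duration        = "0s"

      aggregations {
        alignment_period   = "300s"  # the 5-minute window
        per_series_aligner = "ALIGN_SUM"
      }
    }
  }

  # Email channel is created conditionally; see the Smart Templating sketch below.
  notification_channels = google_monitoring_notification_channel.email[*].id
}
```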
Cloud Run Platform
- Latency Alert: P95 request latency using Cloud Run native metrics
- 5xx Error Rate Alert: Monitors 5xx response codes
- Agent Error Alert: Log-based agent errors with rate threshold
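The 5xx alert can lean on Cloud Run's native request metrics rather than log processing, along these lines (illustrative):

```hcl
resource "google_monitoring_alert_policy" "cloud_run_5xx" {
  display_name = "Cloud Run 5xx error rate"
  combiner     = "OR"

  conditions {
    display_name = "5xx responses per second"
    condition_threshold {
      filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        "metric.labels.response_code_class=\"5xx\"",
      ])
      comparison      = "COMPARISON_GT"
      threshold_value = var.agent_error_rate_threshold
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_RATE"  # converts the counter to errors/sec
      }
    }
  }
}
```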
Technical Details
Terraform Structure
- New file: `deployment/terraform/dev/monitoring.tf` (757 lines)
- New file: `deployment/terraform/monitoring.tf` (prod equivalent)
- Modified: `deployment/terraform/dev/variables.tf` (added 4 monitoring variables)
- Modified: `deployment/terraform/dev/vars/env.tfvars` (added default threshold values)
- Modified: `deployment/terraform/dev/apis.tf` (added `monitoring.googleapis.com`)
Python CLI Integration
- `agent_starter_pack/cli/commands/create.py`: Added interactive email prompt
- `agent_starter_pack/cli/utils/template.py`: Thread `alert_notification_email` through template processing
- `tests/cli/commands/test_create.py`: Updated test mocks to handle new prompt
Smart Templating
- Uses Jinja2 conditionals to render appropriate resources based on:
  - `cookiecutter.deployment_target` (agent_engine vs cloud_run)
  - `cookiecutter.agent_name` (agentic_rag gets extra retriever metrics)
- Notification channel only created if email provided (using Terraform `count`); see the sketch below
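A minimal sketch of that conditional channel (resource and variable names assumed for illustration):

```hcl
# Created only when the user supplied an email during `create`;
# an empty string yields count = 0 and no channel.
resource "google_monitoring_notification_channel" "email" {
  count        = var.alert_notification_email != "" ? 1 : 0
  display_name = "Agent alert email"
  type         = "email"

  labels = {
    email_address = var.alert_notification_email
  }
}
```

Alert policies can then attach it with a splat (`google_monitoring_notification_channel.email[*].id`), which degrades gracefully to an empty list when no email was given.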
Metric Design Decisions
Why log-based metrics for agent telemetry?
- Platform-agnostic: Works on both Agent Engine and Cloud Run
- Flexible: Can extract any JSON payload attribute from structured logs
- Extensible: Users can add custom agent metrics by logging with the right labels
Why native metrics for platform SLOs?
- Accuracy: Platform-provided metrics are the source of truth
- Performance: No additional overhead from log processing
- Consistency: Aligns with Google Cloud best practices
Alert auto-close: 30 minutes
- Prevents alert fatigue from transient issues
- Long enough to investigate without losing context
- Configurable via `alert_strategy.auto_close` if users want different behaviour
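In practice that is a one-line tweak inside each alert policy resource (fragment shown for orientation, not a standalone file):

```hcl
  # Inside each google_monitoring_alert_policy resource:
  alert_strategy {
    auto_close = "1800s"  # 30 minutes; raise or lower to taste
  }
```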
Test Coverage
All tests pass:
- ✅ 95/95 CLI tests (including new email prompt flow)
- ✅ Ruff linting
- ✅ Mypy type checking
- ✅ Import ordering fixed
Migration Notes
Existing Projects
- This is a template change only - existing deployed agents are unaffected
- Users can retrofit monitoring by:
  - Copying the new `monitoring.tf` files
  - Adding the monitoring variables
  - Running `terraform apply`
New Projects
- Zero additional effort required
- Users just need to answer the email prompt during creation
- Monitoring deploys automatically with the agent infrastructure
Example Usage
```
$ agent-starter-pack create my-agent
# ... after other prompts ...

Monitoring & Alerting Setup
Configure email notifications for production alerts (optional).

Email for alert notifications: [email protected]
✓ Alerts will be sent to: [email protected]

# Or skip it:
Email for alert notifications:
⚠ Email notifications disabled. Alerts will only appear in Cloud Console.
```
After deployment, users get:
- Real-time dashboards in Cloud Monitoring
- Automatic alerts when thresholds are breached
- Structured logs for debugging with `labels.service_name` filtering
Related Documentation
The monitoring infrastructure automatically creates:
- Cloud Monitoring dashboard (named "Reasoning Engine - {project_name}")
- Alert policies with descriptive documentation
- Notification channel (email) if configured
Users can find their dashboard at:
https://console.cloud.google.com/monitoring/dashboards
Checklist
- [x] Added user-facing email prompt with clear feedback
- [x] Created comprehensive monitoring.tf for both dev and prod
- [x] Added configurable threshold variables with sensible defaults
- [x] Platform-specific alerts (Agent Engine vs Cloud Run)
- [x] Agent-specific metrics (Agentic RAG retriever monitoring)
- [x] Updated CLI tests with new prompt flow
- [x] All linting and type checking passes
- [x] Tested with both email provided and skipped scenarios
(Screenshot: prompt for monitoring email addition)
(Screenshot: passing tests)
@eliasecchig @allen-stephen
/gcbrun