add-agent-monitoring-alerting
Add Monitoring & Alerting for Agent Deployments
Overview
Pertains to #142. This PR adds monitoring and alerting infrastructure to the agent-starter-pack, giving users production-grade observability out of the box. The implementation is platform-aware, agent-aware, and fully configurable through user prompts during project creation.
What's New
User-Facing Features
1. Optional Email Alert Notifications
- Interactive prompt during `agent-starter-pack create` asks for an email address
- If provided, alerts are delivered via email + Cloud Console
- If skipped, alerts are console-only (no email noise for dev environments)
- User sees clear confirmation of their choice with visual feedback
2. Configurable Alert Thresholds
All thresholds are exposed as Terraform variables with sensible defaults:
- Latency alerts: P95 threshold (default: 3000ms)
- Error rate alerts: Error count per 5-min window (default: 10 errors)
- Retriever latency alerts (Agentic RAG only): P99 threshold (default: 10000ms)
- Agent error rate (Cloud Run): Errors per second (default: 0.5/sec)
Users can customise these in `deployment/terraform/dev/vars/env.tfvars` after project creation.
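For illustration, the generated defaults might look like the following in `env.tfvars` (variable names here are assumptions for the sketch; check the generated file for the exact names):

```hcl
# Illustrative monitoring thresholds (names assumed, defaults from this PR)
latency_threshold_ms           = 3000   # P95 request latency alert
error_count_threshold          = 10     # errors per 5-minute window
retriever_latency_threshold_ms = 10000  # P99 retriever latency (Agentic RAG only)
agent_error_rate_threshold     = 0.5    # errors/sec (Cloud Run only)
```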
Infrastructure Added
Universal Log-Based Metrics (All Agents, All Platforms)
- Agent Operation Count: Tracks all agent operations with operation type labels
- Agent Error Count by Category: Categorised errors (LLM_FAILURE, TOOL_FAILURE, RETRIEVER_FAILURE, etc.)
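As a rough sketch, a log-based metric of this shape can be defined in Terraform as follows (the filter and label names are assumptions for illustration, not the exact definitions in `monitoring.tf`):

```hcl
# Counts agent errors from structured logs, labelled by category
# (LLM_FAILURE, TOOL_FAILURE, RETRIEVER_FAILURE, ...).
resource "google_logging_metric" "agent_error_count" {
  name   = "agent_error_count_by_category"
  filter = "severity>=ERROR AND jsonPayload.error_category:*"

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
    labels {
      key        = "error_category"
      value_type = "STRING"
    }
  }

  # Pulls the label value out of each matching log entry.
  label_extractors = {
    error_category = "EXTRACT(jsonPayload.error_category)"
  }
}
```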
Agentic RAG-Specific Metrics
- Retriever Latency Distribution: P50/P95/P99 retrieval performance with histogram buckets
- Document Count Distribution: Number of documents retrieved per call
- Retriever Latency Alert: Fires when P99 > threshold (default 10s)
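The latency distribution could be captured with a `DISTRIBUTION`-valued log-based metric along these lines (the log field name and bucket layout are illustrative assumptions):

```hcl
# Histogram of retriever latency; feeds the P50/P95/P99 charts and the P99 alert.
resource "google_logging_metric" "retriever_latency" {
  name   = "retriever_latency_distribution"
  filter = "jsonPayload.retriever_latency_ms:*"

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "DISTRIBUTION"
    unit        = "ms"
  }

  # Assumes the retriever logs its latency as jsonPayload.retriever_latency_ms.
  value_extractor = "EXTRACT(jsonPayload.retriever_latency_ms)"

  bucket_options {
    exponential_buckets {
      num_finite_buckets = 32
      growth_factor      = 1.4
      scale              = 1
    }
  }
}
```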
Agent Engine (Reasoning Engine) Platform
- Latency Alert: P95 request latency monitoring using native platform metrics
- Error Rate Alert: Fires when log-based error count exceeds threshold in 5-min window
- Dashboard: a 5-7 chart dashboard including:
- Request count (requests/sec)
- Request latency (P50/P95/P99)
- CPU allocation
- Memory allocation
- Agent errors by category
- Retriever latency (Agentic RAG only)
- Documents retrieved per call (Agentic RAG only)
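Tying the pieces together, the error rate alert can be driven by the log-based metric sketched earlier, roughly like this (names and thresholds are illustrative; the real policy lives in `monitoring.tf`):

```hcl
resource "google_monitoring_alert_policy" "agent_error_rate" {
  display_name = "Agent errors (5-minute window)"
  combiner     = "OR"

  conditions {
    display_name = "Log-based error count above threshold"
    condition_threshold {
      # User-defined log-based metrics are exposed under this prefix.
      filter          = "metric.type=\"logging.googleapis.com/user/agent_error_count_by_category\""
      comparison      = "COMPARISON_GT"
      threshold_value = var.error_count_threshold
      duration        = "0s"

      aggregations {
        alignment_period   = "300s"  # the 5-minute window
        per_series_aligner = "ALIGN_SUM"
      }
    }
  }

  # Email channel is created conditionally; see the Smart Templating sketch below.
  notification_channels = google_monitoring_notification_channel.email[*].id
}
```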
Cloud Run Platform
- Latency Alert: P95 request latency using Cloud Run native metrics
- 5xx Error Rate Alert: Monitors 5xx response codes
- Agent Error Alert: Log-based agent errors with rate threshold
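The 5xx alert can lean on Cloud Run's native request metrics rather than log processing, along these lines (illustrative):

```hcl
resource "google_monitoring_alert_policy" "cloud_run_5xx" {
  display_name = "Cloud Run 5xx error rate"
  combiner     = "OR"

  conditions {
    display_name = "5xx responses per second"
    condition_threshold {
      filter = join(" AND ", [
        "metric.type=\"run.googleapis.com/request_count\"",
        "resource.type=\"cloud_run_revision\"",
        "metric.labels.response_code_class=\"5xx\"",
      ])
      comparison      = "COMPARISON_GT"
      threshold_value = var.agent_error_rate_threshold
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_RATE"  # converts the counter to errors/sec
      }
    }
  }
}
```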
Technical Details
Terraform Structure
- New file: `deployment/terraform/dev/monitoring.tf` (757 lines)
- New file: `deployment/terraform/monitoring.tf` (prod equivalent)
- Modified: `deployment/terraform/dev/variables.tf` (added 4 monitoring variables)
- Modified: `deployment/terraform/dev/vars/env.tfvars` (added default threshold values)
- Modified: `deployment/terraform/dev/apis.tf` (added `monitoring.googleapis.com`)
Python CLI Integration
- `agent_starter_pack/cli/commands/create.py`: Added interactive email prompt
- `agent_starter_pack/cli/utils/template.py`: Thread `alert_notification_email` through template processing
- `tests/cli/commands/test_create.py`: Updated test mocks to handle new prompt
Smart Templating
- Uses Jinja2 conditionals to render appropriate resources based on:
  - `cookiecutter.deployment_target` (agent_engine vs cloud_run)
  - `cookiecutter.agent_name` (agentic_rag gets extra retriever metrics)
- Notification channel only created if email provided (using Terraform `count`); see the sketch below
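A minimal sketch of that conditional channel (resource and variable names assumed for illustration):

```hcl
# Created only when the user supplied an email during `create`;
# an empty string yields count = 0 and no channel.
resource "google_monitoring_notification_channel" "email" {
  count        = var.alert_notification_email != "" ? 1 : 0
  display_name = "Agent alert email"
  type         = "email"

  labels = {
    email_address = var.alert_notification_email
  }
}
```

Alert policies can then attach it with a splat (`google_monitoring_notification_channel.email[*].id`), which degrades gracefully to an empty list when no email was given.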
Metric Design Decisions
Why log-based metrics for agent telemetry?
- Platform-agnostic: Works on both Agent Engine and Cloud Run
- Flexible: Can extract any JSON payload attribute from structured logs
- Extensible: Users can add custom agent metrics by logging with the right labels
Why native metrics for platform SLOs?
- Accuracy: Platform-provided metrics are the source of truth
- Performance: No additional overhead from log processing
- Consistency: Aligns with Google Cloud best practices
Alert auto-close: 30 minutes
- Prevents alert fatigue from transient issues
- Long enough to investigate without losing context
- Configurable via `alert_strategy.auto_close` if users want different behaviour
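In practice that is a one-line tweak inside each alert policy resource (fragment shown for orientation, not a standalone file):

```hcl
  # Inside each google_monitoring_alert_policy resource:
  alert_strategy {
    auto_close = "1800s"  # 30 minutes; raise or lower to taste
  }
```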
Test Coverage
All tests pass:
- ✅ 95/95 CLI tests (including new email prompt flow)
- ✅ Ruff linting
- ✅ Mypy type checking
- ✅ Import ordering fixed
Migration Notes
Existing Projects
- This is a template change only - existing deployed agents are unaffected
- Users can retrofit monitoring by:
  - Copying the new `monitoring.tf` files
  - Adding the monitoring variables
  - Running `terraform apply`
New Projects
- Zero additional effort required
- Users just need to answer the email prompt during creation
- Monitoring deploys automatically with the agent infrastructure
Example Usage
```
$ agent-starter-pack create my-agent
# ... after other prompts ...

Monitoring & Alerting Setup
Configure email notifications for production alerts (optional).

Email for alert notifications: [email protected]
✓ Alerts will be sent to: [email protected]

# Or skip it:
Email for alert notifications:
⚠ Email notifications disabled. Alerts will only appear in Cloud Console.
```
After deployment, users get:
- Real-time dashboards in Cloud Monitoring
- Automatic alerts when thresholds are breached
- Structured logs for debugging with `labels.service_name` filtering
Related Documentation
The monitoring infrastructure automatically creates:
- Cloud Monitoring dashboard (named "Reasoning Engine - {project_name}")
- Alert policies with descriptive documentation
- Notification channel (email) if configured
Users can find their dashboard at:
https://console.cloud.google.com/monitoring/dashboards
Checklist
- [x] Added user-facing email prompt with clear feedback
- [x] Created comprehensive monitoring.tf for both dev and prod
- [x] Added configurable threshold variables with sensible defaults
- [x] Platform-specific alerts (Agent Engine vs Cloud Run)
- [x] Agent-specific metrics (Agentic RAG retriever monitoring)
- [x] Updated CLI tests with new prompt flow
- [x] All linting and type checking passes
- [x] Tested with both email provided and skipped scenarios
(Screenshot: prompt for monitoring email addition)
(Screenshot: passing tests)
@eliasecchig @allen-stephen
/gcbrun