Fix observability stack
What was changed
✅ Loki 3.0 is running without errors and properly configured
✅ Prometheus successfully scrapes metrics from all Temporal services
✅ Grafana 12.2.1 is deployed with proper datasource configuration
✅ All 4 dashboards are provisioned and display Temporal metrics correctly
✅ Consistent YAML formatting across all provisioning files (camelCase field names)
✅ No compatibility or configuration errors in logs
Why?
Because of the breaking changes in Loki 3.0 mentioned in this issue. I started by fixing the Loki configuration, but that alone didn't help: I still couldn't see any dashboards in Grafana.
Checklist
- Closes 215
- How was this tested: docker compose -f docker-compose-multirole.yaml up and then go to Grafana to ensure dashboards are working
- Any docs updates needed? No
Detailed description of changes made and their explanations
1. Loki 3.0 Configuration Update
File: deployment/loki/local-config.yaml
Changes Made
1.1 Updated Schema Version (v9 → v13)
- Reason: Loki 3.0 requires schema v13 or newer for proper support and modern features like structured metadata and OTLP ingestion
- Change: Modified schema configuration from v9 to v13
1.2 Changed Index Store (boltdb → tsdb)
- Reason: BoltDB index type is deprecated. Loki 3.0 requires TSDB (Time Series Database) as the index type for modern features
- Change: Updated store: boltdb to store: tsdb in schema_config
1.3 Updated Storage Configuration
- Reason: Storage backend needs to match the new index type
- Changes:
  - Removed boltdb storage section with directory: /tmp/loki/index
  - Added tsdb_shipper configuration with:
    - active_index_directory: /tmp/loki/tsdb
    - cache_location: /tmp/loki/index_cache
    - cache_ttl: 24h
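For reference, the storage change described above might look roughly like the fragment below. This is a sketch rather than the exact file contents: the tsdb_shipper values come from the list above, while the filesystem chunk directory is an assumption added only to make the fragment complete.

```yaml
storage_config:
  # TSDB index shipper, replacing the old boltdb section
  tsdb_shipper:
    active_index_directory: /tmp/loki/tsdb
    cache_location: /tmp/loki/index_cache
    cache_ttl: 24h
  # Assumption: chunks are kept on the local filesystem; the path is illustrative
  filesystem:
    directory: /tmp/loki/chunks
```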
1.4 Optimized Index Period
- Reason: Loki's TSDB index store expects a 24h index period; smaller index periods also make index management more efficient
- Change: Updated index period from 168h to 24h
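Taken together, changes 1.1, 1.2 and 1.4 should leave a schema_config along the lines of the sketch below. The schema version, store and index period are the values named in this description; the from date, object_store and index prefix are assumptions for illustration and may differ in the actual file.

```yaml
schema_config:
  configs:
    - from: 2024-01-01          # assumption: any start date before first ingestion
      store: tsdb               # was: boltdb
      object_store: filesystem  # assumption: local filesystem chunk storage
      schema: v13               # was: v9
      index:
        prefix: index_          # assumption: illustrative table prefix
        period: 24h             # was: 168h
```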
1.5 Added Compactor Configuration
- Reason: Loki 3.0 requires compactor configuration for proper operation. Previous version was missing this
- Changes: Added compactor section:
    compactor:
      working_directory: /tmp/loki/compactor
      compaction_interval: 10m
1.6 Removed Deprecated Fields
- Reason: These fields are no longer valid in Loki 3.0
- Removed:
  - enforce_metric_name from limits_config
  - Entire chunk_store_config section (with max_look_back_period)
  - Entire table_manager section (only used with DynamoDB)
2. Prometheus Configuration Fix
File: deployment/prometheus/config.yml
Changes Made
2.1 Updated Scrape Targets from host.docker.internal to Container Names
- Reason: Prometheus runs inside Docker and the Temporal services are also running as Docker containers on the same network. Using host.docker.internal (which points to the host machine) was incorrect.
- Changes: Updated targets in temporalmetrics job:
  - host.docker.internal:8000 → temporal-history:8000
  - host.docker.internal:8001 → temporal-matching:8001
  - host.docker.internal:8002 → temporal-frontend:8002
  - host.docker.internal:8003 → temporal-worker:8003
  - host.docker.internal:8004 → temporal-frontend2:8004
Result: Prometheus can now successfully scrape metrics from the Temporal services, and dashboards have access to the required metrics (service_requests, service_errors, etc.)
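As an illustration, the updated scrape job could look like the fragment below. The job name and targets are taken from the list above; the use of static_configs and the omission of other job settings (scrape interval, metrics path) are assumptions, so treat it as a sketch of the shape, not the exact file.

```yaml
scrape_configs:
  - job_name: temporalmetrics
    static_configs:
      # Container names resolve over the shared Docker network
      - targets:
          - temporal-history:8000
          - temporal-matching:8001
          - temporal-frontend:8002
          - temporal-worker:8003
          - temporal-frontend2:8004
```

One quick way to check the result is the Targets page in the Prometheus UI, where each of the five endpoints should be reported as up.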
3. Grafana Datasource Configuration Fix
File: deployment/grafana/provisioning/datasources/all.yml
Changes Made
3.1 Fixed isDefault Field Format
- Reason: Grafana expects isDefault: true (camelCase), not is_default: true (snake_case)
- Change: Updated Prometheus datasource from is_default: true to isDefault: true
3.2 Standardized Field Naming (org_id → orgId)
- Reason: Consistency with Grafana's modern conventions and matching the dashboards provisioning file format
- Changes:
  - Updated org_id: 1 to orgId: 1 for both Prometheus and Loki datasources
  - Added explicit isDefault: false to Loki datasource for clarity
Result: Prometheus is now properly set as the default datasource, allowing dashboards with $datasource variable to work correctly. Both provisioning files now use consistent camelCase field naming.
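A minimal sketch of a datasource provisioning file with the camelCase fields described in this section is shown below. The orgId and isDefault values match the changes above; the datasource names, access mode and URLs are assumptions (they presume services called prometheus and loki on their default ports) and are not taken from the actual file.

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    orgId: 1                     # was: org_id
    access: proxy                # assumption
    url: http://prometheus:9090  # assumption: service name and default port
    isDefault: true              # was: is_default
  - name: Loki
    type: loki
    orgId: 1                     # was: org_id
    access: proxy                # assumption
    url: http://loki:3100        # assumption: service name and default port
    isDefault: false
```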
4. Grafana Dashboard Provisioning Update
File: deployment/grafana/provisioning/dashboards/all.yml
Changes Made
4.1 Updated to New Dashboard Provisioning Format
- Reason: Grafana uses an updated dashboard provisioning configuration format with apiVersion and providers block
- Changes:
  - Added apiVersion: 1 at the top
  - Wrapped provider config in providers: block
  - Removed deprecated folder property (replaced with path)
  - Updated provider fields to match new format:
    - Changed org_id to orgId
    - Added disableDeletion: false and editable: true
Result: Eliminates deprecation warnings and ensures proper dashboard provisioning in Grafana 12.2.1
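For orientation, a provider file in the new format might look like the sketch below. The apiVersion, providers wrapper, orgId, disableDeletion, editable and path fields are the ones listed in this section; the provider name is a placeholder, and the path value assumes dashboards are mounted at /var/lib/grafana/dashboards, as in the volume mounts in section 6.

```yaml
apiVersion: 1
providers:
  - name: default              # assumption: placeholder provider name
    orgId: 1                   # was: org_id
    type: file
    disableDeletion: false
    editable: true
    options:
      # Directory inside the container that holds the dashboard JSON files
      path: /var/lib/grafana/dashboards
```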
5. Grafana Version Upgrade
File: docker-compose-multirole.yaml
Changes Made
5.1 Updated Grafana Image Version
- Reason: Old version (7.5.16) could not parse dashboard queries designed for Grafana 8.0.4+, causing "Failed to upgrade legacy queries e.replace is not a function" errors
- Change: Updated image from grafana/grafana:7.5.16 to grafana/grafana:12.2.1
Result: Dashboards now load without compatibility errors and display data correctly
6. Grafana Volume Mounts Configuration
File: docker-compose-multirole.yaml
Changes Made
6.1 Added Missing Volume Mounts
- Reason: Dashboards and their provisioning configuration were not mounted to the Grafana container, so they were not loaded
- Changes: Added three volume mounts to the grafana service:
    volumes:
      - type: bind
        source: ./temporalio/deployment/grafana/provisioning/datasources
        target: /etc/grafana/provisioning/datasources
      - type: bind
        source: ./temporalio/deployment/grafana/provisioning/dashboards
        target: /etc/grafana/provisioning/dashboards
      - type: bind
        source: ./temporalio/deployment/grafana/dashboards
        target: /var/lib/grafana/dashboards
Result: All 4 dashboards are now properly provisioned and visible in Grafana:
- Temporal Server Metrics
- Temporal SDK Metrics
- PostgreSQL Database
- Docker Engine Metrics