Fix observability stack

Open Norbiox opened this issue 1 month ago • 1 comments

What was changed

✅ Loki 3.0 is running without errors and properly configured
✅ Prometheus successfully scrapes metrics from all Temporal services
✅ Grafana 12.2.1 is deployed with proper datasource configuration
✅ All 4 dashboards are provisioned and display Temporal metrics correctly
✅ Consistent YAML formatting across all provisioning files (camelCase field names)
✅ No compatibility or configuration errors in logs

Why?

Because of the breaking changes in Loki 3.0 mentioned in this issue. I started by fixing the Loki configuration, but that didn't help: I still couldn't see any dashboards in Grafana.

Checklist

Closes 215
How was this tested: ran docker compose -f docker-compose-multirole.yaml up, then opened Grafana and confirmed that the dashboards work
Any docs updates needed? No

Detailed description of changes made and their explanations

1. Loki 3.0 Configuration Update

File: deployment/loki/local-config.yaml

Changes Made

1.1 Updated Schema Version (v9 → v13)

Reason: Loki 3.0 needs schema v13 for modern features such as structured metadata and OTLP ingestion
Change: Modified schema configuration from v9 to v13

1.2 Changed Index Store (boltdb → tsdb)

Reason: BoltDB index type is deprecated. Loki 3.0 requires TSDB (Time Series Database) as the index type for modern features
Change: Updated store: boltdb to store: tsdb in schema_config

1.3 Updated Storage Configuration

Reason: Storage backend needs to match the new index type
Changes:
- Removed boltdb storage section with directory: /tmp/loki/index
- Added tsdb_shipper configuration with:
  - active_index_directory: /tmp/loki/tsdb
  - cache_location: /tmp/loki/index_cache
  - cache_ttl: 24h

1.4 Optimized Index Period

Reason: Smaller index periods improve performance and make index management more efficient
Change: Updated index period from 168h to 24h
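
For reference, the combined result of changes 1.1 to 1.4 would look roughly like the sketch below. Only store, schema, period, and the tsdb_shipper settings come from the changes listed above. The from date, object_store: filesystem, the index_ prefix, and the chunks directory are assumptions for illustration and may differ from the actual file.

```yaml
# Sketch of deployment/loki/local-config.yaml after changes 1.1 to 1.4.
schema_config:
  configs:
    - from: 2020-10-24          # assumption: illustrative start date
      store: tsdb               # 1.2: was boltdb
      object_store: filesystem  # assumption: local filesystem chunk storage
      schema: v13               # 1.1: was v9
      index:
        prefix: index_          # assumption: illustrative prefix
        period: 24h             # 1.4: was 168h

storage_config:
  tsdb_shipper:                 # 1.3: replaces the boltdb section
    active_index_directory: /tmp/loki/tsdb
    cache_location: /tmp/loki/index_cache
    cache_ttl: 24h
  filesystem:
    directory: /tmp/loki/chunks # assumption: illustrative chunk directory
```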

1.5 Added Compactor Configuration

Reason: Loki 3.0 requires a compactor configuration for proper operation. The previous configuration was missing it

Changes: Added compactor section:

compactor:
  working_directory: /tmp/loki/compactor
  compaction_interval: 10m

1.6 Removed Deprecated Fields

Reason: These fields are no longer valid in Loki 3.0
Removed:
- enforce_metric_name from limits_config
- Entire chunk_store_config section (with max_look_back_period)
- Entire table_manager section (only used with DynamoDB)

2. Prometheus Configuration Fix

File: deployment/prometheus/config.yml

Changes Made

2.1 Updated Scrape Targets from host.docker.internal to Container Names

Reason: Prometheus runs inside Docker and the Temporal services are also running as Docker containers on the same network. Using host.docker.internal (which points to the host machine) was incorrect.
Changes: Updated targets in temporalmetrics job:
- host.docker.internal:8000 → temporal-history:8000
- host.docker.internal:8001 → temporal-matching:8001
- host.docker.internal:8002 → temporal-frontend:8002
- host.docker.internal:8003 → temporal-worker:8003
- host.docker.internal:8004 → temporal-frontend2:8004

Result: Prometheus can now successfully scrape metrics from the Temporal services, and dashboards have access to the required metrics (service_requests, service_errors, etc.)
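
A minimal sketch of the resulting scrape job is shown below. The job name and targets are the ones listed above; the use of static_configs and the surrounding layout are assumptions about how the file is organized. The container names only resolve if Prometheus and the Temporal services share a Docker network, as described in the reason above.

```yaml
# Sketch of the temporalmetrics job in deployment/prometheus/config.yml.
scrape_configs:
  - job_name: temporalmetrics
    static_configs:             # assumption: targets are declared statically
      - targets:
          - temporal-history:8000
          - temporal-matching:8001
          - temporal-frontend:8002
          - temporal-worker:8003
          - temporal-frontend2:8004
```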

3. Grafana Datasource Configuration Fix

File: deployment/grafana/provisioning/datasources/all.yml

Changes Made

3.1 Fixed isDefault Field Format

Reason: Grafana expects isDefault: true (camelCase) not is_default: true (snake_case)
Change: Updated Prometheus datasource from is_default: true to isDefault: true

3.2 Standardized Field Naming (org_id → orgId)

Reason: Consistency with Grafana's modern conventions and matching the dashboards provisioning file format
Changes:
- Updated org_id: 1 to orgId: 1 for both Prometheus and Loki datasources
- Added explicit isDefault: false to Loki datasource for clarity

Result: Prometheus is now properly set as the default datasource, allowing dashboards with $datasource variable to work correctly. Both provisioning files now use consistent camelCase field naming.
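
The datasource file after these changes might look like the sketch below. The orgId and isDefault values come from the changes above; the datasource names, the access mode, and the URLs are assumptions for illustration and depend on the service names and ports in the compose file.

```yaml
# Sketch of deployment/grafana/provisioning/datasources/all.yml.
apiVersion: 1
datasources:
  - name: Prometheus            # assumption: illustrative name
    type: prometheus
    access: proxy               # assumption
    orgId: 1                    # 3.2: was org_id
    url: http://prometheus:9090 # assumption: service name and port
    isDefault: true             # 3.1: was is_default
  - name: Loki                  # assumption: illustrative name
    type: loki
    access: proxy               # assumption
    orgId: 1                    # 3.2: was org_id
    url: http://loki:3100       # assumption: service name and port
    isDefault: false            # 3.2: added explicitly
```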

4. Grafana Dashboard Provisioning Update

File: deployment/grafana/provisioning/dashboards/all.yml

Changes Made

4.1 Updated to New Dashboard Provisioning Format

Reason: Grafana uses an updated dashboard provisioning format with an apiVersion field and a providers block
Changes:
- Added apiVersion: 1 at the top
- Wrapped provider config in providers: block
- Removed deprecated folder property (replaced with path)
- Updated provider fields to match new format:
  - Changed org_id to orgId
  - Added disableDeletion: false and editable: true

Result: Eliminates deprecation warnings and ensures proper dashboard provisioning in Grafana 12.2.1
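
A sketch of the provider file in the new format follows. The apiVersion, providers, orgId, disableDeletion, and editable fields come from the changes above. The provider name, the file type, and the path value are assumptions; the path shown is chosen to match the dashboards mount target used in section 6.

```yaml
# Sketch of deployment/grafana/provisioning/dashboards/all.yml.
apiVersion: 1
providers:
  - name: default               # assumption: illustrative provider name
    orgId: 1                    # was org_id
    type: file                  # assumption: dashboards are loaded from disk
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards  # assumption: replaces the deprecated folder property
```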

5. Grafana Version Upgrade

File: docker-compose-multirole.yaml

Changes Made

5.1 Updated Grafana Image Version

Reason: The old version (7.5.16) could not parse dashboard queries designed for Grafana 8.0.4+, causing "Failed to upgrade legacy queries e.replace is not a function" errors
Change: Updated image from grafana/grafana:7.5.16 to grafana/grafana:12.2.1

Result: Dashboards now load without compatibility errors and display data correctly

6. Grafana Volume Mounts Configuration

File: docker-compose-multirole.yaml

Changes Made

6.1 Added Missing Volume Mounts

Reason: Dashboards and their provisioning configuration were not mounted to the Grafana container, so they were not loaded

Changes: Added three volume mounts to the grafana service:

volumes:
  - type: bind
    source: ./temporalio/deployment/grafana/provisioning/datasources
    target: /etc/grafana/provisioning/datasources
  - type: bind
    source: ./temporalio/deployment/grafana/provisioning/dashboards
    target: /etc/grafana/provisioning/dashboards
  - type: bind
    source: ./temporalio/deployment/grafana/dashboards
    target: /var/lib/grafana/dashboards

Result: All 4 dashboards are now properly provisioned and visible in Grafana:
- Temporal Server Metrics
- Temporal SDK Metrics
- PostgreSQL Database
- Docker Engine Metrics

Nov 04 '25 09:11 Norbiox
