feat: prometheus sink for gqlmetrics, ensure all operations are tracked efficiently

Open StarpTech opened this issue 2 weeks ago • 3 comments

This PR ensures that operation usage is never dropped due to sampling. We eliminated sampling entirely while keeping CPU utilization at the same level. An increase in total allocated memory is expected. With this change, operations that occur rarely are always tracked.

I achieved this by making the gqlmetrics batch collector generic and using it for both implementations (the Prometheus sink and the GraphQL metrics sink).

Fixes ENG-8486

CPU

[Screenshot: CleanShot 2025-11-13 at 23 36 01@2x]

Memory

[Screenshot: CleanShot 2025-11-13 at 23 38 28@2x]

Config

telemetry:
  metrics:
    otlp:
      enabled: false
    prometheus:
      schema_usage:
        enabled: true
        exporter:
          batch_size: 4096
          export_timeout: 10s
          interval: 10s
          queue_size: 10240

Summary by CodeRabbit

  • New Features

    • Schema field usage now tracks all GraphQL requests (sampling removed) and adds configurable exporter settings for batching, queueing, flush interval, and export timeout.
    • New pluggable metrics exporters and sinks for Prometheus and GraphQL metrics with synchronous/asynchronous recording and graceful shutdown.
    • Added cached per-operation field metrics to reduce repeated work.
  • Tests

    • Improved reliability with time-based flush waits, exporter-driven scenarios, dynamic assertions, aggregation-focused expectations, and new benchmarks for exporter throughput and buffering.

Checklist

  • [ ] I have discussed my proposed changes in an issue and have received approval to proceed.
  • [ ] I have followed the coding standards of the project.
  • [ ] Tests or benchmarks have been added or updated.
  • [ ] Documentation has been updated on https://github.com/wundergraph/cosmo-docs.
  • [ ] I have read the Contributors Guide.

StarpTech • Nov 13 '25 22:11