Add initial OpenTelemetry support

Open rlopzc opened this issue 1 year ago • 0 comments

What kind of change does this PR introduce?

Feature

What is the current behavior?

No current behavior.

What is the new behavior?

This PR will start sending OpenTelemetry traces to the configured OTel vendor.

My goal with this PR is to set up OpenTelemetry in the repository, plus a development environment that sends traces. This should lower the barrier for contributors to explore this library and experiment with the traces (add more traces, trace other events, trace the different libraries this project uses, etc.).

The OTel vendor can be configured with the following environment variables:

  • OTEL_EXPORTER_OTLP_ENDPOINT. The endpoint to send the traces to. For example: http://localhost:4318.
  • OTEL_EXPORTER_OTLP_HEADERS. The headers to include in the request. For example: authorization="Bearer your-api-key".
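As a minimal sketch, the two variables could be set in a shell session like this before starting the application. The values are the examples from the list above; the API key is a placeholder, not a real credential. Note that the shell strips the double quotes, so the exporter sees `authorization=Bearer your-api-key`.

```shell
# Base OTLP endpoint the traces are sent to (example value from the list above).
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"

# Headers as key=value pairs. "your-api-key" is a placeholder.
# The quotes only protect the space from the shell; they are not part of the value.
export OTEL_EXPORTER_OTLP_HEADERS=authorization="Bearer your-api-key"
```

For a local collector without authentication, the headers variable can presumably be left unset.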

What's currently traced?

  • Proxied query to the DB. This is a general trace to which we can attach more information about the flow of events. It carries attributes that identify the query: tenant, user, mode, type, db_name, pool_pid, db_pid (example in the image below). Ideally, this trace will eventually cover the interactions with all the different systems: caches, pools, partisan, client_handler, db_handler, pg_parser, etc.

As the project is complex, I didn't have enough time to dive deep into each query flow and its edge cases. That's why this PR only traces the query sent to the proxy and the point at which the ClientHandler responds to the caller.

Next steps

  • Understand the different flows and edge cases, and add spans to the query trace (the one traced in this PR), taking into account the distributed nature of the calls.
  • This project uses partisan to communicate with the DB. Partisan produces telemetry events (docs here). One idea is to listen to those events and trace the requests sent to the DB. I need more time to think about how to pass the otel_span created in the ClientHandler through the DbHandler and on to the emitted telemetry event.
  • Depending on the chosen OTel provider, multi-tenancy may be supported. For example, here are the docs for Grafana Tempo.
  • When the traces are good enough, document how anyone can enable OpenTelemetry (/docs).

Local environment producing traces

Pre-requisites:

  • Dev setup
  • Tenant created in DB, as per the linked example.

To display the traces, I chose Grafana OTel because this project already uses the great library PromEx, so I figured Grafana was already part of the stack. Of course, this is very easy to change :).

  1. Turn on Grafana OTel collector + WebUI, run: docker compose up grafana-otel.
  2. Go to http://localhost:4300 and login with admin/admin.
  3. Turn on the development environment, run: make dev.otel
  4. Connect via the proxy with: psql postgresql://postgres.dev_tenant:postgres@localhost:6543/postgres
  5. Execute a query in psql: select * from _supavisor.tenants;
  6. Explore the traces in Explore -> Choose Tempo -> Query type: search.

I added dev.otel to the Makefile, which adds two environment variables to send the traces to http://localhost:4318.
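If no traces show up, it may help to check that the collector is actually listening on that port. The sketch below is not part of this PR: `otlp_traces_url` and `check_collector` are hypothetical helper names, and it assumes the collector accepts OTLP over HTTP with JSON bodies under the standard `/v1/traces` path.

```shell
# Hypothetical helper: build the OTLP/HTTP traces URL from a base endpoint.
otlp_traces_url() {
  # Strip one trailing slash, then append the standard OTLP traces path.
  printf '%s/v1/traces\n' "${1%/}"
}

# Hypothetical helper: POST an empty trace export and print the HTTP status code.
# Falls back to http://localhost:4318 when OTEL_EXPORTER_OTLP_ENDPOINT is unset.
check_collector() {
  curl -s -o /dev/null -w '%{http_code}\n' \
    -X POST -H 'Content-Type: application/json' \
    -d '{"resourceSpans":[]}' \
    "$(otlp_traces_url "${OTEL_EXPORTER_OTLP_ENDPOINT:-http://localhost:4318}")"
}
```

Running `check_collector` after `docker compose up grafana-otel` should print a 2xx status if the collector accepts the request; curl prints `000` when nothing is listening on the port.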

Visualizing the traces:

As you can see in the image, the trace shows the duration of the executed query with additional information that'll help filter traces when making queries.

[Screenshot: swappy-20240707_142332]

Additional context

Related issue: #93

Let me know what can be improved in this PR. I'll address the review comments when I have free time :).

rlopzc, Jul 07 '24 21:07