trino icon indicating copy to clipboard operation
trino copied to clipboard

(WIP) Improve logging and OpenMetrics docs

Open mosabua opened this issue 1 year ago • 8 comments
trafficstars

Description

Additional context and related issues

Related to TCB 57

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.

mosabua avatar Mar 14 '24 20:03 mosabua

The slack channel recently swiftly educated me that for some metrics I need pull metrics from all nodes including workers and not just the coordinator, Prometheus supports k8s service discovery as standard and most implementations have a standard "prometheus-anotations" job configured, however in general that's unauthenticated.

I created two copies of that, one for the coordinator and one for the workers because in my case:

  • workers require the username (but not password) of a user with system information
  • coordinator requires both username and password and this can only be passed over https so must come in via the ingress (ALB in my case)

@mattstep requested I share some Prometheus configuration required to sour

Firstly in the helm for trino add the annotations

coordinator:
  annotations:
    prometheus.io/trino_scrape: "true"
worker:
  annotations:
    prometheus.io/trino_scrape: "true"

Then in the prometheus config

    - job_name: trino-metrics-worker
      scrape_interval: 10s
      scrape_timeout: 10s
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_trino_scrape]
        action: keep # scrape only pods with the trino scrape anotation
        regex: true
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep # dont try to scrape non trino container
        regex: trino-worker
      - action: hashmod
        modulus: $(SHARDS)
        source_labels:
        - __address__
        target_label: __tmp_hash
      - action: keep
        regex: $(SHARD)
        source_labels:
        - __tmp_hash
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container
      metric_relabel_configs:
          - source_labels: [__name__]
            regex: ".+_FifteenMinute.+|.+_FiveMinute.+|.+IterativeOptimizer.+|.*io_airlift_http_client_type_HttpClient.+"
            action: drop # droping some highly granular metrics 
          - source_labels: [__meta_kubernetes_pod_name]
            regex: ".+"
            target_label: pod
            action: replace 
          - source_labels: [__meta_kubernetes_pod_container_name]
            regex: ".+"
            target_label: container
            action: replace 
            
      scheme: http
      tls_config:
        insecure_skip_verify: true
      basic_auth:
        username: mysuer # replace with a user with system information permission 
        # DO NOT ADD PASSWORD
    - job_name: trino-metrics-coordinator
      scrape_interval: 10s
      scrape_timeout: 10s
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_trino_scrape]
        action: keep # scrape only pods with the trino scrape anotation
        regex: true
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep # dont try to scrape non trino container
        regex: trino-coordinator
      - action: hashmod
        modulus: $(SHARDS)
        source_labels:
        - __address__
        target_label: __tmp_hash
      - action: keep
        regex: $(SHARD)
        source_labels:
        - __tmp_hash
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container
      - action: replace  # overide the address to the https ingress address 
        target_label: __address__
        replacement: {{ .Values.trinourl }} 
      metric_relabel_configs:
          - source_labels: [__name__]
            regex: ".+_FifteenMinute.+|.+_FiveMinute.+|.+IterativeOptimizer.+|.*io_airlift_http_client_type_HttpClient.+"
            action: drop # droping some highly granular metrics 
          - source_labels: [__meta_kubernetes_pod_name]
            regex: ".+"
            target_label: pod
            action: replace 
          - source_labels: [__meta_kubernetes_pod_container_name]
            regex: ".+"
            target_label: container
            action: replace 
            
      scheme: https
      tls_config:
        insecure_skip_verify: true
      basic_auth:
        username: mysuer # replace with a user with system information permission 
        password_file: /some/password/file

lozbrown avatar Mar 21 '24 11:03 lozbrown

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

github-actions[bot] avatar Apr 11 '24 17:04 github-actions[bot]

This looks good to me, and I think would benefit from the more complex example from @lozbrown , this gives a more realistic configuration for a production usecase.

mattstep avatar Apr 11 '24 17:04 mattstep

I have a whole bunch of further info and details from @lozbrown and others that I will add .. just have to get back to working on this.

mosabua avatar Apr 11 '24 18:04 mosabua

I think it would be good to include a list of the things you typically monitor in your dashboard as experts.

Possibly including promql for these

lozbrown avatar Apr 11 '24 18:04 lozbrown

Agreed @lozbrown .. I will work the ones you supplied into this PR .. and if @mattstep or others have more examples.. they can be added too. We can also do more updates after this PR gets merged.

mosabua avatar Apr 11 '24 18:04 mosabua

@mosabua can we move along some sort of first attempt here on the basis something is better than completely undocumented features

lozbrown avatar May 24 '24 13:05 lozbrown

Its on my backlog but I am flat out busy .. I will try to get a minimal PR merged first and expand later at this stage.

mosabua avatar May 24 '24 15:05 mosabua

@lozbrown I have updated this PR quite a bit now, but still work some of your material into it. However you can already review now

Regarding https://github.com/trinodb/trino/pull/21089#issuecomment-2012011844 .. what do you think @nineinchnick .. is there enough docs info about this in the Helm chart docs or do we want to add more and then maybe link back to the new page from this PR?

Also @mattstep .. can you take a peek at the JSON and TCP logging stuff?

mosabua avatar Jan 15 '25 00:01 mosabua

I have raised a new issue here to request openmetrics endpoint become (configurably) anonymous https://github.com/trinodb/trino/issues/24725

This complexity is the main reason many are still using the jmx_exporter to the point its been introduces into the helm chart as it allows anonymous scrape

That would significantly reduce the complexity of scraping the metrics in autoscaling and kubernetes environments.

I still feel that the running of prometheus is outside the scope of trino documentation but possibly a link to the annotation to cause metrics to be scraped would be useful. In many cases the tooling will be using things like grafana-agent grafana alloy opentelemetry agent to push data to a more modern stack like cortex / mimir as running prometheus in the above way is somewhat uncommon now.

Other points:

  • Documentation still does not mention configuration property openmetrics.jmx-object-names=
  • Documentation could mention some suggested metrics
  • Documentation based on an unsecured cluster is misleadingly simple given the complexities described here

lozbrown avatar Jan 16 '25 11:01 lozbrown

Thank you for taking the time to review @electrum

mosabua avatar Feb 04 '25 22:02 mosabua