(WIP) Improve logging and OpenMetrics docs
Description
Additional context and related issues
Related to TCB 57
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
The Slack channel recently (and swiftly) educated me that for some metrics I need to pull from all nodes, including the workers, not just the coordinator. Prometheus supports Kubernetes service discovery as standard, and most installations have a standard "prometheus-annotations" job configured; however, that job is generally unauthenticated.
I created two copies of that job, one for the coordinator and one for the workers, because in my case:
- workers require the username (but not the password) of a user with the system information permission
- the coordinator requires both username and password, which can only be passed over HTTPS, so scrapes must go via the ingress (an ALB in my case); a Trino-side sketch of the permission setup follows this list
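On the Trino side, the scrape user needs the system information permission. A minimal sketch using file-based access control; the user name and file locations are placeholders, and `http-server.authentication.allow-insecure-over-http=true` is only relevant if you authenticate to workers over plain HTTP:

```json
{
  "system_information": [
    {
      "user": "myuser",
      "allow": ["read"]
    }
  ]
}
```

Point Trino at the rules file with `access-control.name=file` and `security.config-file`, and the scrape user can then read the metrics endpoint without any broader privileges.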
@mattstep requested I share some of the Prometheus configuration required to source these metrics.
First, in the Helm values for Trino, add the annotations:

```yaml
coordinator:
  annotations:
    prometheus.io/trino_scrape: "true"
worker:
  annotations:
    prometheus.io/trino_scrape: "true"
```
Then add the two jobs below to the Prometheus configuration:
```yaml
- job_name: trino-metrics-worker
  scrape_interval: 10s
  scrape_timeout: 10s
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_trino_scrape]
      action: keep # scrape only pods with the trino scrape annotation
      regex: true
    - source_labels: [__meta_kubernetes_pod_container_name]
      action: keep # don't try to scrape non-trino containers
      regex: trino-worker
    - action: hashmod # shard targets across Prometheus replicas; SHARDS is substituted at deploy time
      modulus: $(SHARDS)
      source_labels:
        - __address__
      target_label: __tmp_hash
    - action: keep # keep only the targets assigned to this shard
      regex: $(SHARD)
      source_labels:
        - __tmp_hash
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - source_labels: [__meta_kubernetes_pod_container_name]
      action: replace
      target_label: container
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: ".+_FifteenMinute.+|.+_FiveMinute.+|.+IterativeOptimizer.+|.*io_airlift_http_client_type_HttpClient.+"
      action: drop # drop some highly granular metrics
  scheme: http
  basic_auth:
    username: myuser # replace with a user with the system information permission
    # DO NOT ADD A PASSWORD: the worker endpoint is plain HTTP
- job_name: trino-metrics-coordinator
  scrape_interval: 10s
  scrape_timeout: 10s
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_trino_scrape]
      action: keep # scrape only pods with the trino scrape annotation
      regex: true
    - source_labels: [__meta_kubernetes_pod_container_name]
      action: keep # don't try to scrape non-trino containers
      regex: trino-coordinator
    - action: hashmod # shard targets across Prometheus replicas; SHARDS is substituted at deploy time
      modulus: $(SHARDS)
      source_labels:
        - __address__
      target_label: __tmp_hash
    - action: keep # keep only the targets assigned to this shard
      regex: $(SHARD)
      source_labels:
        - __tmp_hash
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: pod
    - source_labels: [__meta_kubernetes_pod_container_name]
      action: replace
      target_label: container
    - action: replace # override the address with the https ingress address
      target_label: __address__
      replacement: {{ .Values.trinourl }}
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: ".+_FifteenMinute.+|.+_FiveMinute.+|.+IterativeOptimizer.+|.*io_airlift_http_client_type_HttpClient.+"
      action: drop # drop some highly granular metrics
  scheme: https
  tls_config:
    insecure_skip_verify: true
  basic_auth:
    username: myuser # replace with a user with the system information permission
    password_file: /some/password/file
```
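Before wiring up service discovery, it can help to validate the coordinator credentials with a minimal static job; a sketch, where the target hostname is a placeholder for your ingress:

```yaml
- job_name: trino-coordinator-smoke-test
  scheme: https
  metrics_path: /metrics
  static_configs:
    - targets: ["trino.example.com:443"] # placeholder: your ingress / ALB hostname
  basic_auth:
    username: myuser # placeholder: a user with the system information permission
    password_file: /some/password/file
```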
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua
This looks good to me, and I think it would benefit from the more complex example from @lozbrown, which gives a more realistic configuration for a production use case.
I have a whole bunch of further info and details from @lozbrown and others that I will add .. just have to get back to working on this.
I think it would be good to include a list of the things you, as experts, typically monitor in your dashboards, possibly including the PromQL for these.
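To make that concrete, here is a sketch of how such suggestions could be captured as Prometheus recording rules. The metric names are illustrative assumptions only; the actual names depend on which MBeans are exposed (see openmetrics.jmx-object-names below) and on the JMX-to-OpenMetrics name mapping:

```yaml
groups:
  - name: trino-suggested
    rules:
      # Cluster-wide rate of failed queries; metric name is an assumption.
      - record: trino:failed_queries:rate5m
        expr: sum(rate(trino_execution_name_QueryManager_FailedQueries_TotalCount[5m]))
      # Currently running queries; metric name is an assumption.
      - record: trino:running_queries
        expr: sum(trino_execution_name_QueryManager_RunningQueries)
```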
Agreed @lozbrown .. I will work the ones you supplied into this PR .. and if @mattstep or others have more examples.. they can be added too. We can also do more updates after this PR gets merged.
@mosabua can we move along some sort of first attempt here, on the basis that something is better than completely undocumented features?
It's on my backlog but I am flat out busy .. at this stage I will try to get a minimal PR merged first and expand later.
@lozbrown I have updated this PR quite a bit now, but still need to work some of your material into it. However, you can already review it now.
Regarding https://github.com/trinodb/trino/pull/21089#issuecomment-2012011844 .. what do you think @nineinchnick .. is there enough information about this in the Helm chart docs, or do we want to add more and then maybe link back to the new page from this PR?
Also @mattstep .. can you take a peek at the JSON and TCP logging stuff?
I have raised a new issue to request that the OpenMetrics endpoint can (configurably) be made anonymous: https://github.com/trinodb/trino/issues/24725
This complexity is the main reason many are still using the jmx_exporter, to the point that it has been introduced into the Helm chart, since it allows anonymous scraping.
That would significantly reduce the complexity of scraping the metrics in autoscaling and kubernetes environments.
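For comparison, enabling the chart's jmx_exporter path looks roughly like this; the key names are from my reading of the trinodb/charts values.yaml and should be verified against your chart version:

```yaml
jmx:
  enabled: true # expose the JMX registry from the Trino pods
  exporter:
    enabled: true # run the jmx_exporter, which serves metrics over plain HTTP
    port: 5556 # scrape this port anonymously
```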
I still feel that running Prometheus is outside the scope of the Trino documentation, but a link to the annotation that causes metrics to be scraped would possibly be useful. In many cases the tooling will use agents like Grafana Agent, Grafana Alloy, or the OpenTelemetry Collector to push data to a more modern stack like Cortex or Mimir, as running Prometheus in the way shown above is somewhat uncommon now.
Other points:
- The documentation still does not mention the configuration property openmetrics.jmx-object-names (a hypothetical example follows this list)
- The documentation could mention some suggested metrics
- Documentation based on an unsecured cluster is misleadingly simple, given the complexities described here
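For the first point, a hypothetical example of that property; the object-name patterns are illustrative, and multiple patterns are, as far as I know, separated with |:

```properties
openmetrics.jmx-object-names=trino.execution:name=QueryManager|java.lang:type=Memory
```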
Thank you for taking the time to review @electrum