receive: wasting CPU on GC
Thanos, Prometheus and Golang version used:
thanos, version 0.32.5 (branch: HEAD, revision: 750e8a94eed5226cd4562117295d540a968c163c)
build user: root@053ebc7b5322
build date: 20231019-04:13:41
go version: go1.21.3
platform: linux/amd64
tags: netgo
Object Storage Provider: s3
What happened: after running for some time, receive pegs CPU.
What you expected to happen: CPU usage to be proportional to write load
How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components:
Anything else we need to know: https://pprof.me/aa81313c5472bcec2e81765384e11748
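For anyone else triaging a similar report: a minimal sketch of how a CPU profile like the one linked above can be captured, assuming the standard Go pprof handlers are reachable on the receive --http-address port (10902 here); the localhost URL is a placeholder for your own pod or port-forward.

// profile_fetch.go - grab a 30s CPU profile from a Thanos receive pod.
// Assumption: the pprof handlers are served on the --http-address port (10902)
// and localhost stands in for the real pod address or a kubectl port-forward.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	const url = "http://localhost:10902/debug/pprof/profile?seconds=30"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("fetching profile: %v", err)
	}
	defer resp.Body.Close()

	out, err := os.Create("receive-cpu.pb.gz")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing profile: %v", err)
	}
	log.Println("wrote receive-cpu.pb.gz; inspect with `go tool pprof receive-cpu.pb.gz`")
}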
What options do you have enabled? Have you tried --writer.intern?
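For context, interning deduplicates repeated label strings so that equal values share one backing allocation, which cuts allocation volume and therefore GC work on the write path. A minimal sketch of the general idea, not Thanos's actual writer code:

// intern.go - minimal string-interning sketch: repeated label values resolve
// to a single shared string, trading a map lookup for fewer live allocations.
package main

import (
	"fmt"
	"sync"
)

type interner struct {
	mu   sync.Mutex
	pool map[string]string
}

func newInterner() *interner {
	return &interner{pool: make(map[string]string)}
}

// Intern returns a canonical copy of s; callers that keep the returned value
// instead of their own copy let the duplicates be collected once, not on
// every incoming write.
func (i *interner) Intern(s string) string {
	i.mu.Lock()
	defer i.mu.Unlock()
	if v, ok := i.pool[s]; ok {
		return v
	}
	i.pool[s] = s
	return s
}

func main() {
	in := newInterner()
	a := in.Intern(`cluster="prod-eu-1"`)
	b := in.Intern(`cluster="prod-eu-1"`)
	fmt.Println(a == b) // true: same value, single stored copy
}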
To add to that, it would be great to see the full configuration of the receiver, including flags and hashring. It looks like it's stuck in GC, so I wonder if there is a routing loop.
receive
--log.level=info
--log.format=logfmt
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--remote-write.address=0.0.0.0:19291
--objstore.config=$(OBJSTORE_CONFIG)
--tsdb.path=/var/thanos/receive
--label=thanos_receive_replica="$(NAME)"
--label=receive="true"
--tsdb.retention=26h
--receive.local-endpoint=$(NAME).thanos-receive-headless.$(NAMESPACE).svc.cluster.local.:10901
--grpc-server-tls-cert=/cert/tls.crt
--grpc-server-tls-key=/cert/tls.key
--grpc-server-tls-client-ca=/cert/ca.crt
--label=metrics_namespace="global"
--receive.tenant-label-name=cluster
--receive.default-tenant-id=unknown
--receive.hashrings-file-refresh-interval=1m
--remote-write.server-tls-cert=/cert/tls.crt
--remote-write.server-tls-client-ca=/cert/ca.crt
--remote-write.server-tls-key=/cert/tls.key
--tsdb.memory-snapshot-on-shutdown
--tsdb.max-block-duration=1h
--tsdb.min-block-duration=1h
--writer.intern
We're running it behind a distributor (router):
receive
--log.level=info
--log.format=logfmt
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--remote-write.address=0.0.0.0:19291
--label=replica="$(NAME)"
--label=receive="true"
--receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
--receive.replication-factor=1
--grpc-server-tls-cert=/cert/tls.crt
--grpc-server-tls-key=/cert/tls.key
--grpc-server-tls-client-ca=/cert/ca.crt
--receive.grpc-compression=snappy
--receive-forward-timeout=30s
--receive.hashrings-algorithm=ketama
--receive.hashrings-file-refresh-interval=1m
--receive.relabel-config=$(RECEIVE_RELABEL_CONFIG)
--receive.tenant-label-name=cluster
--receive.default-tenant-id=unknown
--remote-write.client-tls-ca=/cert/ca.crt
--remote-write.client-tls-cert=/cert/tls.crt
--remote-write.client-tls-key=/cert/tls.key
--remote-write.client-server-name=thanos-receive-headless.monitoring.svc.cluster.local
--remote-write.server-tls-cert=/cert/tls.crt
--remote-write.server-tls-client-ca=/cert/ca.crt
--remote-write.server-tls-key=/cert/tls.key
The hashring is managed by https://github.com/observatorium/thanos-receive-controller
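For checking the routing-loop theory above, it can help to dump what the router actually loads from the hashrings file and confirm no endpoint points back at a router. A hedged sketch, assuming the classic file shape (an array of hashrings with optional tenants and a list of endpoint strings); newer Thanos versions also accept endpoint objects, so adjust the struct if the controller emits that form.

// hashring_dump.go - print the hashrings the router would load, to spot an
// endpoint that accidentally targets a router instead of an ingester.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// hashring mirrors the classic hashrings.json shape (assumption: the plain
// string endpoint form is in use here).
type hashring struct {
	Hashring  string   `json:"hashring"`
	Tenants   []string `json:"tenants"`
	Endpoints []string `json:"endpoints"`
}

func main() {
	data, err := os.ReadFile("/var/lib/thanos-receive/hashrings.json")
	if err != nil {
		log.Fatalf("reading hashrings file: %v", err)
	}

	var rings []hashring
	if err := json.Unmarshal(data, &rings); err != nil {
		log.Fatalf("parsing hashrings file: %v", err)
	}

	for _, r := range rings {
		fmt.Printf("hashring %q (tenants %v):\n", r.Hashring, r.Tenants)
		for _, ep := range r.Endpoints {
			fmt.Printf("  %s\n", ep)
		}
	}
}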
I cannot see anything wrong in the configuration. Maybe you can take a look at an allocation profile to see where objects are being allocated.
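The allocation data comes from the same pprof endpoint as the CPU profile above. A hedged sketch that pulls /debug/pprof/allocs and prints the cumulative allocated bytes as a quick sanity check before opening it in go tool pprof; it assumes port 10902 is reachable and uses the github.com/google/pprof/profile package to parse the result.

// alloc_top.go - fetch the allocation profile from a receive pod and report
// total allocated bytes since process start.
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/google/pprof/profile"
)

func main() {
	resp, err := http.Get("http://localhost:10902/debug/pprof/allocs")
	if err != nil {
		log.Fatalf("fetching allocation profile: %v", err)
	}
	defer resp.Body.Close()

	prof, err := profile.Parse(resp.Body)
	if err != nil {
		log.Fatalf("parsing profile: %v", err)
	}

	// Find the alloc_space sample type by name rather than hard-coding its
	// index, since the ordering of sample types is an implementation detail.
	idx := -1
	for i, st := range prof.SampleType {
		if st.Type == "alloc_space" {
			idx = i
		}
	}
	if idx < 0 {
		log.Fatal("profile has no alloc_space sample type")
	}

	var total int64
	for _, s := range prof.Sample {
		total += s.Value[idx]
	}
	fmt.Printf("cumulative allocated bytes since start: %d\n", total)
}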
Is the problematic one the router or the ingester?
Ingester