
receive: wasting CPU on GC

Open · dctrwatson opened this issue 1 year ago · 6 comments

Thanos, Prometheus and Golang version used:

thanos, version 0.32.5 (branch: HEAD, revision: 750e8a94eed5226cd4562117295d540a968c163c)
  build user:       root@053ebc7b5322
  build date:       20231019-04:13:41
  go version:       go1.21.3
  platform:         linux/amd64
  tags:             netgo

Object Storage Provider: s3

What happened: After running for some time, receive pegs the CPU.

What you expected to happen: CPU usage to be proportional to the write load.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Anything else we need to know: https://pprof.me/aa81313c5472bcec2e81765384e11748

dctrwatson · Dec 29 '23
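
(For reproducing this beyond the linked pprof.me profile: Thanos components expose the standard Go pprof handlers under /debug/pprof on their HTTP port, so an equivalent profile can be pulled directly. The pod name below is a placeholder, and the elided flags are whatever the receiver normally runs with.)

# 30s CPU profile from the receiver's HTTP port (10902 per the flags below), opened in the pprof web UI
go tool pprof -http :8080 "http://<receive-pod>:10902/debug/pprof/profile?seconds=30"

# Optionally, the Go runtime itself can confirm how much time goes to GC:
GODEBUG=gctrace=1 thanos receive ...   # prints one summary line per GC cycle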

What options do you have enabled? Have you tried --writer.intern?

GiedriusS · Dec 30 '23
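
(Context on that flag: --writer.intern enables string interning in the receive writer, so repeated label strings share one stored copy instead of being re-allocated per request, which reduces allocation churn and therefore GC work. Below is only a generic Go sketch of the interning idea, not Thanos's actual implementation.)

// Generic illustration of string interning; not Thanos code.
package main

import (
	"fmt"
	"sync"
)

type interner struct {
	mu sync.Mutex
	m  map[string]string
}

func newInterner() *interner { return &interner{m: map[string]string{}} }

// intern returns the canonical stored copy of s, adding it on first sight,
// so callers can drop their own freshly allocated copies.
func (i *interner) intern(s string) string {
	i.mu.Lock()
	defer i.mu.Unlock()
	if c, ok := i.m[s]; ok {
		return c
	}
	i.m[s] = s
	return s
}

func main() {
	in := newInterner()
	a := in.intern(`cluster="prod"`)
	b := in.intern(`cluster="prod"`)
	fmt.Println(a == b) // true; the second call returned the first call's stored string
}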

To add to that, it would be great to see the full configuration of the receiver, including flags and the hashring. It looks like it's stuck in GC, so I wonder if there is a routing loop.

fpetkovski · Jan 01 '24

The ingesters run with:

receive
--log.level=info
--log.format=logfmt
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--remote-write.address=0.0.0.0:19291
--objstore.config=$(OBJSTORE_CONFIG)
--tsdb.path=/var/thanos/receive
--label=thanos_receive_replica="$(NAME)"
--label=receive="true"
--tsdb.retention=26h
--receive.local-endpoint=$(NAME).thanos-receive-headless.$(NAMESPACE).svc.cluster.local.:10901
--grpc-server-tls-cert=/cert/tls.crt
--grpc-server-tls-key=/cert/tls.key
--grpc-server-tls-client-ca=/cert/ca.crt
--label=metrics_namespace="global"
--receive.tenant-label-name=cluster
--receive.default-tenant-id=unknown
--receive.hashrings-file-refresh-interval=1m
--remote-write.server-tls-cert=/cert/tls.crt
--remote-write.server-tls-client-ca=/cert/ca.crt
--remote-write.server-tls-key=/cert/tls.key
--tsdb.memory-snapshot-on-shutdown
--tsdb.max-block-duration=1h
--tsdb.min-block-duration=1h
--writer.intern

We're running with a distributor (routing) layer in front:

receive
--log.level=info
--log.format=logfmt
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--remote-write.address=0.0.0.0:19291
--label=replica="$(NAME)"
--label=receive="true"
--receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
--receive.replication-factor=1
--grpc-server-tls-cert=/cert/tls.crt
--grpc-server-tls-key=/cert/tls.key
--grpc-server-tls-client-ca=/cert/ca.crt
--receive.grpc-compression=snappy
--receive-forward-timeout=30s
--receive.hashrings-algorithm=ketama
--receive.hashrings-file-refresh-interval=1m
--receive.relabel-config=$(RECEIVE_RELABEL_CONFIG)
--receive.tenant-label-name=cluster
--receive.default-tenant-id=unknown
--remote-write.client-tls-ca=/cert/ca.crt
--remote-write.client-tls-cert=/cert/tls.crt
--remote-write.client-tls-key=/cert/tls.key
--remote-write.client-server-name=thanos-receive-headless.monitoring.svc.cluster.local
--remote-write.server-tls-cert=/cert/tls.crt
--remote-write.server-tls-client-ca=/cert/ca.crt
--remote-write.server-tls-key=/cert/tls.key

The hashring is managed by https://github.com/observatorium/thanos-receive-controller

dctrwatson · Jan 03 '24
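
(For reference, the hashrings file the controller generates follows the standard Thanos format, roughly as below; the endpoint names here are illustrative, not taken from this setup.)

[
  {
    "hashring": "default",
    "tenants": [],
    "endpoints": [
      "thanos-receive-0.thanos-receive-headless.monitoring.svc.cluster.local:10901",
      "thanos-receive-1.thanos-receive-headless.monitoring.svc.cluster.local:10901"
    ]
  }
]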

I cannot see anything wrong in the configuration. Maybe you can take a look at an allocation profile to see where objects are being allocated.

fpetkovski · Jan 04 '24
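
(A quick way to do that, assuming the standard pprof endpoints on the ingester's HTTP port, 10902 here; the pod name is a placeholder:)

# cumulative allocations since process start
go tool pprof "http://<receive-pod>:10902/debug/pprof/allocs"
# live heap only
go tool pprof "http://<receive-pod>:10902/debug/pprof/heap"
# inside the pprof prompt, `top` and `web` show the hottest allocation sites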

Is the problematic one the router or the ingester?

MichaHoffmann · Jan 13 '24

Is the problematic one the router or the ingester?

Ingester

dctrwatson · Jan 15 '24