Thanos receive eats all the memory when performing the "WAL segment loaded" operation
Thanos, Prometheus and Golang version used: Thanos 0.32.5
Object Storage Provider: Ceph RADOS (S3-compatible)
What happened: Thanos receive tries to replay the WAL, eats all the RAM, and dies.
What you expected to happen: Is there any workaround to limit the resources used by that step (or skip it completely)?
How to reproduce it (as minimally and precisely as possible): Don't know yet. We have a Prometheus instance that pushes data to this Receive service via remote_write:
```yaml
### Thanos Receive service
remote_write:
  - url: https://thanos-receive-thanos.apps.okd.elorg.int/api/v1/receive
    tls_config:
      insecure_skip_verify: true
```
I tried removing everything (local disk data, etc., except the data in S3), but the problem persists. OKD does not accept any requests while the thanos receive pod is initializing, so I guess the receive service pulls that data directly from the S3 bucket?
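One last-resort workaround I'm considering (not a fix for the root cause): move the per-tenant WAL directory aside so the next startup has nothing to replay. This drops any samples that were only in the WAL and not yet cut into blocks or uploaded, so it is only acceptable if that data loss is tolerable. A sketch, using the data-dir layout visible in the logs below (`/var/thanos/receive/default-tenant`), simulated here in a temp dir:

```shell
# Simulate the receive data dir from the logs (per-tenant dir with a WAL).
DATA_DIR="$(mktemp -d)"
mkdir -p "$DATA_DIR/default-tenant/wal"
touch "$DATA_DIR/default-tenant/wal/00002416"   # fake WAL segment

# Rename rather than delete, so segments can be inspected or restored later.
# On the next startup the head is rebuilt empty and WAL replay is skipped.
mv "$DATA_DIR/default-tenant/wal" "$DATA_DIR/default-tenant/wal.bak"

ls "$DATA_DIR/default-tenant"
```

On a real pod the `mv` would target the actual tenant directory under the `--tsdb.path` volume while the pod is stopped.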
Full logs of relevant components:

```
ts=2023-10-27T00:53:24.618072078Z caller=main.go:67 level=debug msg="maxprocs: Updating GOMAXPROCS=[4]: determined from CPU quota"
ts=2023-10-27T00:53:24.618872551Z caller=factory.go:43 level=info msg="loading tracing configuration"
ts=2023-10-27T00:53:24.627085357Z caller=receive.go:134 level=info component=receive mode=RouterIngestor msg="running receive"
ts=2023-10-27T00:53:24.627149847Z caller=options.go:26 level=info component=receive protocol=HTTP msg="disabled TLS, key and cert must be set to enable"
ts=2023-10-27T00:53:24.627291187Z caller=factory.go:53 level=info component=receive msg="loading bucket configuration"
ts=2023-10-27T00:53:24.632078492Z caller=receive.go:736 level=info component=receive msg="default tenant data dir already present, not attempting to migrate storage"
ts=2023-10-27T00:53:24.632427571Z caller=receive.go:278 level=debug component=receive msg="setting up TSDB"
ts=2023-10-27T00:53:24.632489416Z caller=receive.go:585 level=debug component=receive msg="removing storage lock files if any"
ts=2023-10-27T00:53:24.636701835Z caller=multitsdb.go:478 level=info component=receive component=multi-tsdb msg="a leftover lockfile found and removed" tenant=default-tenant
ts=2023-10-27T00:53:24.636759175Z caller=receive.go:672 level=info component=receive component=uploader msg="upload enabled, starting initial sync"
ts=2023-10-27T00:53:24.636769054Z caller=receive.go:660 level=debug component=receive component=uploader msg="upload phase starting"
ts=2023-10-27T00:53:24.636828995Z caller=receive.go:668 level=debug component=receive component=uploader msg="upload phase done" uploaded=0 elapsed=44.727µs
ts=2023-10-27T00:53:24.636865283Z caller=receive.go:676 level=info component=receive component=uploader msg="initial sync done"
ts=2023-10-27T00:53:24.636877535Z caller=receive.go:286 level=debug component=receive msg="setting up hashring"
ts=2023-10-27T00:53:24.637190714Z caller=receive.go:293 level=debug component=receive msg="setting up HTTP server"
ts=2023-10-27T00:53:24.63725079Z caller=receive.go:311 level=debug component=receive msg="setting up gRPC server"
ts=2023-10-27T00:53:24.637266749Z caller=options.go:26 level=info component=receive protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
ts=2023-10-27T00:53:24.637877216Z caller=receive.go:384 level=debug component=receive msg="setting up receive HTTP handler"
ts=2023-10-27T00:53:24.63790946Z caller=receive.go:413 level=debug component=receive msg="setting up periodic tenant pruning"
ts=2023-10-27T00:53:24.637935857Z caller=receive.go:444 level=info component=receive msg="starting receiver"
ts=2023-10-27T00:53:24.63806818Z caller=intrumentation.go:75 level=info component=receive msg="changing probe status" status=healthy
ts=2023-10-27T00:53:24.638136186Z caller=http.go:73 level=info component=receive service=http/server component=receive msg="listening for requests and metrics" address=0.0.0.0:10902
ts=2023-10-27T00:53:24.63822119Z caller=receive.go:371 level=info component=receive msg="listening for StoreAPI and WritableStoreAPI gRPC" address=0.0.0.0:10901
ts=2023-10-27T00:53:24.638274309Z caller=intrumentation.go:75 level=info component=receive msg="changing probe status" status=healthy
ts=2023-10-27T00:53:24.638366321Z caller=handler.go:327 level=info component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19291
ts=2023-10-27T00:53:24.638509307Z caller=tls_config.go:274 level=info component=receive service=http/server component=receive msg="Listening on" address=[::]:10902
ts=2023-10-27T00:53:24.638526437Z caller=config.go:273 level=debug component=receive component=config-watcher msg="refreshed hashring config"
ts=2023-10-27T00:53:24.638546629Z caller=tls_config.go:277 level=info component=receive service=http/server component=receive msg="TLS is disabled." http2=false address=[::]:10902
ts=2023-10-27T00:53:24.638606309Z caller=receive.go:535 level=info component=receive msg="Set up hashring for the given hashring config."
ts=2023-10-27T00:53:24.638639779Z caller=intrumentation.go:67 level=warn component=receive msg="changing probe status" status=not-ready reason="hashring has changed; server is not ready to receive requests"
ts=2023-10-27T00:53:24.638659556Z caller=receive.go:631 level=info component=receive msg="hashring has changed; server is not ready to receive requests"
ts=2023-10-27T00:53:24.638665428Z caller=receive.go:633 level=info component=receive msg="updating storage"
ts=2023-10-27T00:53:24.638637055Z caller=grpc.go:131 level=info component=receive service=gRPC/server component=receive msg="listening for serving gRPC" address=0.0.0.0:10901
ts=2023-10-27T00:53:24.63874305Z caller=handler.go:354 level=info component=receive component=receive-handler msg="Serving plain HTTP" address=0.0.0.0:19291
ts=2023-10-27T00:53:24.639682768Z caller=multitsdb.go:570 level=info component=receive component=multi-tsdb tenant=default-tenant msg="opening TSDB"
ts=2023-10-27T00:53:24.645180296Z caller=repair.go:56 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found healthy block" mint=1698017477682 maxt=1698019200000 ulid=01HDQ7104R0V13E9QZ7WSZE1RJ
ts=2023-10-27T00:53:24.673478411Z caller=db.go:930 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Found and deleted tmp block dir" dir=/var/thanos/receive/default-tenant/01HDQ9W5MX3E5ZVBQCMG7FEZ89.tmp-for-creation
ts=2023-10-27T00:53:24.859919555Z caller=head.go:601 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Replaying on-disk memory mappable chunks if any"
ts=2023-10-27T00:53:54.638974414Z caller=receive.go:660 level=debug component=receive component=uploader msg="upload phase starting"
ts=2023-10-27T00:53:54.639042708Z caller=multitsdb.go:436 level=debug component=receive component=multi-tsdb msg="uploading block for tenant" tenant=default-tenant
ts=2023-10-27T00:53:54.639060444Z caller=receive.go:668 level=debug component=receive component=uploader msg="upload phase done" uploaded=0 elapsed=20.897µs
ts=2023-10-27T00:54:19.545589763Z caller=head.go:667 level=error component=receive component=multi-tsdb tenant=default-tenant msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 47496812, last chunk: [1698037825371, 1698038245371], new: [1698023770371, 1698024310371]"
ts=2023-10-27T00:54:19.545660368Z caller=head.go:907 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Deleting mmapped chunk files"
ts=2023-10-27T00:54:19.547780561Z caller=head.go:913 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Deleting mmapped chunk files"
ts=2023-10-27T00:54:19.547833159Z caller=head.go:916 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Deletion of corrupted mmap chunk files failed, discarding chunk files completely" err="cannot handle error: iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 47496812, last chunk: [1698037825371, 1698038245371], new: [1698023770371, 1698024310371]"
ts=2023-10-27T00:54:20.358718902Z caller=head.go:682 level=info component=receive component=multi-tsdb tenant=default-tenant msg="On-disk memory mappable chunks replay completed" duration=55.498736431s
ts=2023-10-27T00:54:20.358776854Z caller=head.go:690 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Replaying WAL, this may take a while"
ts=2023-10-27T00:54:24.638433244Z caller=receive.go:660 level=debug component=receive component=uploader msg="upload phase starting"
ts=2023-10-27T00:54:24.638530709Z caller=multitsdb.go:436 level=debug component=receive component=multi-tsdb msg="uploading block for tenant" tenant=default-tenant
ts=2023-10-27T00:54:24.63854684Z caller=receive.go:668 level=debug component=receive component=uploader msg="upload phase done" uploaded=0 elapsed=22.892µs
ts=2023-10-27T00:54:42.672586586Z caller=head.go:726 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WAL checkpoint loaded"
ts=2023-10-27T00:54:44.590346416Z caller=head.go:761 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WAL segment loaded" segment=2416 maxSegment=2812
ts=2023-10-27T00:54:46.581111697Z caller=head.go:761 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WAL segment loaded" segment=2417 maxSegment=2812
ts=2023-10-27T00:54:48.48897029Z caller=head.go:761 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WAL segment loaded" segment=2418 maxSegment=2812
ts=2023-10-27T00:54:50.330163387Z caller=head.go:761 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WAL segment loaded" segment=2419 maxSegment=2812
```
Maybe related to #4093
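For anyone else hitting this: as far as I can tell there is no Receive flag that caps memory during WAL replay specifically, so the only mitigations I can think of are reducing the amount of WAL to replay (compression) and giving the pod enough headroom for the replay spike. A hedged sketch of the container spec — the names and sizes are illustrative, not from this deployment, and flag availability depends on your Thanos version:

```yaml
# Hypothetical StatefulSet fragment -- names/values are assumptions, not from this issue.
containers:
  - name: thanos-receive
    args:
      - receive
      - --tsdb.path=/var/thanos/receive
      - --tsdb.wal-compression      # smaller WAL segments => less to replay (check your version)
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 8Gi                 # headroom for the WAL-replay spike, else the pod OOMs in a crash loop
```

Note the replay happens on local disk (the WAL under `--tsdb.path`), not from the S3 bucket, which matches the `caller=head.go` lines above.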