Metrics-generator sometimes doesn't clean up WAL files after checkpoint/GC
Tempo Version: 1.5.0 (also seen on 1.4.1)
Environment: GKE 1.22.x (staging and production), deployed using Helm chart tempo-distributed
Storage: GCS
Behavior: Intermittently, the metrics-generator doesn't clean up WAL files and its disk usage keeps growing indefinitely. The issue only occurs in some pods; after restarting those pods, the WAL files are cleaned up automatically.
Logs:
level=info ts=2022-08-16T07:32:43.160501839Z caller=registry.go:214 tenant=xxxx msg="collecting metrics" active_series=91773
level=info ts=2022-08-16T07:32:48.882228576Z caller=memberlist_client.go:498 msg="re-joined memberlist cluster" reached_nodes=47
level=info ts=2022-08-16T07:32:49.347881623Z caller=db.go:557 tenant=xxxx component=wal msg="series GC completed" duration=62.111636ms
level=info ts=2022-08-16T07:32:49.348110023Z caller=checkpoint.go:98 tenant=xxxx component=wal msg="Creating checkpoint" from_segment=20 to_segment=22 mint=1660625397000
level=info ts=2022-08-16T07:32:54.662843729Z caller=db.go:628 tenant=xxxx component=wal msg="WAL checkpoint complete" first=20 last=22 duration=5.377073422s
level=info ts=2022-08-16T07:32:58.029967412Z caller=registry.go:214 tenant=xxxx msg="collecting metrics" active_series=92664
WAL directory:
/var/tempo/wal/xxxx/wal # ls -lh
total 262M
-rw-r--r-- 1 root root 127.9M Aug 16 11:16 00000029
-rw-r--r-- 1 root root 46.9M Aug 16 11:33 00000030
-rw-r--r-- 1 root root 87.5M Aug 16 12:02 00000031
drwxr-xr-x 2 root root 4.0K Aug 16 11:33 checkpoint.00000028
/var/tempo/wal/xxxx/wal # ls -lh checkpoint.00000028/
total 750M
-rw-r--r-- 1 root root 127.9M Aug 16 11:33 00000000
-rw-r--r-- 1 root root 127.6M Aug 16 11:33 00000001
-rw-r--r-- 1 root root 127.6M Aug 16 11:33 00000002
-rw-r--r-- 1 root root 127.8M Aug 16 11:33 00000003
-rw-r--r-- 1 root root 127.8M Aug 16 11:33 00000004
-rw-r--r-- 1 root root 110.9M Aug 16 11:33 00000005
PVC usage panels: [screenshot not included]
So, is there anything I could change in the Tempo configuration to fix this?
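For reference, this is roughly what the relevant part of our metrics-generator storage configuration looks like. It's a minimal sketch with placeholder values (the remote-write URL is not the real one, and other settings are omitted):

metrics_generator:
  storage:
    # Local path holding the per-tenant Prometheus WAL shown in the listing above
    # (tenant directories such as xxxx/wal live underneath this path).
    path: /var/tempo/wal
    # Generated samples are shipped to a Prometheus-compatible endpoint via remote write.
    remote_write:
      - url: http://prometheus.example.local/api/v1/write   # placeholder URL
        send_exemplars: true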
We're also getting hit by this; the metrics-generator pod eventually gets killed because it exceeds its ephemeral storage limit.
Moving some comments made by @rfratto in Slack to this thread:
the expected behavior of the WAL cleanup is to:
- Create a new segment file
- Create a new checkpoint folder, moving the oldest two-thirds of segment files in the WAL to the folder
- Remove old checkpoint folders
If the rate of created metrics is increasing, the WAL cleanup won't be able to catch up and disk space will continue growing.
Do you know what specifically doesn't seem to be working properly?
level=info ts=2022-08-16T07:32:49.348110023Z caller=checkpoint.go:98 tenant=xxxx component=wal msg="Creating checkpoint" from_segment=20 to_segment=22 mint=1660625397000
level=info ts=2022-08-16T07:32:54.662843729Z caller=db.go:628 tenant=xxxx component=wal msg="WAL checkpoint complete" first=20 last=22 duration=5.377073422s
Do you still see this being logged every time a checkpoint is supposed to be made?
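For context on the "oldest two-thirds" step: it comes from the Prometheus TSDB agent WAL that the metrics-generator embeds. Below is a minimal Go sketch of that arithmetic as I understand it, not the actual Tempo/Prometheus code; the segment numbers are just taken from the log line quoted above.

package main

import "fmt"

// checkpointRange sketches the truncation heuristic: the newest segment is
// never checkpointed, and of the remaining segments the oldest two-thirds
// are moved into a checkpoint folder while the newest third stays live.
func checkpointRange(first, last int) (from, to int, ok bool) {
	last-- // never consider the most recent segment
	if last < 0 {
		return 0, 0, false
	}
	to = first + (last-first)*2/3
	if to <= first {
		return 0, 0, false // too few segments to be worth a checkpoint
	}
	return first, to, true
}

func main() {
	// With segments 20..24 on disk (24 being the newest), the checkpoint
	// covers 20..22 -- matching "from_segment=20 to_segment=22" above.
	fmt.Println(checkpointRange(20, 24)) // 20 22 true
}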
We recently fixed a remote write bug by upgrading our Prometheus dependency: https://github.com/grafana/tempo/pull/1752
If anyone is seeing this issue, could they double-check whether the PR above fixes it?
I'll check whether the PVCs stop filling up over time.
@joe-elliott I guess it was fixed by upgrading the Prometheus dependency. It was a very odd issue. Thank you very much.
Nice! Glad to get this one wrapped up. Thanks for reporting back.