
Metrics-generator sometimes doesn't clean WAL files after checkpoint/GC

luistilingue opened this issue 1 year ago

Tempo Version: 1.5.0 (also seen on the earlier 1.4.1)
Environment: GKE 1.22.x (staging and production), deployed using the tempo-distributed Helm chart
Storage: GCS

Behavior: Sometimes (intermittently), the metrics-generator doesn't clean up WAL files and its disk usage keeps growing indefinitely. The issue only occurs in some pods; after restarting those pods, the WAL files are cleaned up automatically.

Logs:

level=info ts=2022-08-16T07:32:43.160501839Z caller=registry.go:214 tenant=xxxx msg="collecting metrics" active_series=91773
level=info ts=2022-08-16T07:32:48.882228576Z caller=memberlist_client.go:498 msg="re-joined memberlist cluster" reached_nodes=47
level=info ts=2022-08-16T07:32:49.347881623Z caller=db.go:557 tenant=xxxx component=wal msg="series GC completed" duration=62.111636ms
level=info ts=2022-08-16T07:32:49.348110023Z caller=checkpoint.go:98 tenant=xxxx component=wal msg="Creating checkpoint" from_segment=20 to_segment=22 mint=1660625397000
level=info ts=2022-08-16T07:32:54.662843729Z caller=db.go:628 tenant=xxxx component=wal msg="WAL checkpoint complete" first=20 last=22 duration=5.377073422s
level=info ts=2022-08-16T07:32:58.029967412Z caller=registry.go:214 tenant=xxxx msg="collecting metrics" active_series=92664

WAL directory:

/var/tempo/wal/xxxx/wal # ls -lh
total 262M   
-rw-r--r--    1 root     root      127.9M Aug 16 11:16 00000029
-rw-r--r--    1 root     root       46.9M Aug 16 11:33 00000030
-rw-r--r--    1 root     root       87.5M Aug 16 12:02 00000031
drwxr-xr-x    2 root     root        4.0K Aug 16 11:33 checkpoint.00000028

/var/tempo/wal/xxxx/wal # ls -lh checkpoint.00000028/
total 750M   
-rw-r--r--    1 root     root      127.9M Aug 16 11:33 00000000
-rw-r--r--    1 root     root      127.6M Aug 16 11:33 00000001
-rw-r--r--    1 root     root      127.6M Aug 16 11:33 00000002
-rw-r--r--    1 root     root      127.8M Aug 16 11:33 00000003
-rw-r--r--    1 root     root      127.8M Aug 16 11:33 00000004
-rw-r--r--    1 root     root      110.9M Aug 16 11:33 00000005
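
As a quick way to quantify the split shown in the listings above, here is a small standalone Go sketch (not part of Tempo; the tenant path is just the one from this report) that sums the size of live segment files versus files sitting inside checkpoint.* folders:

package main

import (
    "fmt"
    "io/fs"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    // Per-tenant WAL path from this report (adjust for your tenant ID).
    walDir := "/var/tempo/wal/xxxx/wal"

    var segBytes, checkpointBytes int64
    err := filepath.WalkDir(walDir, func(path string, d fs.DirEntry, err error) error {
        if err != nil || d.IsDir() {
            return err
        }
        info, err := d.Info()
        if err != nil {
            return err
        }
        // Files inside checkpoint.NNNNNNNN folders vs. live segment files.
        if strings.Contains(path, string(os.PathSeparator)+"checkpoint.") {
            checkpointBytes += info.Size()
        } else {
            segBytes += info.Size()
        }
        return nil
    })
    if err != nil {
        fmt.Fprintln(os.Stderr, "walk failed:", err)
        os.Exit(1)
    }
    fmt.Printf("live segments: %.1f MiB, checkpoints: %.1f MiB\n",
        float64(segBytes)/(1<<20), float64(checkpointBytes)/(1<<20))
}

For the listings above that would report roughly 262M of live segments against 750M held in checkpoint.00000028.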

PVC usage panels:

[screenshot: PVC disk usage growing over time]

So, is there anything I could change in the Tempo configuration to fix this?

luistilingue avatar Aug 19 '22 23:08 luistilingue

We're also getting hit by this; the metrics-generator pod is eventually killed because it exceeds its ephemeral storage limit.

Symbianx avatar Sep 19 '22 08:09 Symbianx

Moving some comments made by @rfratto in Slack to this thread:

The expected behavior of the WAL cleanup is to:

  • Create a new segment file
  • Create a new checkpoint folder, moving the oldest two-thirds of segment files in the WAL to the folder
  • Remove old checkpoint folders

If the rate of created metrics is increasing, the WAL cleanup won't be able to catch up and disk space will continue growing. Do you know what specifically doesn't seem to be working properly?
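
Roughly, that cycle looks like the sketch below. The helper names (cutNewSegment, writeCheckpoint, deleteOldCheckpoints) are hypothetical stand-ins, not Tempo or Prometheus APIs; the real logic lives in the Prometheus TSDB WAL code that Tempo depends on:

package main

import "fmt"

// Hypothetical stand-ins for the real WAL operations; they only print here.
func cutNewSegment() error { fmt.Println("cut new segment"); return nil }

func writeCheckpoint(from, to int) error {
    fmt.Printf("checkpoint segments %d..%d\n", from, to)
    return nil
}

func deleteOldCheckpoints(upTo int) error {
    fmt.Printf("delete segments and older checkpoint folders up to %d\n", upTo)
    return nil
}

// truncateWAL sketches the cleanup cycle described above.
func truncateWAL(segments []int) error {
    // 1. Cut a new segment so writes keep flowing during the checkpoint.
    if err := cutNewSegment(); err != nil {
        return err
    }

    // 2. Checkpoint the oldest two-thirds of the completed segments.
    last := len(segments) * 2 / 3
    if last == 0 {
        return nil // nothing old enough to checkpoint yet
    }
    if err := writeCheckpoint(segments[0], segments[last-1]); err != nil {
        return err
    }

    // 3. Drop the checkpointed segments and any older checkpoint folders.
    // If this step stalls, or new series arrive faster than checkpoints
    // complete, disk usage keeps growing, which is the symptom here.
    return deleteOldCheckpoints(segments[last-1])
}

func main() {
    // Segment numbers roughly matching the directory listing in the report.
    _ = truncateWAL([]int{29, 30, 31})
}

The interesting part is step 3: if it never runs, or can't keep pace with the rate of new series, old segments and checkpoint folders pile up exactly as shown in the directory listing above.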

level=info ts=2022-08-16T07:32:49.348110023Z caller=checkpoint.go:98 tenant=xxxx component=wal msg="Creating checkpoint" from_segment=20 to_segment=22 mint=1660625397000
level=info ts=2022-08-16T07:32:54.662843729Z caller=db.go:628 tenant=xxxx component=wal msg="WAL checkpoint complete" first=20 last=22 duration=5.377073422s

Do you still see this being logged every time a checkpoint is supposed to be made?

joe-elliott avatar Sep 19 '22 12:09 joe-elliott

We recently fixed a remote write bug by upgrading our Prometheus dependency: https://github.com/grafana/tempo/pull/1752

If anyone is seeing this issue, could they double-check whether the PR above fixes it?

joe-elliott avatar Oct 07 '22 13:10 joe-elliott

I'll check it and see whether the PVCs still fill up over time.

luistilingue avatar Oct 07 '22 20:10 luistilingue

@joe-elliott I guess it was fixed by upgrading the Prometheus dependency. It was a very odd issue. Thank you very much.

luistilingue avatar Oct 11 '22 17:10 luistilingue

Nice! Glad to get this one wrapped up. Thanks for reporting back.

joe-elliott avatar Oct 11 '22 17:10 joe-elliott