
TSM file state leads to inescapable loop where compactFull is not running

Open devanbenz opened this issue 4 months ago • 1 comments

We saw a state in which a shard would not perform full compactions, leading to a buildup of level 4 TSM files.

File state:

-rw-r--r--.  1 root root 2.1G Aug  5 20:59 000016684-000000007.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 21:02 000016684-000000008.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 21:04 000016684-000000009.tsm
-rw-r--r--.  1 root root 376M Aug  5 21:05 000016684-000000010.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 18:00 000016812-000000004.tsm
-rw-r--r--.  1 root root 1.4G Aug  5 18:00 000016812-000000005.tsm
-rw-r--r--.  1 root root 1.3G Aug  5 21:21 000016844-000000002.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 18:00 000016948-000000004.tsm
-rw-r--r--.  1 root root 1.4G Aug  5 18:00 000016948-000000005.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 18:00 000017076-000000004.tsm

There is a rogue level 2 file, 000016844-000000002.tsm, packed between fully compacted files

-rw-r--r--.  1 root root 2.1G Aug  5 20:59 000016684-000000007.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 21:02 000016684-000000008.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 21:04 000016684-000000009.tsm
-rw-r--r--.  1 root root 376M Aug  5 21:05 000016684-000000010.tsm

and level 4 files

-rw-r--r--.  1 root root 2.1G Aug  5 18:00 000016948-000000004.tsm
-rw-r--r--.  1 root root 1.4G Aug  5 18:00 000016948-000000005.tsm
-rw-r--r--.  1 root root 2.1G Aug  5 18:00 000017076-000000004.tsm
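
To make the file state above easier to reason about, here is a minimal sketch that reads the listing the way we understand the planner to read it. It assumes TSM file names have the form generation-sequence.tsm and that a generation's level is the sequence number of its first file, capped at 4; please check both assumptions against compact.go. The names parseTSMName, levelOf, groupByGeneration and strandedGenerations are hypothetical helpers for illustration only and are not part of influxdb.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

// parseTSMName splits a name such as "000016684-000000007.tsm" into
// generation 16684 and sequence 7. Hypothetical helper, not the tsm1 parser.
func parseTSMName(name string) (gen, seq int, err error) {
	base := strings.TrimSuffix(filepath.Base(name), ".tsm")
	genStr, seqStr, ok := strings.Cut(base, "-")
	if !ok {
		return 0, 0, fmt.Errorf("unexpected TSM file name %q", name)
	}
	if gen, err = strconv.Atoi(genStr); err != nil {
		return 0, 0, err
	}
	if seq, err = strconv.Atoi(seqStr); err != nil {
		return 0, 0, err
	}
	return gen, seq, nil
}

// levelOf maps a sequence number to a compaction level.
// Assumption: sequences 1 to 3 are levels 1 to 3, anything higher is level 4.
func levelOf(seq int) int {
	if seq < 4 {
		return seq
	}
	return 4
}

// generation groups the files that share one generation ID.
type generation struct {
	ID    int
	Level int
	Files []string
}

// groupByGeneration expects names in ascending order, as ls prints them.
// The level of a generation is taken from its first file.
func groupByGeneration(names []string) ([]generation, error) {
	var gens []generation
	for _, name := range names {
		id, seq, err := parseTSMName(name)
		if err != nil {
			return nil, err
		}
		if len(gens) == 0 || gens[len(gens)-1].ID != id {
			gens = append(gens, generation{ID: id, Level: levelOf(seq)})
		}
		last := &gens[len(gens)-1]
		last.Files = append(last.Files, name)
	}
	return gens, nil
}

// strandedGenerations returns generations below level 4 whose neighbours
// on both sides are already at level 4.
func strandedGenerations(gens []generation) []generation {
	var out []generation
	for i := 1; i < len(gens)-1; i++ {
		if gens[i].Level < 4 && gens[i-1].Level == 4 && gens[i+1].Level == 4 {
			out = append(out, gens[i])
		}
	}
	return out
}

func main() {
	// File names copied from the listing in this issue.
	names := []string{
		"000016684-000000007.tsm", "000016684-000000008.tsm",
		"000016684-000000009.tsm", "000016684-000000010.tsm",
		"000016812-000000004.tsm", "000016812-000000005.tsm",
		"000016844-000000002.tsm",
		"000016948-000000004.tsm", "000016948-000000005.tsm",
		"000017076-000000004.tsm",
	}
	gens, err := groupByGeneration(names)
	if err != nil {
		panic(err)
	}
	for _, g := range gens {
		fmt.Printf("generation %d: level %d, %d file(s)\n", g.ID, g.Level, len(g.Files))
	}
	for _, g := range strandedGenerations(gens) {
		fmt.Printf("stranded: generation %d at level %d\n", g.ID, g.Level)
	}
}
```

Under those assumptions, every generation in the listing is at level 4 except 16844, which is at level 2 and is the only one reported as stranded.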

The area of our code that would cause this state to be skipped is here:

https://github.com/influxdata/influxdb/blob/22bec4f046a28e3f1fa815705362767151407e1b/tsdb/engine/tsm1/compact.go#L620-L670

We need to either add some sort of escape mechanism that allows compactions to occur or simplify this logic.

Steps to reproduce: It would be very difficult to replicate this issue; we believe it was an artifact of running compactions on v1.12.1. We understand that the state outlined above results in a loop that never fully compacts the TSM files.

devanbenz, Aug 08 '25 17:08

We note that no released version of InfluxDB 1.x will enter this state. It is only possible through manual deletion of TSM files. It happened for us in a release candidate with a bug. Should this state occur (through manual intervention), we would like influxd to be able to exit it.

philjb, Aug 08 '25 18:08