TSM file state leads to inescapable loop where compactFull is not running
We saw a state in which a shard would not perform full compactions, leading to a buildup of level 4 TSM files.
File state:
-rw-r--r--. 1 root root 2.1G Aug 5 20:59 000016684-000000007.tsm
-rw-r--r--. 1 root root 2.1G Aug 5 21:02 000016684-000000008.tsm
-rw-r--r--. 1 root root 2.1G Aug 5 21:04 000016684-000000009.tsm
-rw-r--r--. 1 root root 376M Aug 5 21:05 000016684-000000010.tsm
-rw-r--r--. 1 root root 2.1G Aug 5 18:00 000016812-000000004.tsm
-rw-r--r--. 1 root root 1.4G Aug 5 18:00 000016812-000000005.tsm
-rw-r--r--. 1 root root 1.3G Aug 5 21:21 000016844-000000002.tsm
-rw-r--r--. 1 root root 2.1G Aug 5 18:00 000016948-000000004.tsm
-rw-r--r--. 1 root root 1.4G Aug 5 18:00 000016948-000000005.tsm
-rw-r--r--. 1 root root 2.1G Aug 5 18:00 000017076-000000004.tsm
There is a rogue level 2 file, 000016844-000000002.tsm, packed between fully compacted files:
-rw-r--r--. 1 root root 2.1G Aug 5 20:59 000016684-000000007.tsm
-rw-r--r--. 1 root root 2.1G Aug 5 21:02 000016684-000000008.tsm
-rw-r--r--. 1 root root 2.1G Aug 5 21:04 000016684-000000009.tsm
-rw-r--r--. 1 root root 376M Aug 5 21:05 000016684-000000010.tsm
and level 4 files:
-rw-r--r--. 1 root root 2.1G Aug 5 18:00 000016948-000000004.tsm
-rw-r--r--. 1 root root 1.4G Aug 5 18:00 000016948-000000005.tsm
-rw-r--r--. 1 root root 2.1G Aug 5 18:00 000017076-000000004.tsm
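To make the levels in the listing explicit, here is a minimal Go sketch that parses the generation and sequence numbers out of each file name and reports any generation below level 4 that still has a level 4 generation after it. The helper names (parseTSMName, levelOf, rogueGenerations) are hypothetical and not part of InfluxDB. The level rule is an assumption based on our reading of the tsm1 engine: a generation's level is the sequence number of its first file, capped at 4.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseTSMName splits a name like "000016844-000000002.tsm" into its
// generation and sequence numbers.
func parseTSMName(name string) (gen, seq int, err error) {
	parts := strings.Split(strings.TrimSuffix(name, ".tsm"), "-")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("unexpected TSM file name: %q", name)
	}
	if gen, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	if seq, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, err
	}
	return gen, seq, nil
}

// levelOf is an ASSUMPTION about how levels are derived: the sequence
// number of a generation's first file, capped at 4.
func levelOf(firstSeq int) int {
	if firstSeq < 4 {
		return firstSeq
	}
	return 4
}

// rogueGenerations returns, oldest first, every generation below level 4
// that is followed by at least one level 4 generation.
func rogueGenerations(names []string) ([]int, error) {
	firstSeq := map[int]int{} // generation -> lowest sequence seen
	for _, n := range names {
		gen, seq, err := parseTSMName(n)
		if err != nil {
			return nil, err
		}
		if cur, ok := firstSeq[gen]; !ok || seq < cur {
			firstSeq[gen] = seq
		}
	}
	gens := make([]int, 0, len(firstSeq))
	for g := range firstSeq {
		gens = append(gens, g)
	}
	sort.Ints(gens)

	var rogue, pending []int // pending: low-level generations awaiting a later level 4 one
	for _, g := range gens {
		if levelOf(firstSeq[g]) < 4 {
			pending = append(pending, g)
			continue
		}
		rogue = append(rogue, pending...)
		pending = nil
	}
	return rogue, nil
}

func main() {
	names := []string{
		"000016684-000000007.tsm", "000016684-000000008.tsm",
		"000016684-000000009.tsm", "000016684-000000010.tsm",
		"000016812-000000004.tsm", "000016812-000000005.tsm",
		"000016844-000000002.tsm",
		"000016948-000000004.tsm", "000016948-000000005.tsm",
		"000017076-000000004.tsm",
	}
	rogue, err := rogueGenerations(names)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rogue) // prints [16844]
}
```

Run against the file names from this shard, the sketch flags only generation 16844, which matches the rogue file described above.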
The area of the code that causes this state to be skipped is here:
https://github.com/influxdata/influxdb/blob/22bec4f046a28e3f1fa815705362767151407e1b/tsdb/engine/tsm1/compact.go#L620-L670
We need to either add an escape mechanism that allows compactions to proceed from this state, or simplify this logic.
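As a discussion aid, the following Go sketch models one way such a stall can arise and one possible escape. It is a toy model, not the linked code: the generation type and both functions are hypothetical, and the model assumes the planner walks generations oldest first and stops at the first one below level 4. Under that assumption, the level 4 generations after the rogue file are never reached.

```go
package main

import "fmt"

// generation is a hypothetical, simplified stand-in for a TSM generation.
type generation struct {
	id    int
	level int
}

// leadingLevel4 models a scan that walks generations oldest first and
// stops at the first one below level 4. ASSUMPTION: a simplification for
// discussion, not the actual planner code.
func leadingLevel4(gens []generation) []generation {
	end := 0
	for i, g := range gens {
		if g.level < 4 {
			break
		}
		end = i + 1
	}
	return gens[:end]
}

// throughLastLevel4 is one hypothetical escape: select everything up to
// the last level 4 generation, so a lower-level generation stranded
// between level 4 generations is included instead of ending the scan.
func throughLastLevel4(gens []generation) []generation {
	last := -1
	for i, g := range gens {
		if g.level >= 4 {
			last = i
		}
	}
	return gens[:last+1]
}

func main() {
	// Generations and levels as described in this issue.
	stuck := []generation{
		{16684, 4}, {16812, 4}, {16844, 2}, {16948, 4}, {17076, 4},
	}
	fmt.Println(len(leadingLevel4(stuck)), len(throughLastLevel4(stuck))) // prints 2 5
}
```

In the usual layout, where lower-level generations are all newer than the level 4 ones, both functions select the same generations, so an escape of this shape would only change behavior in the stranded state.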
Steps to reproduce: This issue would be very difficult to replicate; we believe it was an artifact of running compactions on v1.12.1. We understand that the state outlined above results in a loop that never fully compacts the TSM files.
We note that no released version of InfluxDB 1.x will enter this state. It is only possible through manual deletion of TSM files. It happened for us in a release candidate with a bug. Should this state occur (for example, through manual deletion), we would like influxd to exit it.