feat: Add WaitWithTimeout to Partition and WaitGroupTimeout

Open devanbenz opened this issue 8 months ago • 0 comments

This PR makes it easier to debug retention service routines that may hang during DeleteShard.

We are currently seeing the following traces in goroutine profiles from customers who are experiencing issues where shards persist after their retention policy has expired.

      1103   runtime.gopark
             runtime.selectgo
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).runPeriodicCompaction
        32   runtime.gopark
             runtime.goparkunlock (inline)
             runtime.semacquire1
             sync.runtime_Semacquire
             sync.(*WaitGroup).Wait
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*LogFile).Close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).compactLogFile
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).compact.func1
        16   runtime.gopark
             runtime.goparkunlock (inline)
             runtime.semacquire1
             sync.runtime_Semacquire
             sync.(*WaitGroup).Wait
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*IndexFile).Close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).compactToLevel
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).compact.func2.1
         1   runtime.gopark
             runtime.chanrecv
             runtime.chanrecv1
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Wait
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Close
             github.com/influxdata/influxdb/tsdb.(*Shard).closeNoLock
             github.com/influxdata/influxdb/tsdb.(*Shard).Close
             github.com/influxdata/influxdb/tsdb.(*Store).DeleteShard
             github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck.func3
             github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck
             github.com/influxdata/influxdb/services/retention.(*Service).run
             github.com/influxdata/influxdb/services/retention.(*Service).Open.func1

Wait is defined here: https://github.com/influxdata/influxdb/pull/26294/files#diff-55346f580e7216556be601bef5602df49cf19af75131749c46096475d68126f9R379

I believe that we are somehow waiting indefinitely for CurrentCompactionN to reach 0, but it is never decremented to 0. This causes the retention service to hang when it reaches Partition.Close.

This PR will not resolve the issue, but it should show whether my theory is correct and help identify the root cause.

devanbenz, Apr 18 '25 18:04