[Bug]: wal truncator doesn't work if there's no writing after streamingnode restart or wal balance away.

Open chyezh opened this issue 2 months ago • 3 comments

### Is there an existing issue for this?

- [x] I have searched the existing issues

### Environment

- Milvus version: v2.6.1
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

### Current Behavior

The WAL truncator doesn't persist any checkpoint samples, so after a streamingnode restart the truncator's samples are lost. Meanwhile, the `samplingTruncator` initializes `lastSampled` with `time.Now()`, so no checkpoint is sampled during the first half hour (the sample interval) after the restart. If Milvus then writes no data once that half hour has passed, no later checkpoint is ever sampled and the truncator will not truncate the WAL.

```go
// newSamplingTruncator creates a new sampling truncator.
func newSamplingTruncator(
	checkpoint *WALCheckpoint,
	truncator walimpls.WALImpls,
	recoveryMetrics *recoveryMetrics,
) *samplingTruncator {
	st := &samplingTruncator{
		notifier:                syncutil.NewAsyncTaskNotifier[struct{}](),
		cfg:                     newTruncatorConfig(),
		truncator:               truncator,
		mu:                      sync.Mutex{},
		checkpointSamples:       []*WALCheckpoint{checkpoint},
		lastTruncatedCheckpoint: nil,
		lastSampled:             time.Now(),
		metrics:                 recoveryMetrics,
	}
	go st.background()
	return st
}
```

Also see #44369.

### Expected Behavior

_No response_

### Steps To Reproduce

_No response_

### Milvus Log

_No response_

### Anything else?

_No response_

chyezh, Sep 18 '25 02:09

/assign @chyezh

chyezh, Sep 18 '25 02:09

Implement a truncator with persisted status

chyezh, Sep 18 '25 02:09

will be fixed by #45350

chyezh, Nov 17 '25 02:11