etcd Flake TestRevision

Which github workflows are flaking?

unit tests

Which tests are flaking?

go.etcd.io/etcd/server/v3/etcdserver/api/v3compactor: TestRevision

Github Action link

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/etcd-io_etcd/17469/pull-etcd-unit-test/1760242318838861824

Reason for failure (if possible)

{Failed  === RUN   TestRevision
    logger.go:130: 2024-02-21T09:59:44.200Z	INFO	starting auto revision compaction	{"revision": 90, "revision-compaction-retention": 10}
    revision_test.go:48: len(actions) = 0, expected >= 1
--- FAIL: TestRevision (0.05s)
}

Anything else we need to know?

No response

Feb 22 '24 12:02 tjungblu

Hi,

from my understanding this comes from the wait timeout here:

https://github.com/etcd-io/etcd/blob/47e9a16a5c76718240b21e9be959c26556e87955/server/etcdserver/api/v3compactor/revision_test.go#L31

How do you normally handle such a case?

Feb 22 '24 20:02 ybaldus

Following https://github.com/etcd-io/etcd/issues/17054#issuecomment-1843017613, adding a time.Sleep(11 * time.Millisecond) on https://github.com/etcd-io/etcd/blob/7ded2ac8de85509f30e4c4125b061ff36b0160bb/server/etcdserver/api/v3compactor/revision.go#L79 will allow you to reproduce the issue

I could be wrong, but it seems every rg.Wait(1) in the test if effectively just a time.Sleep(10 * time.Millisecond) + recorder stream channel flush

Unfortunately, the fix described in https://github.com/etcd-io/etcd/pull/17513 is not enough as "do not timeout" causes the very first rg.Wait(1) to wait forever as the fc.Advance(revInterval) that precedes it advances time before the Revision.Run loop has started (you can check this by sleeping before advancing time)

There is also this rg.Wait(1) which seems to serve no purpose https://github.com/etcd-io/etcd/blob/7ded2ac8de85509f30e4c4125b061ff36b0160bb/server/etcdserver/api/v3compactor/revision_test.go#L56

I believe to fix this, we would need to know when the Revision.Run loop is ready A hacky way to implement this could be:

func newFakerClock() *fakerClock {
	return &fakerClock{
		FakeClock:    clockwork.NewFakeClock(),
		afterRequest: make(chan struct{}),
	}
}

func (frc *fakerClock) After(d time.Duration) <-chan time.Time {
	select {
	case frc.afterRequest <- struct{}{}:
	default:
	}
	return frc.FakeClock.After(d)
}

func TestRevision(t *testing.T) {
	fc := newFakerClock() // <- this changed
	rg := &fakeRevGetter{testutil.NewRecorderStreamWithWaitTimout(math.MaxInt64), 0}  // <- this changed
	compactable := &fakeCompactable{testutil.NewRecorderStreamWithWaitTimout(10 * time.Millisecond)}
	tb := newRevision(zaptest.NewLogger(t), fc, 10, rg, compactable)

	tb.Run()
	defer tb.Stop()

	<-fc.afterRequest  // <- this is new
	fc.Advance(revInterval)
	rg.Wait(1)

// ...

though I do not know if that is an acceptable solution. Thoughts @ahrtr ?

Apr 13 '24 22:04 Fube