
Track stabilization --experimental-wait-cluster-ready-timeout

Open serathius opened this issue 3 years ago • 7 comments

This is the tracking issue for graduating this feature. Let's work towards stabilization by discussing what steps should be taken for graduation. Context: https://github.com/etcd-io/etcd/issues/13775

serathius avatar Mar 10 '22 16:03 serathius

This is a new flag added recently in 3.6. Normally we should graduate a flag no earlier than the next release (>=3.7). But since this flag is low risk, I am OK with graduating it in 3.6 if there are no objections. cc @serathius @spzala @ptabor

ahrtr avatar Mar 11 '22 00:03 ahrtr

I have bulk-created issues to graduate all experimental flags. For this flag, which was added in v3.6, it should be reasonable to wait for v3.7.

serathius avatar Mar 14 '22 09:03 serathius

Hi, not sure if this is the correct place to ask, but I noticed one slight issue with this new flag. We don't notify systemd that we are ready to go, which means that etcd ends up in a restart loop in this scenario.

I've fixed this in a local install by changing startEtcd in etcdmain/etcd.go to also time out according to the config flag.

I'm OK with putting up an MR for this, but wanted to check that this is a desired change in the first place, and would also welcome suggestions on where to add tests for something like that.

niconorsk avatar Aug 24 '22 22:08 niconorsk

We don't notify systemd that we are ready to go, which means that etcd ends up in a restart loop in this scenario.

I am not sure I get your point. Do you mean systemd restarted etcd because etcd blocked on serve.go#L105?

The PR isn't cherry-picked to 3.5. Did you build etcd from the main branch yourself, or did you see this issue in a 3.6 alpha?

ahrtr avatar Aug 24 '22 23:08 ahrtr

I tested this with a patch on 3.3.11, because that is the old version I am using, but the code in question does not look to have changed.

Specifically, etcd blocks on etcd.go#L208

Which means that it never sends the ready message back to systemd, so systemd eventually times it out and restarts it.

niconorsk avatar Aug 25 '22 03:08 niconorsk

Got it. It is a real issue to me. If other members start too late or too slowly for whatever reason, then the running member may be restarted by systemd, which makes the situation even worse.

Thanks for raising this. We should add something like the code below. Please feel free to deliver a PR for this, but please add an e2e test case.

diff --git a/server/etcdmain/etcd.go b/server/etcdmain/etcd.go
index f35ebde6b..7f3ad1cd6 100644
--- a/server/etcdmain/etcd.go
+++ b/server/etcdmain/etcd.go
@@ -19,6 +19,7 @@ import (
        "os"
        "runtime"
        "strings"
+       "time"
 
        "go.etcd.io/etcd/client/pkg/v3/fileutil"
        "go.etcd.io/etcd/client/pkg/v3/logutil"
@@ -207,6 +208,8 @@ func startEtcd(cfg *embed.Config) (<-chan struct{}, <-chan error, error) {
        select {
        case <-e.Server.ReadyNotify(): // wait for e.Server to join the cluster
        case <-e.Server.StopNotify(): // publish aborted from 'ErrStopped'
+       case <-time.After(cfg.ExperimentalWaitClusterReadyTimeout):
+               e.GetLogger().Warn("startEtcd: timed out waiting for the ready notification")
        }
        return e.Server.StopNotify(), e.Err(), nil
 }

ahrtr avatar Aug 25 '22 04:08 ahrtr

I think the MR is good to go, but it needs someone to allow CI to run.

niconorsk avatar Aug 26 '22 00:08 niconorsk