[bitnami/etcd] retryable preupgrade.sh
Name and Version
bitnami/etcd:3.5.21-debian-12-r5
What is the problem this feature will solve?
The preupgrade.sh script, which runs as a Helm hook during `helm upgrade etcd ...`, starts too early: it occasionally initiates a network connection before the pod's networking is ready. When etcdctl opens a socket to contact the etcd cluster and retrieve the member list, Kubernetes networking may not yet be fully initialized, so etcdctl times out after 5s. Increasing the timeout does not help, because the socket is already open and the packets were already sent.
In our case, we use Calico networking, and our logs show that the eth0 endpoint is brought up only about 300 ms after `etcdctl member list` is run by the preupgrade hook. The preupgrade fails even though the etcd cluster is fully operational:
2025-06-13T07:17:22.755728886Z etcd 07:17:22.75 INFO ==> Welcome to the Bitnami etcd container
2025-06-13T07:17:22.756982359Z etcd 07:17:22.75 INFO ==> Subscribe to project updates by watching https://github.com/bitnami/containers
2025-06-13T07:17:22.758264511Z etcd 07:17:22.75 INFO ==> Did you know there are enterprise versions of the Bitnami catalog? For enhanced secure software supply chain features, unlimited pulls from Docker, LTS support, or application customization, see Bitnami Premium or Tanzu Application Catalog. See https://www.arrow.com/globalecs/na/vendors/bitnami/ for more information.
2025-06-13T07:17:22.759453290Z etcd 07:17:22.75 INFO ==>
2025-06-13T07:17:22.851452032Z
2025-06-13T07:17:28.149734556Z {"level":"warn","ts":"2025-06-13T07:17:28.149528Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x4000438000/etcd-0.etcd-headless.dev-latest.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2025-06-13T07:17:28.149770159Z Error: context deadline exceeded
2025-06-13T07:17:28.152789715Z etcd 07:17:28.15 ERROR ==> Unable to list members, are all members healthy?
I know it sounds weird to complain that "app is too fast" 😕, but...
What is the feature you are proposing to solve the problem?
Allow more control over when (or how many times) the initial `etcdctl member list` is run:
- An optional, configurable `sleep $DELAY` before running `etcdctl` - easy and good enough.
- Configurable retries when the `etcdctl` command times out. This solution can't distinguish why `etcdctl` failed (disconnected network? unresponsive etcd?) - a more complex solution with various edge cases.
- A completely different, Kubernetes-only approach - e.g. modify the bitnami/etcd chart by adding an initContainer which would check network availability first (how?).
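To illustrate the retry option, here is a minimal sketch of a wrapper that preupgrade.sh could use. The function name and the `MAX_RETRIES`/`RETRY_DELAY` variables are hypothetical, not existing chart options:

```shell
# Hypothetical retry wrapper; MAX_RETRIES and RETRY_DELAY are
# illustrative names, not current bitnami/etcd chart settings.
retry_command() {
    local max_retries="${MAX_RETRIES:-5}"
    local delay="${RETRY_DELAY:-2}"
    local attempt=1
    while true; do
        # Run the wrapped command; success ends the loop immediately.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_retries" ]; then
            echo "Giving up after ${attempt} attempts" >&2
            return 1
        fi
        echo "Attempt ${attempt} failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Usage inside preupgrade.sh (etcdctl flags elided):
# retry_command etcdctl member list ...
```

As noted above, this cannot tell a CNI race from a genuinely unhealthy cluster; it simply buys time for the network to come up.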
Any other comments are warmly welcome.
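For the initContainer variant, one rough answer to the "(how?)" would be to block until the pod can resolve and reach an etcd client endpoint before the hook container starts. This is a sketch only; the container name, image, and endpoint are illustrative and would need to come from the chart's templates/values:

```yaml
# Hypothetical initContainer for the pre-upgrade Job template;
# names, image, and endpoint are illustrative assumptions.
initContainers:
  - name: wait-for-network
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        # Loop until DNS resolves and a TCP connection to the etcd
        # client port succeeds, i.e. pod networking is actually up.
        until nc -z etcd.dev-latest.svc.cluster.local 2379; do
          echo "waiting for pod networking/DNS..."
          sleep 1
        done
```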
What alternatives have you considered?
For now, I managed to decrease the failure probability from 100% to about 60% by lowering the container's CPU limit to 100m. Since the preupgrade job is retried six times on failure, the helm upgrade usually succeeds. Lowering the CPU limit for that job even further would probably improve the chances of a successful run, but I don't much like this approach.
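For reference, the workaround amounts to a values override along these lines. The `preUpgradeJob` key path is an assumption on my part; check the chart's values.yaml for the actual key governing the pre-upgrade hook's resources:

```yaml
# Hypothetical values override; "preUpgradeJob" is an assumed key,
# only the resources block itself is standard Kubernetes syntax.
preUpgradeJob:
  resources:
    limits:
      cpu: 100m
```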
Thank you for bringing this issue to our attention. We appreciate your involvement! If you're interested in contributing a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.
Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.