cluster with kubernetes backend constantly updates the pods metadata
- Bug report
Environment
kubernetes 1.9
Stolon version
master-pg9.6 0.10.0-pg10
Expected behaviour you didn't see
pods do not update after initialisation
Unexpected behaviour you saw
pods constantly getting updated.
https://gist.github.com/lwolf/2981232c2ccaa87e3d15681bcc425fe0
A diff of the same pod taken a few seconds apart shows that infoUID in metadata.annotations.stolon-status constantly gets a new value, along with resourceVersion.
Steps to reproduce the problem
deploy the kubernetes example from this repository
@lwolf sorry but I'm missing something. It's normal that the keeper/proxy updates its state every few seconds and with the kubernetes store its state is saved in the pod metadata.
Looks like I'm also missing something. I haven't looked at the implementation of the new backend versus the old ones, but I suppose the main difference is where the state is saved.
With etcd backend, pods did not behave this way.
I'm using kubectl get pods -w all the time, and this flood of object updates does not feel right.
@lwolf the k8s api doesn't have a TTL to expire keys like etcd/consul do, so the fastest way I found was to save the keeper and proxy state in their pod metadata so it's automatically "expired" when the pod resource is removed. Another solution would be to expose an endpoint from the keeper/proxy to query the state directly from the process. It could be a future change (PRs are welcome).
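For anyone trying to picture the current approach, here is a minimal sketch (my own illustration, not stolon's actual code) of a component writing its state into an annotation on its own pod, so the state is gone when the pod is deleted. The stolon-status key comes from this issue; the downward-API env vars and JSON shape are assumptions:

```go
package main

import (
	"context"
	"encoding/json"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// publishState writes the component state into an annotation on the component's
// own pod, so it disappears together with the pod (no TTL needed).
func publishState(ctx context.Context, state interface{}) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// POD_NAME / POD_NAMESPACE would be injected via the downward API (assumption).
	podName := os.Getenv("POD_NAME")
	namespace := os.Getenv("POD_NAMESPACE")

	data, err := json.Marshal(state)
	if err != nil {
		return err
	}

	// Patch only the annotation; every patch still bumps the pod's
	// resourceVersion, which is what shows up as the constant updates
	// reported in this issue.
	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				"stolon-status": string(data),
			},
		},
	})
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Pods(namespace).Patch(
		ctx, podName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	// Toy call just to make the sketch self-contained.
	_ = publishState(context.Background(), map[string]string{"infoUID": "some-uid"})
}
```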
I see, thanks for the explanation. I really like the idea of a kubernetes backend for stolon, but these state updates are stopping me from using it for now.
It would be great if you could expand a little on what exactly should be done for the second approach, e.g. who would gather the state from that new endpoint inside the keeper/proxy?
The keepers and proxies would listen on an address:port and expose an http endpoint that, when called, provides their state. The sentinel would query them and use this data instead of the data written to etcd/consul or the pod metadata. I haven't done this because using only the store doesn't require a second communication path.
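A rough sketch of what such a status endpoint could look like (illustrative only; the /state path, the port, and the ComponentState fields are assumptions, not an agreed design):

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync"
	"time"
)

// ComponentState is a placeholder for whatever the keeper/proxy would report.
type ComponentState struct {
	UID       string    `json:"uid"`
	Healthy   bool      `json:"healthy"`
	UpdatedAt time.Time `json:"updatedAt"`
}

type stateServer struct {
	mu    sync.RWMutex
	state ComponentState
}

// set is called by the keeper/proxy whenever its internal state changes.
func (s *stateServer) set(st ComponentState) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.state = st
}

// handle serves the current state as JSON to whoever polls it (the sentinel).
func (s *stateServer) handle(w http.ResponseWriter, r *http.Request) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(s.state)
}

func main() {
	srv := &stateServer{}
	srv.set(ComponentState{UID: "keeper-0", Healthy: true, UpdatedAt: time.Now()})
	http.HandleFunc("/state", srv.handle)
	// The sentinel would GET http://<keeper>:8080/state on its usual check interval.
	http.ListenAndServe(":8080", nil)
}
```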
I understand your issue with watching pod changes, but I'm not sure it's a bad use of annotations, and the same happens when saving the cluster data in the configmap annotation (and we can't do anything about that). A solution would be to improve kubectl so it can ignore annotation-only changes when reporting.
BTW, using a dedicated etcd server instead of the k8s api will provide greater availability, since it isn't impacted by the k8s api servers (which are shared between all the k8s components, could be down when updating k8s, etc.; see the architecture doc).
I agree that having a separate etcd cluster is preferable for HA clusters. But I like the idea of using k8s as the backend for cases where the availability of stolon is not that critical.
I had an idea that keeping the state in the status part of an object could solve the issue with updates; I thought kubectl might ignore some changes there. I spent a few hours hacking with client-go, updating different parts of the object, but it seems that any update to the object is visible to kubectl.
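For illustration, the kind of client-side filtering this would need might look like the following (hypothetical; kubectl has no such option today): skip watch events where only the metadata, such as the stolon-status annotation, changed.

```go
package main

import (
	"context"
	"fmt"
	"reflect"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "default" is just an example namespace.
	w, err := client.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	last := map[string]*corev1.Pod{}
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		prev := last[pod.Name]
		last[pod.Name] = pod
		// Ignore updates where spec and status are unchanged, i.e. only the
		// metadata (such as the stolon-status annotation) was touched.
		if prev != nil &&
			reflect.DeepEqual(prev.Spec, pod.Spec) &&
			reflect.DeepEqual(prev.Status, pod.Status) {
			continue
		}
		fmt.Printf("%s pod %s\n", ev.Type, pod.Name)
	}
}
```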
Another possible solution could be to use CustomResources for the state.
CRDs aren't really suited for this kind of thing; they are for defining custom resources. In stolon we don't have custom resources, just one big thing that must be atomic, called clusterdata, plus components that publish their state. Using a CRD for components that publish their state isn't really good and doesn't solve the problem of removing them when the pod exits (every proxy gets a different uid at every start). We also cannot use a CRD to save the cluster spec, because CRDs don't permit the level of validation we need before saving the new value (like stolonctl does).
Just note that the configmap used to save the clusterdata is also a hack. We don't use any configmap feature but save the clusterdata in a configmap annotation. We use a configmap only because it (together with the endpoint) is one of the resources for which the k8s client already implements leader election.
Another solution, which wouldn't use a "status" endpoint and wouldn't save the component status in their pods, would be to use an additional resource shared by all the components, where they'd write their status (using a different annotation for every component) and the sentinel would periodically clean up old annotations. If someone wants to try this I'm open to PRs.
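A rough sketch of the sentinel-side cleanup for that idea (the shared ConfigMap name, annotation prefix, and staleness threshold below are all made up for illustration):

```go
package main

import (
	"context"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	statusConfigMap = "stolon-component-status" // hypothetical shared resource name
	statusPrefix    = "status.stolon/"          // hypothetical per-component annotation prefix
	staleAfter      = 30 * time.Second
)

// cleanupStale removes annotations for components that haven't reported recently.
// lastSeen would be maintained by the sentinel from the timestamps inside each status.
func cleanupStale(client kubernetes.Interface, ns string, lastSeen map[string]time.Time) error {
	cm, err := client.CoreV1().ConfigMaps(ns).Get(context.TODO(), statusConfigMap, metav1.GetOptions{})
	if err != nil {
		return err
	}
	changed := false
	for k := range cm.Annotations {
		if !strings.HasPrefix(k, statusPrefix) {
			continue
		}
		uid := strings.TrimPrefix(k, statusPrefix)
		if t, ok := lastSeen[uid]; !ok || time.Since(t) > staleAfter {
			delete(cm.Annotations, k)
			changed = true
		}
	}
	if !changed {
		return nil
	}
	_, err = client.CoreV1().ConfigMaps(ns).Update(context.TODO(), cm, metav1.UpdateOptions{})
	return err
}
```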
Another concern is that some controllers listen for pod events (e.g. ingress controllers), so the keeper/proxy updates cause some CPU (and logging) overhead there as well.
We've observed the same thing. There are other side effects which go beyond the one described here:
- The stolon service account is among the top 3 k8s API users - right beside the apiserver and kubelets. This is on a 15 node cluster.
- Measurable impact on k8s api server performance.
@sgotti, K8s has the Lease API: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.16/#lease-v1-coordination-k8s-io . It was seemingly designed to replace configmap-based leader election. Perhaps it could be used for an alternative implementation?
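For reference, Lease-based leader election with client-go typically looks like the sketch below (a generic example; the lease name, namespace, and timings are placeholders). As the reply that follows points out, this would only cover leader election, not clusterdata storage.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A Lease object replaces the configmap as the lock resource.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "stolon-sentinel-leader", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* act as the sentinel leader */ },
			OnStoppedLeading: func() { /* step down */ },
		},
	})
}
```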
@mindw This issue isn't related to the use of a configmap vs the lease api. We are using a configmap to store the clusterdata, and while we also use it for sentinel leader election, the lease api won't work since we still have to store the clusterdata and will need a configmap anyway.
This issue is related to the fact that the keepers and proxies write their status to their own pod metadata, and this is reflected when one watches for pod changes. I personally think this isn't an issue and should be fixed on the kubectl/kube api side by providing a way to filter update types. But if you want, you can implement something like what I proposed at the end of https://github.com/sorintlab/stolon/issues/463#issuecomment-379666733