cluster with kubernetes backend constantly updates the pods metadata
- Bug report
Environment
kubernetes 1.9
Stolon version
master-pg9.6 0.10.0-pg10
Expected behaviour you didn't see
pods do not update after initialisation
Unexpected behaviour you saw
pods constantly getting updated.
https://gist.github.com/lwolf/2981232c2ccaa87e3d15681bcc425fe0
A diff of the same pod taken a few seconds apart shows that infoUID in metadata.annotations.stolon-status constantly gets a new value, along with resourceVersion.
Steps to reproduce the problem
deploy the kubernetes example from this repository
@lwolf sorry but I'm missing something. It's normal that the keeper/proxy updates its state every few seconds and with the kubernetes store its state is saved in the pod metadata.
Looks like I'm also missing something. I haven't looked at the implementation of the new backend versus the old ones, but I suppose the main difference is where the state is saved.
With etcd backend, pods did not behave this way.
I'm using kubectl get pods -w all the time, and this flood of object updates does not feel right.
@lwolf the k8s api doesn't have a TTL to expire keys like etcd/consul do, so the fastest way I found was to save the keeper and proxy state in their pod metadata so it's automatically "expired" when the pod resource is removed. Another solution would be to expose an endpoint from the keeper/proxy to query the state directly from the process. It could be a future change (PRs are welcome).
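For anyone trying to picture the current approach, here is a minimal sketch (my own illustration, not stolon's actual code) of a component writing its state into an annotation on its own pod, so the state is gone when the pod is deleted. The stolon-status key comes from this issue; the downward-API env vars and JSON shape are assumptions:

```go
package main

import (
	"context"
	"encoding/json"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// publishState writes the component state into an annotation on the component's
// own pod, so it disappears together with the pod (no TTL needed).
func publishState(ctx context.Context, state interface{}) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// POD_NAME / POD_NAMESPACE would be injected via the downward API (assumption).
	podName := os.Getenv("POD_NAME")
	namespace := os.Getenv("POD_NAMESPACE")

	data, err := json.Marshal(state)
	if err != nil {
		return err
	}

	// Patch only the annotation; every patch still bumps the pod's
	// resourceVersion, which is what shows up as the constant updates
	// reported in this issue.
	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				"stolon-status": string(data),
			},
		},
	})
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Pods(namespace).Patch(
		ctx, podName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	// Toy call just to make the sketch self-contained.
	_ = publishState(context.Background(), map[string]string{"infoUID": "some-uid"})
}
```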
I see, thanks for the explanation. I really like the idea of a kubernetes backend for stolon, but these state updates are stopping me from using it for now.
It would be great if you could expand a little on what exactly should be done for the second approach, e.g. who would gather the state from that new endpoint inside the keeper/proxy?
The keepers and proxies would listen on an address:port and expose an http endpoint that, when called, provides their state. The sentinel would query them and use this data instead of the data written to etcd/consul or the pod metadata. I haven't done this because using only the store doesn't require a second communication path.
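A rough sketch of what such a status endpoint could look like (illustrative only; the /state path, the port, and the ComponentState fields are assumptions, not an agreed design):

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync"
	"time"
)

// ComponentState is a placeholder for whatever the keeper/proxy would report.
type ComponentState struct {
	UID       string    `json:"uid"`
	Healthy   bool      `json:"healthy"`
	UpdatedAt time.Time `json:"updatedAt"`
}

type stateServer struct {
	mu    sync.RWMutex
	state ComponentState
}

// set is called by the keeper/proxy whenever its internal state changes.
func (s *stateServer) set(st ComponentState) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.state = st
}

// handle serves the current state as JSON to whoever polls it (the sentinel).
func (s *stateServer) handle(w http.ResponseWriter, r *http.Request) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(s.state)
}

func main() {
	srv := &stateServer{}
	srv.set(ComponentState{UID: "keeper-0", Healthy: true, UpdatedAt: time.Now()})
	http.HandleFunc("/state", srv.handle)
	// The sentinel would GET http://<keeper>:8080/state on its usual check interval.
	http.ListenAndServe(":8080", nil)
}
```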
I understand your issue with watching pod changes, but I'm not sure it's a bad use of annotations, and the same happens when saving the cluster data in the configmap annotation (and we can't do anything about that). A solution would be to improve kubectl so it can ignore annotation-only changes when reporting.
BTW, using a dedicated etcd server instead of the k8s api will provide greater availability, since it isn't impacted by the k8s api servers (which are shared between all the k8s components, could be down when updating k8s, etc.; see the architecture doc).
I agree that having a separate etcd cluster is preferable for HA clusters. But I like the idea of using k8s as the backend for cases where the availability of stolon is not that critical.
I had an idea that keeping the state in the status part of an object could solve the issue with updates; I thought kubectl might ignore some changes there. I spent a few hours hacking with client-go, updating different parts of the object, but it seems that any update to the object is visible to kubectl.
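For illustration, the kind of client-side filtering this would need might look like the following (hypothetical; kubectl has no such option today): skip watch events where only the metadata, such as the stolon-status annotation, changed.

```go
package main

import (
	"context"
	"fmt"
	"reflect"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "default" is just an example namespace.
	w, err := client.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	last := map[string]*corev1.Pod{}
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		prev := last[pod.Name]
		last[pod.Name] = pod
		// Ignore updates where spec and status are unchanged, i.e. only the
		// metadata (such as the stolon-status annotation) was touched.
		if prev != nil &&
			reflect.DeepEqual(prev.Spec, pod.Spec) &&
			reflect.DeepEqual(prev.Status, pod.Status) {
			continue
		}
		fmt.Printf("%s pod %s\n", ev.Type, pod.Name)
	}
}
```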
Another possible solution could be to use CustomResources for the state.
CRDs aren't really suited for this kind of thing; they are for defining custom resources. In stolon we don't have custom resources, just one big thing that must be atomic, called clusterdata, plus components that publish their state. Using a CRD for components that publish their state isn't really good and doesn't solve the problem of removing them when the pod exits (every proxy gets a different uid at every start). We also cannot use a CRD to save the cluster spec, because CRDs don't permit the level of validation we need before saving the new value (like stolonctl does).
Just note that the configmap used to save the clusterdata is also a hack. We don't use any configmap feature but save the clusterdata in a configmap annotation. We use a configmap only because it (together with the endpoint) is one of the resources for which the k8s client already implements leader election.
Another solution, which wouldn't use a "status" endpoint and wouldn't save the component status in their pods, would be to use an additional resource shared by all the components, where they'd write their status (using a different annotation for every component) and the sentinel would periodically clean up old annotations. If someone wants to try this I'm open to PRs.
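A rough sketch of the sentinel-side cleanup for that idea (the shared ConfigMap name, annotation prefix, and staleness threshold below are all made up for illustration):

```go
package main

import (
	"context"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	statusConfigMap = "stolon-component-status" // hypothetical shared resource name
	statusPrefix    = "status.stolon/"          // hypothetical per-component annotation prefix
	staleAfter      = 30 * time.Second
)

// cleanupStale removes annotations for components that haven't reported recently.
// lastSeen would be maintained by the sentinel from the timestamps inside each status.
func cleanupStale(client kubernetes.Interface, ns string, lastSeen map[string]time.Time) error {
	cm, err := client.CoreV1().ConfigMaps(ns).Get(context.TODO(), statusConfigMap, metav1.GetOptions{})
	if err != nil {
		return err
	}
	changed := false
	for k := range cm.Annotations {
		if !strings.HasPrefix(k, statusPrefix) {
			continue
		}
		uid := strings.TrimPrefix(k, statusPrefix)
		if t, ok := lastSeen[uid]; !ok || time.Since(t) > staleAfter {
			delete(cm.Annotations, k)
			changed = true
		}
	}
	if !changed {
		return nil
	}
	_, err = client.CoreV1().ConfigMaps(ns).Update(context.TODO(), cm, metav1.UpdateOptions{})
	return err
}
```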
Another concern is that some controllers listen for pod events (e.g. ingress controllers), so the keeper/proxy updates cause some CPU (and logging) overhead there as well.
We've observed the same thing. There are other side effects which go beyond the one described here:
- The stolon service account is among the top 3 k8s API users - right beside the apiserver and kubelets. This is on a 15 node cluster.
- Measurable impact on k8s api server performance.
@sgotti, K8s has the Lease API: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.16/#lease-v1-coordination-k8s-io . It was seemingly designed to replace configmap-based leader election. Perhaps it could be used for an alternative implementation?
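For reference, Lease-based leader election with client-go typically looks like the sketch below (a generic example; the lease name, namespace, and timings are placeholders). As the reply that follows points out, this would only cover leader election, not clusterdata storage.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A Lease object replaces the configmap as the lock resource.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "stolon-sentinel-leader", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* act as the sentinel leader */ },
			OnStoppedLeading: func() { /* step down */ },
		},
	})
}
```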
@mindw This issue isn't related to the use of a configmap vs the lease api. We are using a configmap to store the clusterdata, and while we also use it for sentinel leader election, the lease api won't work since we still have to store the clusterdata and will need a configmap anyway.
This issue is related to the fact that the keepers and proxies write their status to their own pod metadata, and this is reflected when one watches for pod changes. I personally think this isn't an issue and should be fixed on the kubectl/kube api side by providing a way to filter update types. But if you want, you can implement something like what I proposed at the end of https://github.com/sorintlab/stolon/issues/463#issuecomment-379666733