Error if a Site is created but old configuration is still present
In case a Site is deleted and recreated quickly (e.g. through automation), the skupper-router ConfigMap owned by the previous site may still be present.
The controller now fails if it finds a router configuration that is not owned by the currently active site.
Fixes #2323.
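As a rough illustration of the condition being detected, the owner of the router configuration can be compared with the active Site by UID. A minimal spot check using standard `oc` jsonpath queries; the ConfigMap name comes from this description, while the namespace and site name are placeholders:

```console
# UID recorded in the router ConfigMap's ownerReference
$ oc get cm skupper-router -n <namespace> -o jsonpath='{.metadata.ownerReferences[0].uid}'
# UID of the currently active Site
$ oc get site <site-name> -n <namespace> -o jsonpath='{.metadata.uid}'
```

If the two UIDs differ, the router configuration belongs to a previously deleted Site, and the controller now reports an error instead of silently reusing it.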
@fgiorgetti I don't know what this exact issue is; however, if you look at the output below from when I ran the skupper site deletion command, I checked all the objects and everything was completely clean.
```console
$ skupper site delete --all -n test-site-new
Waiting for deletion to complete...
Site "test-site-new" is deleted

$ oc get site
No resources found in test-site-new namespace.

$ oc get pods
No resources found in test-site-new namespace.

$ oc get secret
NAME                             TYPE                             DATA   AGE
all-icr-io                       kubernetes.io/dockerconfigjson   1      36m
builder-dockercfg-8b9t7          kubernetes.io/dockercfg          1      36m
default-dockercfg-6s5xw          kubernetes.io/dockercfg          1      36m
deployer-dockercfg-57xwx         kubernetes.io/dockercfg          1      36m
pipeline-dockercfg-8mbn6         kubernetes.io/dockercfg          1      36m
skupper-router-dockercfg-ctjvr   kubernetes.io/dockercfg          1      35m

$ oc get cm
NAME                       DATA   AGE
config-service-cabundle    1      36m
config-trusted-cabundle    1      36m
kube-root-ca.crt           1      36m
openshift-service-ca.crt   1      36m
```
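Given the `oc get cm` listing above, a direct query for the router ConfigMap this fix guards against confirms it is gone (assuming the standard NotFound response):

```console
$ oc get cm skupper-router -n test-site-new
Error from server (NotFound): configmaps "skupper-router" not found
```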
I am happy that you were able to reproduce this issue. Unfortunately, I was unable to reproduce this in our lower environments; it is happening only in our production environment.
@vsomwanshi I was able to reproduce it when I quickly delete/create a site, e.g. in an automated way through a script.
What happened was that once a site is deleted and another site is created, the new site is processed before the old resources owned by the deleted site have been removed, causing that error, which, as you pointed out in the issue, can be recovered by restarting the skupper-controller pod.
Can you share some details on the procedure you are following in production to reproduce it? Is it possible that you have two sites created in the same namespace at the time you're deleting it? This could potentially be a similar trigger. Or, once you remove a site, is there any GitOps operator applying a new site definition?
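For reference, a minimal sketch of the kind of script that triggers this race, assuming `site.yaml` holds the Site definition (file name and namespace are placeholders, not the actual reproducer):

```console
$ oc apply -f site.yaml -n test-site-new
$ oc delete site test-site-new -n test-site-new
$ oc apply -f site.yaml -n test-site-new   # recreate immediately, without waiting
```

The new Site can then be reconciled while the old skupper-router ConfigMap is still pending garbage collection, producing the error this PR now detects.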
@fgiorgetti Please find my comments inline.

> Can you share some details on the procedure you are following in production to reproduce it?

--> We follow the steps below to reproduce the issue (see the command sketch after Method 2):
## Method 1:
- Delete the skupper site from the CLI using the command: `skupper site delete --all -n <namespace>`
- Wait for some time for the Site object, as well as the other related components, to be deleted.
- Sync the Site YAML configuration (Site object) from GitOps, which will eventually create all the related objects as well.
## Method 2:
- Delete the skupper Site object from GitOps.
- Wait for some time for the Site object, as well as the other related components, to be deleted.
- Sync the Site YAML configuration (Site object) from GitOps, which will eventually create all the related objects as well.
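For illustration, Method 1 expressed as commands, assuming the Site is managed by OpenShift GitOps (Argo CD) and the application name is hypothetical:

```console
$ skupper site delete --all -n <namespace>
# ...wait for the Site and related objects to be removed...
$ argocd app sync <site-app>   # GitOps re-creates the Site and its children
```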
> Is it possible that you have two sites created in the same namespace at the time you're deleting it? This could potentially be a similar trigger to that.

--> We have a 1:1 mapping; we create only one skupper site per namespace. In any case, the skupper controller will not allow you to create another site in the same namespace when a Site object is already present.
> Or, once you remove a site, is there any GitOps operator applying a new site definition?

--> Yes, our entire skupper setup is managed through GitOps. The skupper controller, CRDs, and sizing profile ConfigMaps are deployed in one dedicated namespace, and Sites are deployed in separate namespaces. In our production environment we have 55 skupper sites created in one OpenShift cluster, and each site has 14 listeners and 5 connectors.
Not sure why, but somehow I am unable to reproduce this issue in our lower environments. Could it be happening in production because, as mentioned in the comment above, we have 55 skupper sites created in one OpenShift cluster, each with 14 listeners and 5 connectors? Is that generating more events, due to which the skupper-controller becomes unstable or unable to keep track of the site cleanup operations?
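As a side note, the per-namespace event volume the controller has to process can be gauged with standard `oc` flags, for example:

```console
$ oc get events -n <site-namespace> --sort-by=.lastTimestamp | tail -n 20
```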
@fgiorgetti (or anyone who can answer this): this fix you are applying would be part of the latest release, right? Maybe skupper 2.1.3? I can see a lot of issues your team has fixed, and I will need to roll them out in our environments in the near future.
If I go with this release in our environments:
[1] During the upgrade from 2.1.0 to 2.1.3, do I just need to upgrade the skupper controller to 2.1.3, with the rest completely taken care of by the controller itself (e.g. upgrading skupper-router, kube-adaptor, etc.)?
[2] I believe no downtime is required for this upgrade process, but I am asking just for confirmation so I can take it to management accordingly.
[3] There is no need to touch the Sites, and skupper link recreation is also not required?
Thank you.
> @fgiorgetti (or anyone who can answer this): this fix you are applying would be part of the latest release, right? Maybe skupper 2.1.3? I can see a lot of issues your team has fixed, and I will need to roll them out in our environments in the near future.

Yes, the idea is that this fix will be included as part of 2.1.3, but I am still waiting on more feedback from reviewers.
> [1] During the upgrade from 2.1.0 to 2.1.3, do I just need to upgrade the skupper controller to 2.1.3, with the rest completely taken care of by the controller itself (e.g. upgrading skupper-router, kube-adaptor, etc.)?

Correct.
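For what it's worth, a minimal sketch of such a controller-only upgrade; the deployment name, container name, namespace, and image reference are assumptions and should be adapted to your install (or simply bump the image tag in your GitOps repo):

```console
$ oc set image deployment/skupper-controller controller=quay.io/skupper/controller:2.1.3 -n <controller-namespace>
$ oc rollout status deployment/skupper-controller -n <controller-namespace>
```

The controller then reconciles every Site on its own, updating skupper-router and kube-adaptor in each site namespace.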
> [2] I believe no downtime is required for this upgrade process, but I am asking just for confirmation so I can take it to management accordingly.

There is a downtime: once the controller is updated, it will also update all your sites, so the skupper-router deployment in each namespace will be updated as well, causing a restart.
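The per-namespace restart can be observed as a normal rollout, for example:

```console
$ oc rollout status deployment/skupper-router -n <site-namespace>
```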
> [3] There is no need to touch the Sites, and skupper link recreation is also not required?

Exactly. All existing sites and configuration are preserved.