skupper icon indicating copy to clipboard operation
skupper copied to clipboard

Network Observer connection errors after recreating a site

Open c-kruse opened this issue 10 months ago • 4 comments

Describe the bug When a network-observer deployment is running in a namespace with a site and the Site CR is deleted and recreated/replaced, the network-observer fails to reconnect to the new site.

The network-observer container logs these connection errors.

2025/03/11 20:38:55 ERROR session error on collector container component=collector error="session error: dial error: tls: failed to verify certificate: x509: certificate signed by unknown authority" delay=27.227748702s

Root Cause: The skupper-network-observer-client Certificate.skupper.io resource is managed by the network-observer deployment, and has an explicit reference to the "skupper-local-ca" Certificate that the skupper-controller creates as a child of a Site Record implicitly. The network-observer uses this certificate to connect to the router. When a Site is deleted, the controller knows to clean up the Site's children - including the skupper-local-ca certificate. When a new Site is created, the controller generates a new skupper-local-ca, but does not touch the existing TLS credentials for the skupper-network-observer-client Certificate. This means the network-observer and the router are now using different CA certs and no longer have trust.

Easy workaround

kubectl delete secret skupper-network-observer-client
kubectl rollout restart deployment/skupper-network-observer

How To Reproduce

# install site
skupper site create t01
helm install \
  skupper-network-observer oci://quay.io/skupper/helm/network-observer \
  --version 2.0.0 \
  --wait

# [informational] assert that the network observer client CA matches the local CA cert.
localCA=$(kubectl get secrets skupper-local-ca -ojsonpath='{.data.tls\.crt}')
observerCA=$(kubectl get secrets skupper-network-observer-client -ojsonpath='{.data.ca\.crt}')
if [ "$localCA" != "$observerCA" ]; then echo "Unexpected error"; fi

# delete site and recreate another
kubectl delete site t01
skupper site create t02

# observe that the network-observer certificate still matches the old site, not the new one.
localCA=$(kubectl get secrets skupper-local-ca -ojsonpath='{.data.tls\.crt}')
observerCA=$(kubectl get secrets skupper-network-observer-client -ojsonpath='{.data.ca\.crt}')
if [ "$localCA" != "$observerCA" ]; then echo "ERROR - network-observer will not be able to connect to network!"; fi

# Workaround by deleting the secret and restarting the network-observer deployment
kubectl delete secret skupper-network-observer-client
kubectl rollout restart deployment/skupper-network-observer

Expected behavior

TBD. I think we should think carefully about this - we probably don't want to blow away credentials willy-nilly.

Additional context

c-kruse avatar Mar 11 '25 21:03 c-kruse

@grs any thoughts here?

My initial thought is there's probably something the controller could reasonably do here. Maybe something like cascading deletes of Certificates with signing=true to secrets that are signed by that cert. Otherwise maybe the controller absorbs some of this and the network-observer deployment depends on a well-known client certificate as opposed to a CA cert?

c-kruse avatar Mar 11 '25 21:03 c-kruse

Initial options that occur:

(1) I agree the controller could easily create a client secret with a fixed name that can be used to connect to the router

(2) Could the skupper-network-observer-client Certificate include the CA Certificate it is signed by in its owner references? That way when the CA is deleted, the client certificate will be also.

(3) The controller could add the CA (certificate or secret) as an owner reference to the secret generated for any server or client certificate. (Essentially automating (2)). The question is whether there is ever a case you would not want this. I can't think of one. If the CA used to sign the certificate is deleted, you will most likely want to regenerate.

(4) The controller could alter the check it performs on generated secrets. It currently checks expiration, valid hosts etc and will regenerate if those do not match expectations. It would be reasonable to include checking the signature as well.

grs avatar Mar 12 '25 08:03 grs

(1) I think would be pretty uncomplicated on the network-observer side. We'd just drop the v2alpha1 Certificate resource and depend on a tls secret instead.

(2) This would imply some sort of operator to get the site CA certificate uid to reference in the network-observer's client Certificate, which doesn't lend itself super well to a static deployment. We could maybe sneak it in as another pre-install Job like we do with generating auth credentials by exception today, but that wouldn't solve for replacing the certificate after it is deleted.

(3+4) I think these are interesting. Especially if it is hooked up to clean up just the secret, leaving the server/client Certificate in place falling back on Not Ready with a "CA not found" message. Do we already have a good idea of how certificate rotation is going to end up working? If we were going to find our edge case where one of these behaviors was unwanted, my guess is we'd find it there.

c-kruse avatar Mar 12 '25 16:03 c-kruse

(2) This would imply some sort of operator to get the site CA certificate uid to reference in the network-observer's client Certificate, which doesn't lend itself super well to a static deployment. We could maybe sneak it in as another pre-install Job like we do with generating auth credentials by exception today, but that wouldn't solve for replacing the certificate after it is deleted.

Yeah, on reflection I think that is a bad option. Ultimately I think 3 and/or 4 are both 'the right thing' from the perspective of the certificate api. Once one or other of those was done, I think the problem would resolve itself.

(1) is also easy from the controllers perspective, and isn't mutually exclusive to 3/4, though if 1 was implemented the network-observer wouldn't really care too much about 3/4.

grs avatar Mar 12 '25 20:03 grs