[v2] Multiple crashes of the router pod before network stabilizes
Describe the bug
Using a scripted two-site demonstration setup (script included below), the sites initialize, but the router pods go into CrashLoopBackOff and restart twice before the network finally stabilizes.
How To Reproduce
On a cluster, create namespaces demo-dmz-a and demo-dmz-b and run the provided setup script. Watch the pods and sites to observe crashes prior to network stabilization.
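For reference, a minimal sketch of the commands I use (the script is assumed to be saved as setup.sh; adjust names as needed):

kubectl create namespace demo-dmz-a
kubectl create namespace demo-dmz-b
sh setup.sh
# Watch the router pods in each namespace (one terminal per namespace)
kubectl get pods -n demo-dmz-a -w
kubectl get pods -n demo-dmz-b -w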
Expected behavior
I expect the network to stabilize in an orderly fashion without seeing crash indications.
Environment details
- Skupper CLI: None used
- Skupper Operator (if applicable): head of the v2 branch
- Platform: OpenShift
Additional context
The error seen in the router log prior to crashing:
2024-10-11 16:02:24.121816 +0000 ROUTER (critical) Router start-up failed: Python: CError: Configuration: Failed to configure TLS caCertFile '/etc/skupper-router-certs/skupper-site-server/ca.crt' from sslProfile 'skupper-site-server'
The script used to reproduce:
NS_PREFIX=demo
for i in dmz-a dmz-b; do
echo Create site $i in namespace ${NS_PREFIX}-$i
cat >temp <<EOF
apiVersion: skupper.io/v1alpha1
kind: Site
metadata:
  name: $i
spec:
  routerMode: "interior"
  ha: false
---
apiVersion: skupper.io/v1alpha1
kind: RouterAccess
metadata:
  name: $i-peer
spec:
  generateTlsCredentials: true
  issuer: skupper-site-ca
  roles:
  - name: inter-router
    port: 55671
  tlsCredentials: skupper-site-server
---
apiVersion: skupper.io/v1alpha1
kind: AccessGrant
metadata:
  name: $i-grant
spec:
  redemptionsAllowed: 10
  expirationWindow: 1h
EOF
kubectl apply -n ${NS_PREFIX}-$i -f temp
echo Generating access token token-$i.yaml
kubectl wait --for=condition=ready accessgrant/$i-grant -n ${NS_PREFIX}-$i
kubectl get accessgrant $i-grant -n ${NS_PREFIX}-$i -o yaml > temp
URL=`cat temp | yq '.status.url'`
CODE=`cat temp | yq '.status.code'`
CA=`cat temp | yq -r '.status.ca' | awk '{ print "    " $0 }'`
cat >token-$i.yaml <<EOF
apiVersion: skupper.io/v1alpha1
kind: AccessToken
metadata:
  name: $i-token
spec:
  url: ${URL}
  code: ${CODE}
  ca: |
${CA}
EOF
rm temp
echo
done
echo Link dmz-b to dmz-a
kubectl apply -f token-dmz-a.yaml -n ${NS_PREFIX}-dmz-b
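After the token is applied, this is roughly how I capture evidence of the crashes (the router pod and container names are assumptions and may differ in your cluster):

# Restarts show up in the RESTARTS column and as CrashLoopBackOff in STATUS
kubectl get pods -n demo-dmz-a -w
# Grab the previous (crashed) router container log, where the TLS error below appears
POD=$(kubectl get pods -n demo-dmz-a -o name | grep router)
kubectl logs -n demo-dmz-a ${POD} -c router --previous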
For your convenience, here is the cleanup script to undo the above reproducer:
NS_PREFIX=demo
for i in dmz-a dmz-b;
do
kubectl delete accessgrant $i-grant -n ${NS_PREFIX}-$i
kubectl delete routeraccess $i-peer -n ${NS_PREFIX}-$i
kubectl delete site $i -n ${NS_PREFIX}-$i
rm token-$i.yaml
done
kubectl delete accesstoken dmz-a-token -n ${NS_PREFIX}-dmz-b
Possible root cause:
The router has new behavior post-3.0.0 (I was running the latest in this test): sslProfiles now load the referenced certificate files immediately upon configuration. The old behavior was to load the certificates at connection startup for every new connection.
This means that before an sslProfile is created, all of its referenced files must already exist in the file system.
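To illustrate in isolation (a sketch, not part of the reproducer; it assumes a post-3.0.0 skrouterd binary on the PATH and uses a made-up certificate path):

cat >/tmp/bad-ssl.conf <<EOF
router {
    mode: interior
    id: ssl-test
}

sslProfile {
    name: skupper-site-server
    caCertFile: /tmp/no-such-dir/ca.crt
}
EOF
# With the new behavior, start-up fails immediately because the referenced
# caCertFile does not exist when the sslProfile is configured.
skrouterd -c /tmp/bad-ssl.conf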
In Skupper, the config-sync module stores the current configuration, including sslProfiles, in the router's ConfigMap. When the router starts up, it reads the configuration mounted from that ConfigMap as its initial configuration. The certificate files, however, are not mounted into the router container; they are copied at run time into a shared file system by the config-sync container.
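For illustration, here is roughly how to see both halves of that arrangement on a running pod (the ConfigMap, deployment, and container names are assumptions; the certificate path is taken from the error above):

# The sslProfile definitions live in the router ConfigMap
kubectl get configmap skupper-router -n demo-dmz-a -o yaml | grep -i -A4 sslprofile
# The certificate files only appear after config-sync copies them into the shared volume
kubectl exec -n demo-dmz-a deploy/skupper-router -c router -- \
  ls -l /etc/skupper-router-certs/skupper-site-server/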
This means there is a race condition at pod startup: if the router reads its initial configuration before config-sync has stored the certificate files, the router shuts down due to the incomplete configuration.
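Not a proposed fix, just to make the ordering concrete: a hypothetical startup wrapper that waits for the credentials before launching the router would close the race (the config file path is an assumption):

# Wait until config-sync has copied the TLS files referenced by the sslProfile
CERT_DIR=/etc/skupper-router-certs/skupper-site-server
until [ -f "${CERT_DIR}/ca.crt" ]; do
  echo "waiting for ${CERT_DIR}/ca.crt ..."
  sleep 1
done
exec skrouterd -c /etc/skupper-router/skrouterd.json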
@ted-ross I believe this is fixed now, do you agree or are you still seeing issues?