chore(backend): Add documentation for TLS cert rotation
Chore description
When pod-to-pod TLS (option recently added in #12082) is enabled, certs must be renewed several times per year. When secrets are updated or renewed, the user will need to restart their cluster. This is currently not documented and is not obvious to the user. Documentation explaining the TLS cert renewal process, including successfully applying the certs and restarting the cluster, should be added in the backend README.
Labels
/area backend
Nice work! Maybe we could also include an example command sequence for renewing the certs and restarting the cluster to make it even clearer for users.
Proposed Approach
Hi @alyssacgoins @aniketpati1121! I’d like to work on this issue. Here is the approach I plan to take; please let me know if it aligns with the maintainers’ expectations:
- Understand the Existing TLS Feature Context
- Review the implementation added in #12082 (pod-to-pod TLS support).
- Identify which backend services depend on the TLS cert secret (API server, persistence agent, cache server, metadata writer, etc.).
- Locate the specific Kubernetes Secret(s) used for TLS certs (usually kfp-pod-tls or similar).
- Document the TLS Cert Lifecycle
In the backend README, I plan to add a new subsection titled: “TLS Certificate Rotation (Pod-to-Pod TLS)”.
This section will cover:
- Why cert rotation is required (expiration every X months).
- Which secrets contain the certs.
- What parts of KFP rely on them.
- A clear explanation that KFP pods do not automatically reload TLS secrets, so a restart is required.
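To make the "which parts of KFP rely on the secret" point concrete, here is a minimal sketch that lists the Deployments whose pod template mounts a given Secret as a volume. The namespace `kubeflow` and the secret name `kfp-pod-tls` are assumptions taken from this thread, not confirmed values, and the helper name `deployments_using_secret` is hypothetical. It only inspects `volumes[*].secret.secretName`, so a secret consumed through a projected volume or an environment variable would not show up.

```shell
#!/bin/sh
# Sketch: list Deployments that mount a given Secret as a volume.
# NAMESPACE and SECRET_NAME are assumptions; adjust them to your install.
NAMESPACE="${NAMESPACE:-kubeflow}"
SECRET_NAME="${SECRET_NAME:-kfp-pod-tls}"   # hypothetical name from this thread
KUBECTL="${KUBECTL:-kubectl}"               # overridable, e.g. for testing

deployments_using_secret() {
  # Emit "<deployment><TAB><secret names mounted as volumes>" per Deployment,
  # then print only the Deployments whose list contains SECRET_NAME.
  "$KUBECTL" get deployments -n "$NAMESPACE" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.volumes[*].secret.secretName}{"\n"}{end}' |
    awk -F '\t' -v s="$SECRET_NAME" '{
      n = split($2, names, " ")
      for (i = 1; i <= n; i++) if (names[i] == s) { print $1; break }
    }'
}
```

Calling `deployments_using_secret` after sourcing the snippet prints one Deployment name per line; those are the workloads that would need a restart after the secret changes.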
- Provide a Step-by-Step Renewal Procedure
Add a concise, copy-paste-ready sequence for users, for example:
- Generate or obtain renewed TLS certificates
- Update the Kubernetes secret. Example:
kubectl create secret tls kfp-pod-tls -n kubeflow \
--cert=server.crt \
--key=server.key \
--dry-run=client -o yaml | kubectl apply -n kubeflow -f -
- Restart affected KFP components, such as:
kubectl rollout restart deploy -n kubeflow pipelines-api-server
kubectl rollout restart deploy -n kubeflow pipelines-persistenceagent
kubectl rollout restart deploy -n kubeflow pipelines-metadata-writer
kubectl rollout restart deploy -n kubeflow pipelines-cache
- Verify the rollout
kubectl get pods -n kubeflow
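The renewal steps above could also be wrapped in a small script that waits for each rollout to finish instead of relying on a manual `kubectl get pods` check. This is only a sketch: the secret name, namespace, and deployment names are the ones proposed in this thread and still need to be verified against the actual manifests, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: update the TLS secret, restart the affected Deployments, wait for rollouts.
# All names below are assumptions from this thread; verify them before use.
set -eu

NAMESPACE="${NAMESPACE:-kubeflow}"
SECRET_NAME="${SECRET_NAME:-kfp-pod-tls}"
DEPLOYMENTS="${DEPLOYMENTS:-pipelines-api-server pipelines-persistenceagent pipelines-metadata-writer pipelines-cache}"
KUBECTL="${KUBECTL:-kubectl}"   # overridable, e.g. for testing

update_tls_secret() {
  # $1 = certificate file, $2 = private key file.
  # Render the manifest first so a failed "create" aborts before "apply" runs.
  manifest="$("$KUBECTL" create secret tls "$SECRET_NAME" -n "$NAMESPACE" \
    --cert="$1" --key="$2" --dry-run=client -o yaml)"
  printf '%s\n' "$manifest" | "$KUBECTL" apply -n "$NAMESPACE" -f -
}

restart_and_wait() {
  # Trigger all restarts first, then block until each rollout completes.
  for d in $DEPLOYMENTS; do
    "$KUBECTL" rollout restart "deployment/$d" -n "$NAMESPACE"
  done
  for d in $DEPLOYMENTS; do
    "$KUBECTL" rollout status "deployment/$d" -n "$NAMESPACE" --timeout=120s
  done
}
```

A typical invocation after sourcing the snippet would be `update_tls_secret server.crt server.key && restart_and_wait`. `kubectl rollout status` exits non-zero if a Deployment does not become ready within the timeout, so the script stops at the first failed rollout.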
- Add Notes & Best Practices
- Recommended rotation intervals
- Suggest using cert-manager if available
- What errors users might see when expired certs are used
- How to confirm cert was successfully loaded after rotation
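For the rotation-interval and confirmation notes, one option is to read the certificate back out of the Secret and ask `openssl` for its expiry date. The sketch below assumes the secret is of type `kubernetes.io/tls` (so the certificate lives under the `tls.crt` key) and reuses the unconfirmed `kfp-pod-tls` / `kubeflow` names from this thread; the function names are hypothetical. It confirms what is stored in the Secret, not what a running pod is currently serving.

```shell
#!/bin/sh
# Sketch: inspect the expiry of the certificate stored in the TLS secret.
# NAMESPACE and SECRET_NAME are assumptions; adjust them to your install.
NAMESPACE="${NAMESPACE:-kubeflow}"
SECRET_NAME="${SECRET_NAME:-kfp-pod-tls}"
KUBECTL="${KUBECTL:-kubectl}"   # overridable, e.g. for testing
OPENSSL="${OPENSSL:-openssl}"   # overridable, e.g. for testing

secret_cert_pem() {
  # Secret data is base64-encoded; decode tls.crt back to PEM.
  "$KUBECTL" get secret "$SECRET_NAME" -n "$NAMESPACE" \
    -o jsonpath='{.data.tls\.crt}' | base64 -d
}

show_cert_expiry() {
  # Prints a line of the form "notAfter=<date>".
  secret_cert_pem | "$OPENSSL" x509 -noout -enddate
}

cert_valid_for_days() {
  # $1 = number of days. Exit status 0 if the certificate is still valid
  # that many days from now, non-zero if it will have expired by then.
  secret_cert_pem | "$OPENSSL" x509 -noout -checkend "$(( $1 * 86400 ))"
}
```

For example, `cert_valid_for_days 30 || echo "rotate soon"` could run from a scheduled job to flag certificates that expire within a month.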
- Final Formatting & Linking
- Cross-link back to the pod-to-pod TLS feature PR (#12082).
- Insert this new section near other backend operational notes.
- Ensure consistent formatting with other documentation sections.
If this approach looks good, I can start implementing the documentation update. Thanks!
@alyssacgoins @aniketpati1121 Are there any updates on this issue? I would love to work on it. Please let me know if it is available.
Hi @alyssacgoins @aniketpati1121 I have opened a PR addressing this issue: #12457.
It includes the full TLS certificate rotation documentation along with optional helper scripts as discussed. Please take a look whenever you get a chance. I'm happy to apply any changes or improvements you suggest!
Thanks!
Hey @rahul810050 thanks for taking this issue on! Your plan looks good to me, and I'll review your PR.