kube-fledged
TLS errors when updating kube-fledged helm chart
When the kube-fledged helm chart is redeployed, if the changes don't cause the webhook-server component to restart, any ImageCache operations start failing with:

Internal error occurred: failed calling webhook "validate-image-cache.kubefledged.io": failed to call webhook: Post "https://kube-fledged-webhook-server.kube-system.svc:3443/validate-image-cache?timeout=1s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubefledged.io")
It looks like this is because the webhook CA bundle is hardcoded in the helm chart, but when the webhook server starts, init-server generates a new CA bundle and updates the webhook configuration. When another deployment occurs, the original CA bundle is reapplied and webhook requests begin to fail, until the webhook component is restarted again to patch the bundle.
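To see what the upgrade put back, you can dump and decode the caBundle currently attached to the webhook; after a helm upgrade it no longer matches the CA that init-server generated for the running webhook server. The webhook configuration name below is an assumption, so list the configurations first if yours differs:

```sh
# list the webhook configurations installed by the chart
kubectl get validatingwebhookconfigurations

# decode the CA bundle the chart (re)applied to the webhook
kubectl get validatingwebhookconfiguration kube-fledged-validating-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d \
  | openssl x509 -noout -subject -issuer -enddate
```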
Is there a best practice for keeping that CA bundle configured appropriately? Would support for an existing Certificate secret make sense?
Steps to reproduce
Install base helm chart
helm install kube-fledged kubefledged-charts/kube-fledged -n kube-system --wait
Deploy a simple image cache
echo 'apiVersion: kubefledged.io/v1alpha2
kind: ImageCache
metadata:
  name: vault
  namespace: kube-system
spec:
  cacheSpec:
  - images:
    - vault:1.9.5
' | kubectl apply -f -
Update the helm chart with a value that doesn't restart the webhook server
helm upgrade --install kube-fledged kubefledged-charts/kube-fledged -n kube-system --set 'args.controllerImagePullPolicy=IfNotPresent' --wait
If you then update the ImageCache above, the webhook error shown earlier is returned. After restarting the webhook component, the operations succeed again.
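For reference, restarting the webhook server (so that init-server regenerates the certificate and re-patches the webhook configuration) looks something like this; the deployment name is an assumption based on the service name in the error above:

```sh
# trigger a rolling restart of the webhook server and wait for it to come back
kubectl -n kube-system rollout restart deployment kube-fledged-webhook-server
kubectl -n kube-system rollout status deployment kube-fledged-webhook-server
```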
@senthilrch In 9259975 the service account for the webhook server was removed. When I ran helm upgrade from v0.10.0 to (my fork of) v0.10.0 with 9259975 cherry-picked, the webhook server deployment still contained the fields serviceAccount and serviceAccountName with the former value (using Helm v3.11.2). This happens because Helm patches the existing deployment manifest (unless --force is used); the patch does not remove the field serviceAccount (deprecated, but kept in sync with serviceAccountName), and Kubernetes re-populates serviceAccountName from serviceAccount. Consequently, the new pod could not be started because the service account no longer existed.
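A quick way to confirm the leftover fields after the upgrade, and a one-off way to recover by forcing Helm to replace rather than patch the objects (the deployment name is an assumption based on the default chart naming):

```sh
# show what Helm's three-way merge left behind on the webhook server deployment
kubectl -n kube-system get deployment kube-fledged-webhook-server \
  -o jsonpath='{.spec.template.spec.serviceAccount} {.spec.template.spec.serviceAccountName}{"\n"}'

# recreate the objects instead of patching them
helm upgrade kube-fledged kubefledged-charts/kube-fledged -n kube-system --force
```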
There are two options to fix this:
- Add the following to the Helm template:
  # ensure helm upgrade deletes the formerly used fields
  serviceAccount: ""
  serviceAccountName: ""
- Revert the removal of the service account. In general it's a good practice to use a dedicated service account instead of default.
I suggest option 2.
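For illustration, option 2 would roughly mean keeping something like the following in the chart, with the webhook server deployment's pod spec referencing the dedicated account; the names here are assumptions, not the chart's actual templates:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-fledged-webhook-server
  namespace: kube-system
---
# excerpt of the webhook server Deployment's pod spec
spec:
  template:
    spec:
      serviceAccountName: kube-fledged-webhook-server
```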
@senthilrch any updates on this?
Hi there. @senthilrch any updates on a fix for this issue?
I'm using zarf to deploy/re-deploy kube-fledged packages, so in principle each deployment is a fresh installation of the helm charts: the previous installation is removed with zarf, and a new package is created and deployed again with zarf, which should bring up everything related to kube-fledged from scratch, including the webhook server. I'm still having the same issue.
Thanks
I've found a simple workaround for this issue: add the following to values.yaml to disable the webhook server and the validating webhook:
# Disable webhook server and validation webhook
webhookServer:
  enable: false
validatingWebhook:
  # Specifies whether a validating webhook configuration should be created
  create: false
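For reference, the same values can also be passed on the command line, assuming the value names above match the chart:

```sh
helm upgrade --install kube-fledged kubefledged-charts/kube-fledged -n kube-system \
  --set webhookServer.enable=false \
  --set validatingWebhook.create=false \
  --wait
```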
This is probably not the best solution, but from what I've seen in the code and in the Makefile's deploy-using-yaml target, this is a well-known issue and the validation is probably not strictly required.
status:
  completionTime: "2023-08-01T14:59:51Z"
  message: All requested images pulled succesfully to respective nodes
  reason: ImageCacheCreate
  startTime: "2023-08-01T14:59:42Z"
  status: Succeeded