k8s-image-swapper
panic if tls: bad certificate
On the first run, the webhook sometimes comes up with the errors below. This should cause a panic and a restart of the process so it can pick up the certificate. I assume a race condition with the Helm hook; this needs to be investigated.
2021/01/04 03:06:21 http: TLS handshake error from 100.96.2.0:30084: remote error: tls: bad certificate
2021/01/04 03:06:25 http: TLS handshake error from 100.96.2.0:9203: remote error: tls: bad certificate
2021/01/04 03:06:27 http: TLS handshake error from 100.96.2.0:13730: remote error: tls: bad certificate
2021/01/04 03:06:29 http: TLS handshake error from 100.96.2.0:9523: remote error: tls: bad certificate
2021/01/04 03:06:29 http: TLS handshake error from 100.96.2.0:32653: remote error: tls: bad certificate
2021/01/04 03:06:30 http: TLS handshake error from 100.96.2.0:33301: remote error: tls: bad certificate
2021/01/04 03:06:30 http: TLS handshake error from 100.96.2.0:6242: remote error: tls: bad certificate
To add to this @estahn, you're along the right lines with "I assume a race condition with the helm hook": we were finding errors on the patch job that the serviceaccount didn't exist, because it was being removed as part of the hook before the job had been created, and so the job failed to run.
I've not had a chance to investigate further, but thought I'd mention this.
@adamstrawson Have you seen this issue come up after applying this fix (https://github.com/estahn/charts/commit/25cb0ca888f449d29e324bc32da636019e94a96c)?
We just faced the same issue yesterday. It turned out the certificate had wrong DNS names (subject alternative names) that did not match the webhook URL (we use cert-manager, btw). I would rather have a better error message to help people identify the real problem faster and more easily.
Seeing the same issue:
k8s-image-swapper was running on AWS spot instances, which caused it to be rescheduled onto a new node during a spot instance shutdown. It got stuck in the "bad certificate" loop -> a vital GitLab CI job failed because the image was no longer available upstream.
Bummer, as we use k8s-image-swapper to protect us from this exact scenario, but at the moment we cannot be certain that k8s-image-swapper is running correctly. Since the pod is still running, our monitoring also does not report any issues.
On further investigation, I believe that in our case this is actually an issue with a renewed cert-manager certificate not being reloaded:
- Pod gets certificate via secret from cert-manager which is valid until date x.
- At date x-y: cert-manager replaces soon-to-expire certificate and updates secret.
- k8s-image-swapper does NOT reload the updated certificate from file (see the sketch below the list).
- At date x: old certificate expires -> k8s-image-swapper runs into this issue
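A minimal sketch of what per-handshake reloading could look like, assuming hypothetical mount paths for the TLS secret (the real paths and flags in k8s-image-swapper may differ): the server re-reads the key pair from disk inside tls.Config.GetCertificate, so a certificate rotated by cert-manager is picked up without restarting the pod.

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	// Hypothetical paths; the actual locations depend on how the webhook's
	// TLS secret is mounted into the pod.
	certFile := "/etc/webhook/certs/tls.crt"
	keyFile := "/etc/webhook/certs/tls.key"

	tlsConfig := &tls.Config{
		// GetCertificate runs for every incoming TLS handshake, so a key pair
		// that cert-manager has rotated on disk is used on the next connection.
		GetCertificate: func(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
			cert, err := tls.LoadX509KeyPair(certFile, keyFile)
			if err != nil {
				return nil, err
			}
			return &cert, nil
		},
	}

	server := &http.Server{Addr: ":8443", TLSConfig: tlsConfig}
	// Empty file arguments: the certificate comes from TLSConfig.GetCertificate.
	log.Fatal(server.ListenAndServeTLS("", ""))
}
```

Reading the key pair on every handshake is cheap at webhook traffic volumes; a file watcher that caches the certificate (which is what certman does) achieves the same effect with less I/O. Note that a Secret mounted as a volume is only updated in place if it is not mounted via subPath.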
@Jasper-Ben Thanks for investigating this. We could possibly use https://github.com/dyson/certman to circumvent this issue. If you have time to contribute that would be amazing, otherwise, I will see if I can squeeze this in ASAP.
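For reference, the wiring certman's README suggests looks roughly like this (a sketch only, with hypothetical paths; the exact API should be checked against the library): it watches the cert and key files and serves the current key pair through GetCertificate.

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"

	"github.com/dyson/certman"
)

func main() {
	// Hypothetical paths pointing at the mounted TLS secret.
	cm, err := certman.New("/etc/webhook/certs/tls.crt", "/etc/webhook/certs/tls.key")
	if err != nil {
		log.Fatal(err)
	}
	// Watch starts a file watcher and reloads the key pair when it changes on disk.
	if err := cm.Watch(); err != nil {
		log.Fatal(err)
	}

	server := &http.Server{
		Addr:      ":8443",
		TLSConfig: &tls.Config{GetCertificate: cm.GetCertificate},
	}
	log.Fatal(server.ListenAndServeTLS("", ""))
}
```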
:wave: @estahn,
We could possibly use https://github.com/dyson/certman to circumvent this issue.
I had a look at certman. There is an open issue which seems relevant to this use case: https://github.com/dyson/certman/issues/2 There is also an open PR to address it, but it has been sitting unmerged since the end of 2021: https://github.com/dyson/certman/pull/1
Might be easier to go the "Kubernetes way" of just panicking, thus triggering a pod recreation?
Or to put it this way: IMO the "panic on TLS error" (as this ticket describes) should happen in any case, to catch any odd misbehavior. If we, in addition, want to be fancy about certificate rotation then we could look into some reload logic.
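One hedged way to get that "panic on TLS error" behaviour without touching the TLS stack: net/http reports handshake failures through the server's ErrorLog, so a custom log sink can spot the bad-certificate message and terminate the process, letting Kubernetes restart the pod. The paths below are hypothetical and exiting on the first error is deliberately blunt; a real implementation would probably count or rate-limit errors first.

```go
package main

import (
	"log"
	"net/http"
	"os"
	"strings"
)

// exitOnBadCert is a log sink for http.Server.ErrorLog. The server logs lines
// like "http: TLS handshake error from ...: remote error: tls: bad certificate",
// so matching that message turns the silent failure loop into a pod restart.
type exitOnBadCert struct{}

func (exitOnBadCert) Write(p []byte) (int, error) {
	os.Stderr.Write(p) // keep the original log line visible
	if strings.Contains(string(p), "TLS handshake error") &&
		strings.Contains(string(p), "bad certificate") {
		log.Println("terminating so the pod restarts and reloads its certificate")
		os.Exit(1)
	}
	return len(p), nil
}

func main() {
	server := &http.Server{
		Addr:     ":8443",
		ErrorLog: log.New(exitOnBadCert{}, "", log.LstdFlags),
	}
	// Hypothetical certificate paths for illustration only.
	log.Fatal(server.ListenAndServeTLS("/etc/webhook/certs/tls.crt", "/etc/webhook/certs/tls.key"))
}
```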
If you have time to contribute that would be amazing, otherwise, I will see if I can squeeze this in ASAP.
I might be able to take a look at it, but I can't promise anything right now (busy schedule, you know how it is). If I manage to, I'll let you know; otherwise feel free to pick it up if you find the time :slightly_smiling_face:
@Jasper-Ben Fair enough.
This is related and can probably be used as guidance: https://github.com/golang/go/issues/38877#issuecomment-626906346
FWIW, it might be easier for you to use https://github.com/stakater/Reloader, which can trigger a pod restart when the secret behind the cert changes.