k8s-image-swapper panic if tls: bad certificate

On the first run, the log sometimes shows the errors below. This should cause a panic and a restart of the process so that it can pick up the certificate. I assume a race condition with the Helm hook. Need to investigate.

2021/01/04 03:06:21 http: TLS handshake error from 100.96.2.0:30084: remote error: tls: bad certificate
2021/01/04 03:06:25 http: TLS handshake error from 100.96.2.0:9203: remote error: tls: bad certificate
2021/01/04 03:06:27 http: TLS handshake error from 100.96.2.0:13730: remote error: tls: bad certificate
2021/01/04 03:06:29 http: TLS handshake error from 100.96.2.0:9523: remote error: tls: bad certificate
2021/01/04 03:06:29 http: TLS handshake error from 100.96.2.0:32653: remote error: tls: bad certificate
2021/01/04 03:06:30 http: TLS handshake error from 100.96.2.0:33301: remote error: tls: bad certificate
2021/01/04 03:06:30 http: TLS handshake error from 100.96.2.0:6242: remote error: tls: bad certificate

Jan 04 '21 03:01 estahn

To add to this, @estahn: you're along the right lines with "I assume a race condition with the helm hook". We were seeing errors on the patch job saying that the serviceaccount didn't exist, because it was being removed as part of the hook before the job had been created, so the job failed to run.

I've not had a chance to investigate further, but thought I'd mention this.

Apr 12 '21 11:04 adamstrawson

@adamstrawson Have you seen this issue come up after applying this fix (https://github.com/estahn/charts/commit/25cb0ca888f449d29e324bc32da636019e94a96c)?

Aug 16 '21 04:08 estahn

We faced the same issue yesterday. It turned out the certificate had the wrong DNS names (subject alternative names), which did not match the webhook URL (we use cert-manager, by the way). I would rather have a better error message to help people identify the real problem faster and more easily.

Nov 18 '22 10:11 project0

Seeing the same issue:

We run image-swapper on AWS spot instances, so it was rescheduled onto a new node during a spot instance shutdown. It then got stuck in the "bad certificate loop", and a vital GitLab CI job failed because an image was no longer available upstream.

Bummer, as we use k8s-image-swapper to protect us from this exact scenario, but ATM we cannot be certain that image swapper is running correctly. Since the pod is still running, our monitoring will also not report any issues.

Sep 08 '23 10:09 Jasper-Ben

On further investigation, I believe in our case this is actually an issue with reloading a new cert-manager certificate:

1. The pod gets its certificate via a secret from cert-manager; the certificate is valid until date x.
2. At date x - y: cert-manager replaces the soon-to-expire certificate and updates the secret.
3. k8s-image-swapper does NOT reload the updated certificate from file.
4. At date x: the old certificate expires -> k8s-image-swapper runs into this issue.

Sep 08 '23 12:09 Jasper-Ben

@Jasper-Ben Thanks for investigating this. We could possibly use https://github.com/dyson/certman to circumvent this issue. If you have time to contribute that would be amazing, otherwise, I will see if I can squeeze this in ASAP.

Sep 08 '23 13:09 estahn

:wave: @estahn,

"We could possibly use https://github.com/dyson/certman to circumvent this issue."

I had a look at certman. There is an open issue that seems relevant to this use case: https://github.com/dyson/certman/issues/2. There is also an open PR to address it, but it has been sitting unmerged since the end of 2021: https://github.com/dyson/certman/pull/1

Might be easier to go the "Kubernetes way" of just panicking, thus triggering a pod recreation?

Or to put it this way: IMO the "panic on TLS error" (as this ticket describes) should happen in any case, to catch any odd misbehavior. If we, in addition, want to be fancy about certificate rotation then we could look into some reload logic.

"If you have time to contribute that would be amazing, otherwise, I will see if I can squeeze this in ASAP."

I might be able to take a look at it, but can't promise anything right now (busy schedule, you know how it is). If I manage, I'll let you know, otherwise feel free if you find the time :slightly_smiling_face:

Sep 08 '23 14:09 Jasper-Ben

@Jasper-Ben Fair enough.

This is related and can probably be used as guidance: https://github.com/golang/go/issues/38877#issuecomment-626906346

Sep 08 '23 14:09 estahn

FWIW, it might be easier for you to use https://github.com/stakater/Reloader, which can trigger a pod restart when the secret behind the cert changes.

Sep 11 '23 15:09 martin31821
