
panic if tls: bad certificate

Open estahn opened this issue 4 years ago • 9 comments

On the first run, the log sometimes shows the errors below. This should cause a panic and a restart of the process so that it can pick up the certificate. I assume a race condition with the Helm hook. Need to investigate.

2021/01/04 03:06:21 http: TLS handshake error from 100.96.2.0:30084: remote error: tls: bad certificate
2021/01/04 03:06:25 http: TLS handshake error from 100.96.2.0:9203: remote error: tls: bad certificate
2021/01/04 03:06:27 http: TLS handshake error from 100.96.2.0:13730: remote error: tls: bad certificate
2021/01/04 03:06:29 http: TLS handshake error from 100.96.2.0:9523: remote error: tls: bad certificate
2021/01/04 03:06:29 http: TLS handshake error from 100.96.2.0:32653: remote error: tls: bad certificate
2021/01/04 03:06:30 http: TLS handshake error from 100.96.2.0:33301: remote error: tls: bad certificate
2021/01/04 03:06:30 http: TLS handshake error from 100.96.2.0:6242: remote error: tls: bad certificate

estahn avatar Jan 04 '21 03:01 estahn

To add to this @estahn, you're along the right lines with "I assume a race condition with the helm hook". We were seeing errors on the patch job saying that the serviceaccount didn't exist: it was being removed as part of the hook before the job had been created, so the job failed to run.

I've not had a chance to investigate further, but thought I'd mention this.

adamstrawson avatar Apr 12 '21 11:04 adamstrawson

@adamstrawson Have you seen this issue come up after applying this fix (https://github.com/estahn/charts/commit/25cb0ca888f449d29e324bc32da636019e94a96c)?

estahn avatar Aug 16 '21 04:08 estahn

We just faced the same issue yesterday. It turned out the certificate had the wrong DNS names (subject alternative names), which did not match the webhook URL (we use cert-manager, btw). I would rather have a better error message to help people identify the real problem faster and more easily.
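A SAN mismatch like this can be checked outside the cluster. Below is a minimal Go sketch, not code from k8s-image-swapper: `checkSAN` and `selfSignedPEM` are hypothetical helpers, and the service DNS names are made up for illustration. It parses a PEM certificate and asks the standard library whether it covers the host name the webhook is called under.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// checkSAN reports whether the first certificate in certPEM is valid for host.
// The returned error lists the DNS names the certificate actually carries.
func checkSAN(certPEM []byte, host string) error {
	block, _ := pem.Decode(certPEM)
	if block == nil || block.Type != "CERTIFICATE" {
		return errors.New("no PEM certificate found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return fmt.Errorf("parse certificate: %w", err)
	}
	if err := cert.VerifyHostname(host); err != nil {
		return fmt.Errorf("certificate SANs %v do not cover %q: %w", cert.DNSNames, host, err)
	}
	return nil
}

// selfSignedPEM builds a throwaway self-signed certificate with the given
// DNS names. It exists only to make this sketch runnable.
func selfSignedPEM(dnsNames ...string) ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "demo"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

func main() {
	// Hypothetical service names; substitute the host from the webhook configuration.
	certPEM, err := selfSignedPEM("k8s-image-swapper.kube-system.svc")
	if err != nil {
		panic(err)
	}
	for _, host := range []string{"k8s-image-swapper.kube-system.svc", "k8s-image-swapper.default.svc"} {
		if err := checkSAN(certPEM, host); err != nil {
			fmt.Println("mismatch:", err)
		} else {
			fmt.Println("ok:", host)
		}
	}
}
```

Running the same check at startup and logging the result would be one way to get the clearer error message asked for here.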

project0 avatar Nov 18 '22 10:11 project0

Seeing the same issue:

image-swapper is running on AWS spot instances, and a spot instance shutdown caused it to be rescheduled onto a new node. It then got stuck in the "bad certificate loop" -> a vital GitLab CI job failed, because an image was no longer available upstream.

Bummer, as we use k8s-image-swapper to protect us from this exact scenario, but ATM we cannot be certain that image swapper is running correctly. Since the pod is still running, our monitoring will also not report any issues.

Jasper-Ben avatar Sep 08 '23 10:09 Jasper-Ben

On further investigation, I believe in our case this is actually an issue with reloading a new cert-manager certificate:

  1. Pod gets certificate via secret from cert-manager which is valid until date x.
  2. At date x-y: cert-manager replaces soon-to-expire certificate and updates secret.
  3. k8s-image-swapper does NOT reload the updated certificate from file.
  4. At date x: old certificate expires -> k8s-image-swapper runs into this issue
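One way to avoid step 3 is to let the TLS stack fetch the certificate on demand instead of loading it once at startup. The sketch below is an assumption about how that could look, not the project's implementation: `reloadingCert`, `writeDemoPair` and `demo` are hypothetical names. It uses the standard `tls.Config.GetCertificate` callback and simply re-reads the key pair from disk on every handshake.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"os"
	"path/filepath"
	"time"
)

// reloadingCert returns a tls.Config.GetCertificate callback that re-reads
// the key pair from disk on every handshake, so a rotated secret is picked up.
func reloadingCert(certFile, keyFile string) func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	return func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
		cert, err := tls.LoadX509KeyPair(certFile, keyFile)
		if err != nil {
			return nil, fmt.Errorf("reload key pair: %w", err)
		}
		return &cert, nil
	}
}

// writeDemoPair writes a throwaway self-signed pair with the given serial
// number into dir. It exists only to make this sketch runnable.
func writeDemoPair(dir string, serial int64) error {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: "demo"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	certDER, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: certDER})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	if err := os.WriteFile(filepath.Join(dir, "tls.crt"), certPEM, 0o600); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "tls.key"), keyPEM, 0o600)
}

// demo rotates the files on disk and returns the serial served before and after.
func demo() (before, after int64, err error) {
	dir, err := os.MkdirTemp("", "reload-demo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	get := reloadingCert(filepath.Join(dir, "tls.crt"), filepath.Join(dir, "tls.key"))

	serial := func() (int64, error) {
		cert, err := get(nil)
		if err != nil {
			return 0, err
		}
		leaf, err := x509.ParseCertificate(cert.Certificate[0])
		if err != nil {
			return 0, err
		}
		return leaf.SerialNumber.Int64(), nil
	}

	if err = writeDemoPair(dir, 1); err != nil {
		return 0, 0, err
	}
	if before, err = serial(); err != nil {
		return 0, 0, err
	}
	if err = writeDemoPair(dir, 2); err != nil { // simulates the cert-manager rotation
		return 0, 0, err
	}
	after, err = serial()
	return before, after, err
}

func main() {
	before, after, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("serial before rotation:", before, "after:", after)
}
```

In a server this callback would be set as `GetCertificate` on the `tls.Config` of the `http.Server`, which is then started with `ListenAndServeTLS("", "")`. Reading from disk on every handshake is the simplest variant; caching the parsed pair and reloading only when the file changes would reduce the overhead.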

Jasper-Ben avatar Sep 08 '23 12:09 Jasper-Ben

@Jasper-Ben Thanks for investigating this. We could possibly use https://github.com/dyson/certman to circumvent this issue. If you have time to contribute that would be amazing, otherwise, I will see if I can squeeze this in ASAP.

estahn avatar Sep 08 '23 13:09 estahn

:wave: @estahn,

> We could possibly use https://github.com/dyson/certman to circumvent this issue.

I had a look at certman. There is an open issue that seems relevant to this use case: https://github.com/dyson/certman/issues/2. There is also an open PR to address it, but it has sat unmerged since the end of 2021: https://github.com/dyson/certman/pull/1

Might be easier to go the "Kubernetes way" of just panicking, thus triggering a pod recreation?

Or to put it this way: IMO the "panic on TLS error" (as this ticket describes) should happen in any case, to catch any odd misbehavior. If we, in addition, want to be fancy about certificate rotation then we could look into some reload logic.
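A rough sketch of what "panic on TLS error" could look like in Go, under stated assumptions: `net/http` reports handshake failures through `http.Server.ErrorLog`, in the wording seen in the log lines at the top of this issue, so one option is a log writer that watches for that text. `fatalOnMatch`, `newTLSErrorLog` and `demo` are hypothetical names, and matching on log text is a workaround, not an official API for handshake errors.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
)

// fatalOnMatch forwards every log line to next and calls onMatch whenever a
// line contains needle.
type fatalOnMatch struct {
	next    io.Writer
	needle  []byte
	onMatch func(line string)
}

func (w *fatalOnMatch) Write(p []byte) (int, error) {
	n, err := w.next.Write(p)
	if bytes.Contains(p, w.needle) {
		w.onMatch(string(bytes.TrimSpace(p)))
	}
	return n, err
}

// newTLSErrorLog builds a logger suitable for http.Server.ErrorLog that
// triggers onMatch on "tls: bad certificate" messages.
func newTLSErrorLog(next io.Writer, onMatch func(string)) *log.Logger {
	w := &fatalOnMatch{next: next, needle: []byte("tls: bad certificate"), onMatch: onMatch}
	return log.New(w, "", log.LstdFlags)
}

// demo feeds two sample lines through the logger and returns how many matched.
func demo() int {
	hits := 0
	logger := newTLSErrorLog(io.Discard, func(line string) {
		// A real deployment might panic or exit here so that the kubelet
		// restarts the container; the demo only counts.
		hits++
	})
	logger.Printf("http: TLS handshake error from 100.96.2.0:30084: remote error: tls: bad certificate")
	logger.Printf("http: some unrelated server error")
	return hits
}

func main() {
	fmt.Println("matching lines:", demo())
}
```

Exiting on the first match may be too aggressive, since a single misconfigured client would restart the pod; counting matches over a time window before exiting is a possible refinement.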

> If you have time to contribute that would be amazing, otherwise, I will see if I can squeeze this in ASAP.

I might be able to take a look at it, but can't promise anything right now (busy schedule, you know how it is). If I manage, I'll let you know, otherwise feel free if you find the time :slightly_smiling_face:

Jasper-Ben avatar Sep 08 '23 14:09 Jasper-Ben

@Jasper-Ben Fair enough.

This is related and can probably be used as guidance: https://github.com/golang/go/issues/38877#issuecomment-626906346

estahn avatar Sep 08 '23 14:09 estahn

FWIW, it might be easier for you to use https://github.com/stakater/Reloader, which can trigger a pod restart when the secret behind the cert changes.

martin31821 avatar Sep 11 '23 15:09 martin31821