[GitHub] Handshake failed: knownhosts: key mismatch

Open pkit opened this issue 3 years ago • 43 comments

Started getting these errors out of the blue on all clusters.

{"level":"error","ts":"2021-11-16T18:21:07.474Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/user/repository', error: ssh: handshake failed: knownhosts: key mismatch"}

Doing find -name known_hosts in the pod produces nothing. Restarting the pod = same error immediately. What's going on, where's the known_hosts file?

pkit avatar Nov 16 '21 18:11 pkit

What's going on, where's the known_hosts file?

The known_hosts file is in the same secret as the SSH key, please see the docs here https://fluxcd.io/docs/components/source/gitrepositories/#ssh-authentication

stefanprodan avatar Nov 16 '21 18:11 stefanprodan

I'm getting the same error on my cluster:

✗ GitRepository reconciliation failed: 'unable to clone 'ssh://git@github.com/stefanprodan/my-demo-fleet': ssh: handshake failed: knownhosts: key mismatch'

Looks like an issue with GitHub host keys.

stefanprodan avatar Nov 16 '21 18:11 stefanprodan

I am also seeing this error in the last 30 minutes on 3 clusters that had been previously working fine

kmannuz avatar Nov 16 '21 18:11 kmannuz

According to: https://github.blog/2021-09-01-improving-git-protocol-security-github/

Today is the day that host keys get rotated at GitHub. There are two new host keys in the blog post, one for ECDSA and another for Ed25519.

kingdonb avatar Nov 16 '21 18:11 kingdonb

Ok so rotating the SSH key fixes it.

Before:

$ k -n flux-system get secret flux-system -o json | jq '.data | map_values(@base64d)'
{
  "identity": "-----BEGIN PRIVATE KEY-----\n",
  "identity.pub": "ecdsa-sha2-nistp384 \n",
  "known_hosts": "github.com ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ=="
}

After:

{
  "identity": "-----BEGIN PRIVATE KEY-----\n",
  "identity.pub": "ecdsa-sha2-nistp384 \n",
  "known_hosts": "github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg="
}

stefanprodan avatar Nov 16 '21 18:11 stefanprodan

The known_hosts file is in the same secret as the SSH key, please see the docs here https://fluxcd.io/docs/components/source/gitrepositories/#ssh-authentication

Cool, thanks, but I do see the "old" keys when doing keyscan on the nodes. Somehow only the pods see the "new" ones. It makes sense though.

pkit avatar Nov 16 '21 18:11 pkit

GitHub has changed its SSH host keys from RSA to ECDSA! https://github.blog/2021-09-01-improving-git-protocol-security-github/

To fix the key mismatch error, you have two options:

Update the known_hosts in the flux-system secret with the ecdsa-sha2-nistp256 value:

github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=

Or rotate the SSH keys with flux bootstrap like so:

  • delete the deploy key secret from your cluster: kubectl -n flux-system delete secret flux-system
  • rerun flux bootstrap github with the same arguments as before
  • Flux will generate the secret with an ecdsa-sha2 SSH key and host key
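
To see which secrets still pin only the old ssh-rsa entry before choosing either option, a small check like the following may help. It is a sketch, assuming plain (unhashed) known_hosts lines without @cert-authority or @revoked markers; host_key_types is a hypothetical helper name, not a Flux or OpenSSH command.

```shell
#!/usr/bin/env bash

# host_key_types HOST
# Reads known_hosts content on stdin and prints the key algorithms pinned
# for HOST, one per line. Hypothetical helper, for illustration only.
host_key_types() {
  local host="$1"
  # Field 1 is a comma-separated list of host names, field 2 the algorithm.
  awk -v host="${host}" '
    /^[ \t]*(#|$)/ { next }   # skip comments and blank lines
    {
      n = split($1, names, ",")
      for (i = 1; i <= n; i++)
        if (names[i] == host) { print $2; break }
    }
  ' | sort -u
}
```

The decoded secret can be piped into it, for example: kubectl -n flux-system get secret flux-system -o json | jq -r '.data.known_hosts | @base64d' | host_key_types github.com. If the output lists only ssh-rsa, that secret still needs the ECDSA entry.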

stefanprodan avatar Nov 16 '21 18:11 stefanprodan

Updated known_hosts in flux-system secret manually everywhere. Seems to work now.

pkit avatar Nov 16 '21 19:11 pkit

If you'd like a short program to do it:

#!/usr/bin/env bash

set -e -u -o pipefail

# NB: The Ed25519-format key does not work with Flux.
for secret_name in flux-system repo-2 repo-3; do
  kubectl --namespace=flux-system \
          patch secret "${secret_name}" \
          --patch='
stringData:
  known_hosts: >
    github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg='
done

kubectl --namespace=flux-system rollout restart deployment source-controller
kubectl --namespace=flux-system rollout status deployment/source-controller --watch

seh avatar Nov 16 '21 19:11 seh

Confirmed. Working for us now as well after deleting the secret and bootstrapping again.

brianpham avatar Nov 16 '21 19:11 brianpham

@seh the secret is not mounted inside source-controller. Instead, the controller reads the secret from the Kubernetes API before each Git operation. I don't think you need rollout restart.

stefanprodan avatar Nov 16 '21 19:11 stefanprodan

I was finding that it sits idle, apparently because of a back-off timer: after several consecutive failures it won't try again for a while, but restarting it caused it to try again immediately.

seh avatar Nov 16 '21 19:11 seh

Variant on the above script: https://gist.github.com/ellieayla/76352313c4f5939db6d2268fb70b0d48

Then either wait or request each GitRepository to reconcile.
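
Requesting reconciliation for each source could be sketched as below. This assumes the flux CLI is installed and that the GitRepository objects live in the flux-system namespace; reconcile_sources is a hypothetical helper, and the repository names are placeholders taken from the script earlier in the thread.

```shell
#!/usr/bin/env bash

# reconcile_sources NAME...
# Asks Flux to reconcile each named GitRepository right away instead of
# waiting for the next retry. Hypothetical helper: RUN defaults to "echo",
# so the commands are only printed; set RUN to empty to really run them.
reconcile_sources() {
  local name
  for name in "$@"; do
    ${RUN-echo} flux reconcile source git "${name}" --namespace=flux-system
  done
}
```

Calling reconcile_sources flux-system repo-2 repo-3 prints the commands as a dry run; RUN= reconcile_sources flux-system repo-2 repo-3 executes them.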

ellieayla avatar Nov 16 '21 21:11 ellieayla

Confirming that we are suddenly getting this on our cluster as well.

poteat avatar Nov 16 '21 23:11 poteat

Note that with libgit2, the reported error is "unable to clone: Certificate", as in fluxcd/source-controller#397 and fluxcd/source-controller#433.

ellieayla avatar Nov 16 '21 23:11 ellieayla

@stefanprodan maybe add to the comment that if you edit the secrets manually, you should restart the source-controller after updating the secret, otherwise source-controller might overwrite the secret with the old values.

We've stopped the source-controller before updating the secrets and then started it again just to be safe:

kubectl scale deploy/source-controller --replicas=0

update the secrets

kubectl scale deploy/source-controller --replicas=1

Edit: the old ssh-rsa value gets added back somehow. Maybe kustomize-controller also needs to be restarted.

ghost avatar Nov 17 '21 10:11 ghost

otherwise source-controller might overwrite the secret with the old values.

source-controller doesn't alter secrets. It can't even do that, our RBAC allows the controller read-only access to secrets.

stefanprodan avatar Nov 17 '21 10:11 stefanprodan

Edit: the old ssh-rsa value gets added back somehow. Maybe kustomize-controller also needs to be restarted.

You clearly don't use bootstrap or you've stored the SSH keys in Git. If so, then update the secret in Git as well.

stefanprodan avatar Nov 17 '21 10:11 stefanprodan

Unfortunately, this was a predictable incident. It felt wrong to me, as a Flux user, to be providing a known hosts entry as part of the terraform bootstrap process (from this example) for precisely this reason.

To prevent another incident of similar scale in the future, why not give the source-controller the responsibility of maintaining the known hosts file? Presumably, given the URLs of the sources it has to reconcile, it should be fairly straightforward to use something like ssh-keyscan to keep the file up to date?

rtjfarrimond avatar Nov 17 '21 10:11 rtjfarrimond

It felt wrong to me, as a Flux user, to be providing a known hosts entry as part of the bootstrap process for precisely this reason.

Bootstrap does no such thing, Flux itself generates the known_hosts entries. As a Flux user, you are never asked to provide host keys.

stefanprodan avatar Nov 17 '21 10:11 stefanprodan

Are multiple known_hosts with different algorithms supported by the go-git implementation?

sebastian-dyroff avatar Nov 17 '21 10:11 sebastian-dyroff

Bootstrap does no such thing, Flux itself generates the known_hosts entries. As a Flux user, you are never asked to provide host keys.

@stefanprodan this example from the flux terraform provider examples certainly does.

rtjfarrimond avatar Nov 17 '21 10:11 rtjfarrimond

@rtjfarrimond I was referring to flux bootstrap not Terraform.

stefanprodan avatar Nov 17 '21 10:11 stefanprodan

I understand, but to be clear, in my original comment I was referring to the terraform bootstrap process. Updated the original comment to reflect this.

rtjfarrimond avatar Nov 17 '21 10:11 rtjfarrimond

To prevent another incident of similar scale in the future, why not give the source-controller the responsibility of maintaining the known hosts file?

How can a known_hosts file, which is used as a trust store, be automatically maintained by a service? That would render the known_hosts useless and allow any MITM attack to happen.

hiddeco avatar Nov 17 '21 10:11 hiddeco

We have two git sources, flux-system and flux-manifests. We've updated the known_hosts for both but for flux-manifests the known_hosts keeps getting replaced with the ssh-rsa key:

{
  "level": "debug",
  "ts": "2021-11-17T10:28:10.304Z",
  "logger": "events",
  "msg": "Normal",
  "object": {
    "kind": "Kustomization",
    "namespace": "flux-system",
    "name": "flux-system",
    "uid": "138b16f7-ca30-458e-a0b1-811b2900fa2c",
    "apiVersion": "kustomize.toolkit.fluxcd.io/v1beta2",
    "resourceVersion": "189896097"
  },
  "reason": "info",
  "message": "Secret/flux-system/flux-manifests configured"
}

Is known_hosts getting updated by the libgit2 callback ?

ghost avatar Nov 17 '21 10:11 ghost

Sorry, my bad. It looks like we have the secrets for flux-manifests in Git and flux is just reconciling the secrets.

ghost avatar Nov 17 '21 10:11 ghost

The Secret files are not managed or written to by any of the controllers, but only used for read operations. If something is overwriting your Secret, it must come from something within your configuration.

hiddeco avatar Nov 17 '21 10:11 hiddeco

How can a known_hosts file, that is used as a trust storage, be automatically maintained by a service? That would render the known_hosts useless and allow any MITM-attack to happen.

If the process that updates known_hosts runs on the same box, as the same user that uses the known_hosts file, where would the vector for a MITM be?

rtjfarrimond avatar Nov 17 '21 11:11 rtjfarrimond

By it automatically accepting the offered keys.

If your network is compromised and hostname.com suddenly starts serving traffic from compromised.com with a different host key, which is then automatically accepted by the controller, checking the host key no longer has any value.
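
The difference can be illustrated with a minimal sketch of a pinned check. It assumes plain (unhashed) known_hosts lines; host_key_is_pinned is a hypothetical helper, not how source-controller implements verification.

```shell
#!/usr/bin/env bash

# host_key_is_pinned HOST KEY_TYPE KEY_BASE64 KNOWN_HOSTS_FILE
# Succeeds only if exactly this key is already pinned for HOST.
# Hypothetical helper, for illustration only.
host_key_is_pinned() {
  local host="$1" key_type="$2" key="$3" file="$4"
  awk -v host="${host}" -v key_type="${key_type}" -v key="${key}" '
    {
      n = split($1, names, ",")
      for (i = 1; i <= n; i++)
        if (names[i] == host && $2 == key_type && $3 == key) found = 1
    }
    END { exit(found ? 0 : 1) }
  ' "${file}"
}
```

A key offered by an impostor fails this check because it is not in the file. If a service appended whatever key the server offered before checking, the same call would succeed for the impostor too, which is the situation described above.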

hiddeco avatar Nov 17 '21 11:11 hiddeco