source-controller
[GitHub] Handshake failed: knownhosts: key mismatch
Started getting these errors out of the blue on all clusters.
{"level":"error","ts":"2021-11-16T18:21:07.474Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/user/repository', error: ssh: handshake failed: knownhosts: key mismatch"}
Doing find -name known_hosts in the pod produces nothing.
Restarting the pod = same error immediately.
What's going on, where's the known_hosts file?
What's going on, where's the known_hosts file?
The known_hosts file is in the same secret as the SSH key, please see the docs here https://fluxcd.io/docs/components/source/gitrepositories/#ssh-authentication
I'm getting the same error on my cluster:
✗ GitRepository reconciliation failed: 'unable to clone 'ssh://git@github.com/stefanprodan/my-demo-fleet': ssh: handshake failed: knownhosts: key mismatch'
Looks like an issue with GitHub host keys.
I am also seeing this error in the last 30 minutes on 3 clusters that had been previously working fine
According to: https://github.blog/2021-09-01-improving-git-protocol-security-github/
Today is the day that host keys get rotated at GitHub. There are two new host keys in the blog post, one for ECDSA and another for Ed25519.
Ok so rotating the SSH key fixes it.
Before:
$ k -n flux-system get secret flux-system -o json | jq '.data | map_values(@base64d)'
{
"identity": "-----BEGIN PRIVATE KEY-----\n",
"identity.pub": "ecdsa-sha2-nistp384 \n",
"known_hosts": "github.com ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ=="
}
After:
{
"identity": "-----BEGIN PRIVATE KEY-----\n",
"identity.pub": "ecdsa-sha2-nistp384 \n",
"known_hosts": "github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg="
}
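To see at a glance which host key types a known_hosts value pins, a small helper along these lines may be useful. This is only a sketch: the function name known_hosts_key_types is made up for illustration, and it assumes plain (unhashed) entries of the form "host keytype key".

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the key types pinned for one host, reading
# known_hosts text from stdin. Assumes unhashed "host keytype base64" lines.
known_hosts_key_types() {
  local host="$1"
  awk -v host="${host}" '
    NF == 0    { next }               # skip blank lines
    $1 ~ /^#/  { next }               # skip comments
    {
      n = split($1, names, ",")       # first field may list several hostnames
      for (i = 1; i <= n; i++) {
        if (names[i] == host) { print $2; break }
      }
    }
  '
}

# Possible use against a cluster (assumes kubectl access to the secret):
#   kubectl -n flux-system get secret flux-system \
#     -o jsonpath='{.data.known_hosts}' | base64 -d | known_hosts_key_types github.com
```

For the "Before" secret above this would print only ssh-rsa, and for the "After" secret only ecdsa-sha2-nistp256.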
The known_hosts file is in the same secret as the SSH key, please see the docs here https://fluxcd.io/docs/components/source/gitrepositories/#ssh-authentication
Cool, thanks, but I do see the "old" keys when doing keyscan on the nodes. Somehow only the pods see the "new" ones. It makes sense though.
GitHub has changed its SSH host keys from DSA to ECDSA! https://github.blog/2021-09-01-improving-git-protocol-security-github/
To fix the key mismatch error, you have two options:
Update the known_hosts in the flux-system secret with the ecdsa-sha2-nistp256 value:
github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=
Or rotate the SSH keys with flux bootstrap like so:
- delete the deploy key secret from your cluster
kubectl -n flux-system delete secret flux-system
- rerun flux bootstrap github with the same arguments as before - Flux will generate the secret with an ecdsa-sha2 SSH key and host key
Updated known_hosts in the flux-system secret manually everywhere. Seems to work now.
If you'd like a short program to do it:
#!/usr/bin/env bash
set -e -u -o pipefail

# NB: The Ed25519-format key does not work with Flux.
for secret_name in flux-system repo-2 repo-3; do
  kubectl --namespace=flux-system \
    patch secret "${secret_name}" \
    --patch='
stringData:
  known_hosts: >
    github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg='
done

kubectl --namespace=flux-system rollout restart deployment source-controller
kubectl --namespace=flux-system rollout status deployment/source-controller --watch
Confirmed. Working for us now as well after deleting the secret and bootstrapping again.
@seh the secret is not mounted inside source-controller; instead, the controller reads the secret from the Kubernetes API before each Git operation. I don't think you need rollout restart.
I was finding that it sits idle, apparently because of a back-off timer, such that it won't try again for a while after several consecutive failures, but restarting it caused it to try again immediately.
Variant on the above script: https://gist.github.com/ellieayla/76352313c4f5939db6d2268fb70b0d48
Then either wait or request each GitRepository to reconcile.
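Requesting reconciliation for every GitRepository could be scripted roughly as follows. This is a sketch, not a tested procedure: the function name reconcile_git_sources is invented here, and it assumes both kubectl and the flux CLI are on PATH and point at the right cluster.

```shell
#!/usr/bin/env bash
# Sketch: ask Flux to reconcile every GitRepository in a namespace right away
# instead of waiting for the next interval or back-off retry.
# Assumes kubectl and the flux CLI are installed and configured.
reconcile_git_sources() {
  local namespace="${1:-flux-system}"
  local resource name
  # "-o name" prints e.g. gitrepository.source.toolkit.fluxcd.io/flux-system
  kubectl --namespace="${namespace}" get gitrepositories -o name |
    while IFS= read -r resource; do
      name="${resource##*/}"   # keep only the part after the slash
      # stdin is redirected so the CLI cannot swallow the remaining names
      flux reconcile source git "${name}" --namespace="${namespace}" < /dev/null
    done
}

# Possible use:
#   reconcile_git_sources flux-system
```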
Confirm that we are getting this on our cluster as well suddenly.
Note that with libgit2, the reported error is unable to clone: Certificate, à la fluxcd/source-controller#397 and fluxcd/source-controller#433.
@stefanprodan maybe add to the comment that if you edit the secrets manually, you should restart the source-controller after updating the secret, otherwise source-controller might overwrite the secret with the old values.
We've stopped the source-controller before updating the secrets and then started it again just to be safe:
kubectl scale deploy/source-controller --replicas=0
update the secrets
kubectl scale deploy/source-controller --replicas=1
Edit: the old ssh-rsa value gets added back somehow. Maybe kustomize-controller also needs to be restarted.
otherwise source-controller might overwrite the secret with the old values.
source-controller doesn't alter secrets. It can't even do that, our RBAC allows the controller read-only access to secrets.
Edit: the old ssh-rsa value gets added back somehow. Maybe kustomize-controller also needs to be restarted.
You clearly don't use bootstrap or you've stored the SSH keys in Git. If so, then update the secret in Git as well.
Unfortunately, this was a predictable incident. It felt wrong to me, as a Flux user, to be providing a known hosts entry as part of the terraform bootstrap process (from this example) for precisely this reason.
To prevent another incident of similar scale in the future, why not give the source-controller the responsibility of maintaining the known hosts file? Presumably, given the URLs of the sources it has to reconcile, it should be fairly straightforward to use something like ssh-keyscan to keep the file up to date?
It felt wrong to me, as a Flux user, to be providing a known hosts entry as part of the bootstrap process for precisely this reason.
Bootstrap does no such thing, Flux itself generates the known_hosts entries. As a Flux user, you are never asked to provide host keys.
Are multiple known_hosts entries with different algorithms supported by the go-git implementation?
Bootstrap does no such thing, Flux itself generates the known_hosts entries. As a Flux user, you are never asked to provide host keys.
@stefanprodan this example from the flux terraform provider examples certainly does.
@rtjfarrimond I was referring to flux bootstrap, not Terraform.
I understand, but to be clear, in my original comment I was referring to the terraform bootstrap process. Updated the original comment to reflect this.
To prevent another incident of similar scale in the future, why not give the source-controller the responsibility of maintaining the known hosts file?
How can a known_hosts file, which is used as a trust store, be automatically maintained by a service? That would render the known_hosts useless and allow any MITM attack to happen.
We have two git sources, flux-system and flux-manifests. We've updated the known_hosts for both, but for flux-manifests the known_hosts keeps getting replaced with the ssh-rsa key:
{
"level": "debug",
"ts": "2021-11-17T10:28:10.304Z",
"logger": "events",
"msg": "Normal",
"object": {
"kind": "Kustomization",
"namespace": "flux-system",
"name": "flux-system",
"uid": "138b16f7-ca30-458e-a0b1-811b2900fa2c",
"apiVersion": "kustomize.toolkit.fluxcd.io/v1beta2",
"resourceVersion": "189896097"
},
"reason": "info",
"message": "Secret/flux-system/flux-manifests configured"
}
Is known_hosts getting updated by the libgit2 callback?
Sorry, my bad. It looks like we have the secrets for flux-manifests in Git and Flux is just reconciling the secrets.
The Secret files are not managed or written to by any of the controllers, but only used for read operations. If something is overwriting your Secret, it must come from something within your configuration.
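One way to track down what keeps rewriting a Secret is to look at the field managers Kubernetes records in the object's metadata.managedFields. The sketch below assumes kubectl 1.21 or newer (for the --show-managed-fields flag); the function name secret_field_managers is invented for illustration, and the manager names you see depend on your setup.

```shell
#!/usr/bin/env bash
# Sketch: list which field managers last wrote to a Secret, with the kind of
# operation and its timestamp, taken from metadata.managedFields.
# Assumes kubectl >= 1.21 and read access to the Secret.
secret_field_managers() {
  local namespace="$1" name="$2"
  kubectl --namespace="${namespace}" get secret "${name}" \
    --show-managed-fields \
    -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\t"}{.time}{"\n"}{end}'
}

# Possible use:
#   secret_field_managers flux-system flux-manifests
```

If a controller that applies manifests from Git shows up with a recent timestamp, the Secret is most likely defined in the repository and needs to be updated there as well.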
How can a known_hosts file, that is used as a trust storage, be automatically maintained by a service? That would render the known_hosts useless and allow any MITM-attack to happen.
If the process that updates the known_hosts file runs on the same box, as the same user that uses the known_hosts file, where would the vector for a MITM be?
By it automatically accepting the offered keys.
If your network is compromised and hostname.com suddenly starts serving traffic from compromised.com with a different host key, which is then automatically accepted by the controller, checking the host key no longer has any value.
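The difference between pinning a host key and automatically accepting whatever is offered can be illustrated with a toy sketch. Both function names and the key strings used in the usage comment are placeholders made up for this example; this is not how Flux or any SSH client implements the check.

```shell
#!/usr/bin/env bash
# Toy sketch of pinning versus auto-accepting a host key.
# Both functions take a known_hosts file and the "host keytype key" line
# that the server offered (for example, one line of ssh-keyscan output).

# Pinning: succeed only if the offered line is already in the trusted file.
check_pinned() {
  local known_hosts_file="$1" offered="$2"
  grep -Fxq -- "${offered}" "${known_hosts_file}"
}

# Auto-accepting: an unknown key is appended to the trusted file, so the
# check can never fail and an impostor's key is trusted like the real one.
check_auto_accept() {
  local known_hosts_file="$1" offered="$2"
  check_pinned "${known_hosts_file}" "${offered}" ||
    printf '%s\n' "${offered}" >> "${known_hosts_file}"
  return 0
}

# Possible use (placeholder key material):
#   check_pinned ./known_hosts 'hostname.com ssh-ed25519 REALKEY'
```

With check_pinned, a key served by compromised.com under the name hostname.com is rejected; with check_auto_accept it is silently added, and every later check passes.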