Issue with SSH key on k8s-infra-prow-build cluster for CRI-O nodes
What happened:
Every time we try to run CRI-O tests on the k8s-infra-prow-build cluster, SSH fails. Most recently:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/107361/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1479158719756374016/#1:build-log.txt%3A327
```
W0106 18:45:27.531] I0106 18:45:27.530809 6869 ssh.go:120] Running the command ssh, with args: [-o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR -i /workspace/.ssh/google_compute_engine [email protected] -- sudo sh -c 'systemctl list-units --type=service --state=running | grep -e docker -e containerd -e crio']
W0106 18:45:27.942] E0106 18:45:27.942276 6869 ssh.go:123] failed to run SSH command: out: [email protected]: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
W0106 18:45:27.942] , err: exit status 255
```
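For reference, `ssh` exiting with status 255 means the failure happened in ssh itself (here, publickey authentication) rather than in the remote command, which would otherwise pass its own exit code through. A minimal sketch of that distinction (the classifier function is illustrative, not code from test-infra's ssh.go):

```shell
# Hypothetical helper: distinguish an ssh-level failure (exit 255) from the
# remote command's own exit status, as implied by the ssh.go log above.
classify_ssh_result() {
  status="$1"; output="$2"
  if [ "$status" -eq 255 ]; then
    case "$output" in
      *"Permission denied (publickey"*) echo "auth-failure" ;;
      *) echo "ssh-failure" ;;
    esac
  else
    # ssh forwards the remote command's exit status when the connection worked
    echo "remote-exit:$status"
  fi
}

classify_ssh_result 255 "[email protected]: Permission denied (publickey,gssapi-keyex,gssapi-with-mic)."
# prints: auth-failure
```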
What you expected to happen:
Tests should be able to SSH successfully.
How to reproduce it (as minimally and precisely as possible):
We've seen this every time we've tried to migrate CRI-O tests to this cluster, e.g. https://github.com/kubernetes/kubernetes/issues/102624 https://github.com/kubernetes/test-infra/pull/24591
I think there is an issue with the SSH key. We keep working around this by avoiding running CRI-O tests on this cluster. Can someone from SIG Infra help out?
Please provide links to example occurrences, if any:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/107361/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1479158719756374016/#1:build-log.txt%3A327
Anything else we need to know?:
/sig node k8s-infra
SIG k8s-infra is most active in https://github.com/kubernetes/k8s.io, where most of the infra is managed (including the k8s-infra-prow-builds cluster).
The SSH key itself shouldn't matter much: with correct setup, the key is added as trusted to the nodes during `gcloud compute ssh`, as long as the GCP user has permission.
The same key should be used for other GCP SSH jobs, so I would imagine it's more likely an issue with the CRI-O VM image not supporting the key agent, or something similar.
/assign @haircommander
cc: @ameukam
/kind failing-test
/remove-kind bug
From what I can tell, this was fixed by https://github.com/kubernetes/test-infra/pull/25080. Can this be closed, @ehashman?
Are we sure? The latest run failed: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1491479336685932544/
```
I0209 18:43:12.996] unable to create gce instance with running docker daemon for image fedora-coreos-35-20220116-3-0-gcp-x86-64. instance n1-standard-2-fedora-coreos-35-20220116-3-0-gcp-x86-64-2c22c0ee not running docker/containerd/crio daemon - Command failed: [email protected]: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
```
oops, I was looking in the wrong place :upside_down_face:
Any update on this?
Any update on this?
Still under investigation. We see different modes of failure between cgroup v1 and cgroup v2.
Maybe we need to set up a prow user on these VMs so FCOS behaves more like COS?
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1499826327681765376/
Still failing :\
Still doesn't appear to be working https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-node-kubelet-serial-crio-cgroupv2/1506314449798041600/
@ameukam looking at those logs:
- Are `GCE_SSH_PUBLIC_KEY_FILE` (`/etc/ssh-key-secret/ssh-public`) and `JENKINS_GCE_SSH_PUBLIC_KEY_FILE` (`/workspace/.ssh/google_compute_engine.pub`) the same? (Edit: looks like they're the same)
- Is `/etc/ssh-key-secret/ssh-public` automatically mounted into the GCE machine? Because we copy it in a startup systemd unit: https://github.com/kubernetes/test-infra/blob/657921483a72e4bab9b7478df4552019376750ea/jobs/e2e_node/crio/crio_serial.ign#L24
Edit: Looks like the key is not populated into the machine. The serial console states that the authorized-key.service failed:
```
[FAILED] Failed to start Copy authorized keys.
See 'systemctl status authorized-key.service' for details.
[   17.765906] audit: type=1131 audit(1648020852.285:194): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sshd-keygen@ed25519 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Started Authorization Manager.
[   17.897748] audit: type=1130 audit(1648020852.293:195): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Finished OpenSSH rsa Server Key Generation.
[  OK  ] Reached target sshd-keygen.target.
         Starting Generate SSH keys…nsole-login-helper-messages...
         Starting OpenSSH server daemon...
[  OK  ] Started OpenSSH server daemon.
[  OK  ] Finished Generate SSH keys…console-login-helper-messages.
[  OK  ] Started rpm-ostree System Management Daemon.
[FAILED] Failed to start Afterburn (SSH Keys).
…
Ignition: user-provided config was applied
No SSH authorized keys provided by Ignition or Afterburn
```
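For context, the failing `authorized-key.service` is defined through that Ignition config as a oneshot unit that copies the key into place. A paraphrased sketch of its shape (assumed; the unit name and description match the console output above, but this is not the verbatim contents of crio_serial.ign, and the source key path is a placeholder):

```json
{
  "ignition": { "version": "3.2.0" },
  "systemd": {
    "units": [
      {
        "name": "authorized-key.service",
        "enabled": true,
        "contents": "[Unit]\nDescription=Copy authorized keys\nAfter=network-online.target\n\n[Service]\nType=oneshot\nExecStart=/usr/bin/sh -c 'cat /path/to/injected-key >> /home/core/.ssh/authorized_keys'\n\n[Install]\nWantedBy=multi-user.target\n"
      }
    ]
  }
}
```

If this unit fails before the key lands in `authorized_keys`, publickey auth would be rejected exactly as in the job logs.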
Something like this may work: https://github.com/kubernetes/kubernetes/pull/108909
@saschagrunert can confirm both are identical.
This might get closed by the k/k PR, but we still need to update the CRI-O jobs to set the env var.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Reopen this issue or PR with `/reopen`
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
In response to this:
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.