
Issue with SSH key on k8s-infra-prow-build cluster for CRI-O nodes

Open ehashman opened this issue 4 years ago • 18 comments

What happened:

Every time we try to run CRI-O tests on the k8s-infra-prow-build cluster, SSH fails. Most recently:

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/107361/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1479158719756374016/#1:build-log.txt%3A327

W0106 18:45:27.531] I0106 18:45:27.530809    6869 ssh.go:120] Running the command ssh, with args: [-o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR -i /workspace/.ssh/google_compute_engine [email protected] -- sudo sh -c 'systemctl list-units  --type=service  --state=running | grep -e docker -e containerd -e crio']
W0106 18:45:27.942] E0106 18:45:27.942276    6869 ssh.go:123] failed to run SSH command: out: [email protected]: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
W0106 18:45:27.942] , err: exit status 255 

What you expected to happen:

Tests should be able to SSH successfully.

How to reproduce it (as minimally and precisely as possible):

We've seen this every time we've tried to migrate CRI-O tests to this cluster, e.g. https://github.com/kubernetes/kubernetes/issues/102624 https://github.com/kubernetes/test-infra/pull/24591

I think there is an issue with the SSH key. We keep working around this by avoiding running CRI-O tests on this cluster. Can someone from SIG Infra help out?

Please provide links to example occurrences, if any:

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/107361/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1479158719756374016/#1:build-log.txt%3A327

Anything else we need to know?:

/sig node k8s-infra

ehashman avatar Jan 06 '22 18:01 ehashman

SIG K8s Infra is most active in https://github.com/kubernetes/k8s.io, where most of the infra is managed (including the k8s-infra-prow-build cluster).

The SSH key itself shouldn't matter much; with the correct setup, the key should be added as trusted on the nodes when running gcloud compute ssh, as long as the GCP user has permission.

The same key should be used for other GCP SSH jobs, so more likely this is an issue with the CRI-O VM image not supporting the key agent.

BenTheElder avatar Jan 07 '22 18:01 BenTheElder

/assign @haircommander

SergeyKanzhelev avatar Jan 12 '22 18:01 SergeyKanzhelev

cc: @ameukam

SergeyKanzhelev avatar Jan 19 '22 18:01 SergeyKanzhelev

/kind failing-test
/remove-kind bug

SergeyKanzhelev avatar Jan 19 '22 18:01 SergeyKanzhelev

From what I can tell this was fixed by https://github.com/kubernetes/test-infra/pull/25080. Can this be closed, @ehashman?

haircommander avatar Feb 14 '22 21:02 haircommander

Are we sure? The latest run failed: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1491479336685932544/

 I0209 18:43:12.996] unable to create gce instance with running docker daemon for image fedora-coreos-35-20220116-3-0-gcp-x86-64.  instance n1-standard-2-fedora-coreos-35-20220116-3-0-gcp-x86-64-2c22c0ee not running docker/containerd/crio daemon - Command failed: [email protected]: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

ehashman avatar Feb 14 '22 23:02 ehashman

oops, I was looking in the wrong place :upside_down_face:

haircommander avatar Feb 15 '22 14:02 haircommander

Any update on this?

pacoxu avatar Feb 24 '22 06:02 pacoxu

> Any update on this

Still under investigation. We see different failure modes between the cgroupv1 and cgroupv2 jobs.

ameukam avatar Feb 24 '22 07:02 ameukam

Maybe we need to set up the prow user on these VMs so FCOS behaves more like COS?

haircommander avatar Mar 03 '22 16:03 haircommander

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1499826327681765376/

Still failing :\

ehashman avatar Mar 05 '22 01:03 ehashman

Still doesn't appear to be working https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-node-kubelet-serial-crio-cgroupv2/1506314449798041600/

ehashman avatar Mar 22 '22 22:03 ehashman

> Still doesn't appear to be working https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/92316/pull-kubernetes-node-kubelet-serial-crio-cgroupv2/1506314449798041600/

@ameukam looking at those logs:

  • Are GCE_SSH_PUBLIC_KEY_FILE (/etc/ssh-key-secret/ssh-public) and JENKINS_GCE_SSH_PUBLIC_KEY_FILE (/workspace/.ssh/google_compute_engine.pub) the same? (Edit: looks like they're the same.)
  • Is /etc/ssh-key-secret/ssh-public automatically mounted into the GCE machine? We copy it in a startup systemd unit: https://github.com/kubernetes/test-infra/blob/657921483a72e4bab9b7478df4552019376750ea/jobs/e2e_node/crio/crio_serial.ign#L24

Edit: Looks like the key is not populated into the machine. The serial console states that the authorized-key.service failed:

[FAILED] Failed to start Copy authorized keys.
See 'systemctl status authorized-key.service' for details.
[   17.765906] audit: type=1131 audit(1648020852.285:194): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sshd-keygen@ed25519 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Started Authorization Manager.
[   17.897748] audit: type=1130 audit(1648020852.293:195): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=chronyd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Finished OpenSSH rsa Server Key Generation.
[  OK  ] Reached target sshd-keygen.target.
         Starting Generate SSH keys…nsole-login-helper-messages...
         Starting OpenSSH server daemon...
[  OK  ] Started OpenSSH server daemon.
[  OK  ] Finished Generate SSH keys…console-login-helper-messages.
[  OK  ] Started rpm-ostree System Management Daemon.
[FAILED] Failed to start Afterburn (SSH Keys).
…
Ignition: user-provided config was applied
No SSH authorized keys provided by Ignition or Afterburn

saschagrunert avatar Mar 23 '22 07:03 saschagrunert

Something like this may work: https://github.com/kubernetes/kubernetes/pull/108909

saschagrunert avatar Mar 23 '22 08:03 saschagrunert

@saschagrunert I can confirm both are identical.

ameukam avatar Mar 23 '22 09:03 ameukam

This might get closed by the k/k PR, but we still need to update the CRI-O jobs to set the env var.

ehashman avatar Mar 29 '22 22:03 ehashman

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 28 '22 01:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 28 '22 02:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Aug 27 '22 02:08 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 27 '22 02:08 k8s-ci-robot