
pull-npd-e2e-test failing ssh handshake

Open wangzhen127 opened this issue 1 year ago • 14 comments

https://testgrid.k8s.io/presubmits-node-problem-detector#pull-npd-e2e-test started failing recently.

[1] NPD should export Prometheus metrics. When OOM kills and docker hung happen 
[1]   NPD should update problem_counter and problem_gauge
[1]   /home/prow/go/src/k8s.io/node-problem-detector/test/e2e/metriconly/metrics_test.go:158
[2] error dialing [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:54804->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:52980->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:53002->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:44696->35.184.209.153:22: read: connection reset by peer', retrying
[2] Error storing debugging data to test artifacts: [Error running command: {prow 35.184.209.153 curl http://localhost:20257/metrics   0 error getting SSH client to [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:52990->35.184.209.153:22: read: connection reset by peer'}
[2]  Error running command: {prow 35.184.209.153 sudo journalctl -u node-problem-detector.service   0 error getting SSH client to [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:44688->35.184.209.153:22: read: connection reset by peer'}
[2]  Error running command: {prow 35.184.209.153 sudo journalctl -k   0 error getting SSH client to [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:44708->35.184.209.153:22: read: connection reset by peer'}
[2] ]

This is affecting several different PRs: https://github.com/kubernetes/node-problem-detector/pull/955, https://github.com/kubernetes/node-problem-detector/pull/961, https://github.com/kubernetes/node-problem-detector/pull/969.

wangzhen127 avatar Oct 09 '24 16:10 wangzhen127

This looks like an infra issue. @BenTheElder Do you know who we should talk to?

CC @hakman

wangzhen127 avatar Oct 09 '24 17:10 wangzhen127

It's a problem with the jobs. SIG K8S infra does not create your test VMs. The test is attempting to SSH to a disposable test VM created by your job.

It seems like the VM is not serving SSH, or something similar.

BenTheElder avatar Oct 09 '24 17:10 BenTheElder

CC @DigitalVeer

wangzhen127 avatar Oct 09 '24 17:10 wangzhen127

If these are like node e2e tests, folks in SIG node might be familiar

SIG Testing strongly discourages SSH usage in cluster e2e tests, relying instead on hostexec pods when necessary. For some node-style testing that's not sufficient, though, and mostly folks in SIG Node work with this.

BenTheElder avatar Oct 09 '24 19:10 BenTheElder

It's possible there is an issue with the GCP projects rented by this test. It's unclear to me why the SSH connection is not working, but I'll try to debug with @hakman.

ameukam avatar Oct 10 '24 14:10 ameukam

This is an issue with cos-stable-117. SSH works pretty well in all other tests (which are similar). I tried to reproduce what happens with the ext4 test and found out that the command used in the test is:

echo "fake filesystem error from problem-maker" > /sys/fs/ext4/sda1/trigger_fs_error

Once this runs, the filesystem is remounted read-only and SSH stops working with Connection reset by peer:

[  169.101160] EXT4-fs error (device sda1): trigger_test_error:127: comm bash: fake filesystem error from problem-maker
[  169.108852] Aborting journal on device sda1-8.
[  169.115130] EXT4-fs (sda1): Remounting filesystem read-only

There may be some recent changes that affect the behaviour of trigger_fs_error. https://lore.kernel.org/all/[email protected]/t/#u

hakman avatar Oct 13 '24 20:10 hakman

New updates:

Talked to COS team and found the root cause: https://www.spinics.net/lists/linux-ext4/msg90066.html

The kernel commit changes EXT4_MF_FS_ABORTED to EXT4_FLAGS_SHUTDOWN when an fs error happens, so even though the fs is remounted read-only, files can't be read by anyone and SSH connections will fail.

This is an intentional change in the upstream kernel, so on the COS side they won't change it. The path forward would be updating the NPD test case for newer kernel versions (>=6.5.0-rc3).

wangzhen127 avatar Nov 06 '24 19:11 wangzhen127

The kernel commit changes EXT4_MF_FS_ABORTED to EXT4_FLAGS_SHUTDOWN when an fs error happens, so even though the fs is remounted read-only, files can't be read by anyone and SSH connections will fail.

@wangzhen127 I don't think SSH failing after this is an intended behaviour.

hakman avatar Nov 07 '24 04:11 hakman

Yeah, this is from the COS team's perspective: because the change is in the upstream kernel, there is not much they can do. So they recommend that we update the tests. Sorry for the confusion.

wangzhen127 avatar Nov 07 '24 19:11 wangzhen127

No worries, I just meant that maybe they can configure the SSH server to not fail completely. I agree that the FS should become read-only, but not accepting SSH connections is quite unexpected.

hakman avatar Nov 07 '24 20:11 hakman

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 05 '25 21:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 07 '25 22:03 k8s-triage-robot

/remove-lifecycle rotten
/lifecycle frozen

hakman avatar Mar 08 '25 02:03 hakman

I'll take this up. Instead of relying on ProblemMaker for this test, is creating a temporary read-only filesystem and remounting a viable alternative here? If the node won't accept SSH connections after the FS becomes read-only, I'm not quite sure how to proceed with the test assertions.

DigitalVeer avatar Mar 11 '25 12:03 DigitalVeer

Ubuntu 24.04 has the same issue. /reopen

hakman avatar Aug 12 '25 07:08 hakman

@hakman: Reopened this issue.

In response to this:

Ubuntu 24.04 has the same issue. /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 12 '25 07:08 k8s-ci-robot