node-problem-detector
pull-npd-e2e-test failing ssh handshake
https://testgrid.k8s.io/presubmits-node-problem-detector#pull-npd-e2e-test started failing recently.
[1] NPD should export Prometheus metrics. When OOM kills and docker hung happen
[1] NPD should update problem_counter and problem_gauge
[1] /home/prow/go/src/k8s.io/node-problem-detector/test/e2e/metriconly/metrics_test.go:158
[2] error dialing [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:54804->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:52980->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:53002->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:44696->35.184.209.153:22: read: connection reset by peer', retrying
[2] Error storing debugging data to test artifacts: [Error running command: {prow 35.184.209.153 curl http://localhost:20257/metrics 0 error getting SSH client to [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:52990->35.184.209.153:22: read: connection reset by peer'}
[2] Error running command: {prow 35.184.209.153 sudo journalctl -u node-problem-detector.service 0 error getting SSH client to [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:44688->35.184.209.153:22: read: connection reset by peer'}
[2] Error running command: {prow 35.184.209.153 sudo journalctl -k 0 error getting SSH client to [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:44708->35.184.209.153:22: read: connection reset by peer'}
[2] ]
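For reference, the failing test scrapes NPD's Prometheus endpoint (the `curl http://localhost:20257/metrics` call visible in the log above) and asserts on `problem_counter` and `problem_gauge`. A minimal sketch of that check — the sample scrape below is illustrative, not real test output:

```shell
# Sketch: filter the two metric families the failing test asserts on from a
# Prometheus text-format scrape. On a live node this would come from
# `curl -s http://localhost:20257/metrics`; a sample scrape is inlined here
# for illustration only.
scrape='# TYPE problem_counter counter
problem_counter{reason="OOMKilling"} 1
# TYPE problem_gauge gauge
problem_gauge{reason="DockerHung",type="KernelDeadlock"} 0'
printf '%s\n' "$scrape" | grep -E '^problem_(counter|gauge)'
```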
This is affecting several different PRs: https://github.com/kubernetes/node-problem-detector/pull/955, https://github.com/kubernetes/node-problem-detector/pull/961, https://github.com/kubernetes/node-problem-detector/pull/969.
This looks like an infra issue. @BenTheElder Do you know who we should talk to?
CC @hakman
It's a problem with the jobs. SIG K8S infra does not create your test VMs. The test is attempting to SSH to a disposable test VM created by your job.
Seems like the VM is not serving SSH, or something similar.
CC @DigitalVeer
If these are like node e2e tests, folks in SIG node might be familiar
SIG Testing strongly discourages SSH usage in cluster e2e tests, relying instead on hostexec pods where necessary. For some node-style testing that's not sufficient, though, and it's mostly folks in SIG Node who work with this.
It's possible there is an issue with the GCP projects rented by this test. It's unclear to me why the SSH connection is not working, but I'll try to debug with @hakman.
This is an issue with cos-stable-117. SSH works pretty well in all other tests (which are similar).
I tried to reproduce what happens with the ext4 test and found out that the command used in the test is:
echo "fake filesystem error from problem-maker" > /sys/fs/ext4/sda1/trigger_fs_error
Once this runs, the filesystem is remounted read-only and SSH stops working with "Connection reset by peer":
[ 169.101160] EXT4-fs error (device sda1): trigger_test_error:127: comm bash: fake filesystem error from problem-maker
[ 169.108852] Aborting journal on device sda1-8.
[ 169.115130] EXT4-fs (sda1): Remounting filesystem read-only
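A quick way to confirm the remount from the dmesg excerpt above, assuming you still have a serial console on the VM (SSH is already dead at this point). A sample /proc/mounts line is inlined for illustration; on a live node you would read /proc/mounts directly:

```shell
# Sketch: after trigger_fs_error, the root fs shows up read-only in
# /proc/mounts. On a live node: awk '$2 == "/"' /proc/mounts
mounts='/dev/sda1 / ext4 ro,relatime 0 0'
printf '%s\n' "$mounts" | awk '$2 == "/" && $4 ~ /^ro/ { print "root remounted read-only" }'
```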
There may be some recent changes that affect the behaviour of trigger_fs_error.
https://lore.kernel.org/all/[email protected]/t/#u
New updates:
Talked to COS team and found the root cause: https://www.spinics.net/lists/linux-ext4/msg90066.html
The kernel commit changes EXT4_MF_FS_ABORTED to EXT4_FLAGS_SHUTDOWN when an fs error happens, so even though the fs is remounted read-only, files can't be read by anyone and SSH connections fail.
This is an intentional change in the upstream kernel, so the COS team won't change it on their side. The path forward would be updating the NPD test case for newer kernel versions (>=6.5.0-rc3).
> The kernel commit changes EXT4_MF_FS_ABORTED to EXT4_FLAGS_SHUTDOWN when fs error happens so though the fs is remounted as read-only, files can't be read by anyone and SSH connections will fail.
@wangzhen127 I don't think SSH failing after this is an intended behaviour.
Yeah, this is from the COS team's perspective: because the change is in the upstream kernel, there is not much they can do, so they recommend we update the tests. Sorry for the confusion.
No worries, I just meant that maybe they can configure the SSH server to not fail completely. I agree that the FS should become read-only, but not accepting SSH connections is quite unexpected.
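One way to act on the "update the NPD test case for newer kernels" suggestion above is to gate the ext4 assertions on the kernel version. A hedged sketch — the 6.5.0 threshold comes from the >=6.5.0-rc3 note, and `check_kernel` is a hypothetical helper, not part of NPD:

```shell
# Sketch: decide which ext4 error behavior to expect for a kernel version.
# < 6.5 : EXT4_MF_FS_ABORTED -> read-only remount, SSH keeps working.
# >= 6.5: EXT4_FLAGS_SHUTDOWN -> fs shut down, reads fail, SSH drops.
check_kernel() {
  # sort -V picks the higher version; if the input is the max, it is >= 6.5.0
  if [ "$(printf '%s\n6.5.0\n' "$1" | sort -V | tail -n1)" = "$1" ]; then
    echo "shutdown"
  else
    echo "read-only"
  fi
}
check_kernel "$(uname -r | cut -d- -f1)"
```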
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle frozen
I'll take this up. Instead of relying on ProblemMaker for this test, is creating a temporary read-only filesystem and remounting a viable alternative here? If the node won't accept SSH connections after the FS becomes read-only, I'm not quite sure how to proceed with the test assertions.
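To make the question concrete, here is a rough sketch of the scratch-filesystem idea (hypothetical and untested against the e2e harness; paths and sizes are arbitrary). The point is that the error would hit only a disposable loop device, so / stays read-write and sshd survives:

```shell
# Sketch: build a scratch ext4 image so trigger_fs_error can be aimed at a
# disposable device instead of the root filesystem.
img=$(mktemp /tmp/npd-scratch.XXXXXX)
dd if=/dev/zero of="$img" bs=1M count=16 status=none
mkfs.ext4 -q -F "$img"   # works on a regular file, no root needed
echo "scratch ext4 image ready: $img"
# The remaining steps need root on the test VM (not run here):
#   dev=$(sudo losetup --find --show "$img")
#   sudo mount "$dev" /mnt/scratch
#   echo "fake filesystem error from problem-maker" | \
#     sudo tee "/sys/fs/ext4/$(basename "$dev")/trigger_fs_error"
# Whether the error surfaces identically on a loop device under >= 6.5
# kernels is an open question.
```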
Ubuntu 24.04 has the same issue.
/reopen
@hakman: Reopened this issue.
In response to this:
Ubuntu 24.04 has the same issue. /reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.