Migrate CRI-O jobs away from `kubernetes_e2e.py`

The kubernetes_e2e.py script is deprecated and we should use kubetest2 instead.

All affected tests are listed in https://testgrid.k8s.io/sig-node-cri-o

cc @kubernetes/sig-node-cri-o-test-maintainers

Ref: https://github.com/kubernetes/test-infra/tree/master/scenarios, https://github.com/kubernetes/test-infra/issues/20760
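
For orientation, here is a minimal sketch of the shape these jobs take after migration, assuming the kubetest2 GCE deployer with its node tester (the flag names match the splitfs invocation quoted later in this thread; the project, image config, and regex values are placeholders):

```bash
# Rough shape of a migrated node job: the old kubernetes_e2e.py scenario
# invocation is replaced by the kubetest2 GCE deployer plus its node tester.
# Placeholder values; real jobs set these in their Prow config.
kubetest2-gce --test=node --down=false -- \
  --parallelism=8 \
  --gcp-zone=us-west1-b \
  --gcp-project=<gcp-project> \
  --repo-root=. \
  --image-config-file=<image-config>.yaml \
  --focus-regex='\[NodeConformance\]' \
  --skip-regex='\[Flaky\]|\[Slow\]|\[Serial\]' \
  --test-args='--container-runtime-endpoint=unix:///var/run/crio/crio.sock'
```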

saschagrunert avatar May 06 '24 09:05 saschagrunert

/sig node

haircommander avatar May 06 '24 13:05 haircommander

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 04 '24 14:08 k8s-triage-robot

/remove-lifecycle stale

saschagrunert avatar Aug 05 '24 07:08 saschagrunert

/triage accepted
/priority important-longterm

kannon92 avatar Aug 21 '24 17:08 kannon92

Does this still need help? Can I start looking at it?

elieserr avatar Sep 05 '24 12:09 elieserr

@elieser1101 I'd appreciate your eyes on that. :pray:

saschagrunert avatar Sep 05 '24 12:09 saschagrunert

/assign

elieserr avatar Sep 05 '24 12:09 elieserr

I opened several PRs to replicate the presubmit jobs. After they merge I would like to create a no-op PR to exercise all the changes I made and fix anything that's broken.

After that I can start working on the periodics. Reviews are needed.
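
Once those are in, the new jobs can be exercised on the no-op PR with the usual Prow trigger comments, for example (job names as they appear later in this thread):

```
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra-kubetest2
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra-kubetest2
```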

elieserr avatar Oct 01 '24 21:10 elieserr

Any feedback or suggestions would be appreciated.

/cc saschagrunert kannon92 krzyzacy

elieserr avatar Oct 01 '24 21:10 elieserr

The kubetest2 DRA jobs seem to have a syntax error:

  • https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/127985/pull-kubernetes-node-e2e-crio-cgrpv1-dra-kubetest2/1844649030915723264
  • https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/127985/pull-kubernetes-node-e2e-crio-cgrpv2-dra-kubetest2/1844649032593444864

Error: unknown flag: --label-filter

Should we fix that up here or is it another issue?
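
One way to check whether the flag is simply missing from the tester build the jobs use (rather than being a job-config typo) is to list the flags a locally built node tester exposes. A rough sketch, assuming kubetest2 is installed via its documented `go install` path and the standard `kubetest2-tester-<name>` binary layout:

```bash
# Install kubetest2 and its testers (adjust the version to whatever the jobs
# pin), then check whether the node tester knows about --label-filter.
go install sigs.k8s.io/kubetest2/...@latest
kubetest2-tester-node --help 2>&1 | grep -i 'label-filter' \
  || echo "flag not supported by this tester build"
```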

saschagrunert avatar Oct 11 '24 08:10 saschagrunert

Those two are part of the batch I migrated to kubetest2; I can look into it.

elieserr avatar Oct 11 '24 12:10 elieserr

Ah sorry, I missed this: https://github.com/kubernetes/test-infra/pull/33647

@elieser1101 There are quite a few failing.

kannon92 avatar Oct 14 '24 14:10 kannon92

With https://github.com/kubernetes/test-infra/pull/33658, pull-kubernetes-node-e2e-crio-cgrpv1-dra-kubetest2 is now passing. pull-kubernetes-node-e2e-crio-cgrpv2-dra-kubetest2 is similar and should be fixed as well.

  • example: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/127511/pull-kubernetes-node-e2e-crio-cgrpv1-dra-kubetest2/1846087616173182976

pacoxu avatar Oct 15 '24 07:10 pacoxu

For the tests that don't pass I can see the following (on https://github.com/kubernetes/kubernetes/pull/128092): pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2 fails, but the non-kubetest2 variant has also been failing for some time now:

  • https://testgrid.k8s.io/sig-node-presubmits#pr-crio-cgrpv1-evented-pleg-gce-e2e-kubetest2
  • https://testgrid.k8s.io/sig-node-presubmits#pr-crio-cgrpv1-evented-pleg-gce-e2e

As for pull-kubernetes-node-crio-cgrpv2-imagefs-e2e-kubetest2 and pull-kubernetes-node-crio-cgrpv2-splitfs-e2e-kubetest2:

  • https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/pr-logs/directory/pull-kubernetes-node-crio-cgrpv2-imagefs-e2e-kubetest2
  • https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/pr-logs/directory/pull-kubernetes-node-crio-cgrpv2-splitfs-e2e-kubetest2

Reviewing the job history, both worked at some point but are not consistent, so I'm not sure whether there is something left on my side to complete. Any pointers on how to proceed with these jobs would be helpful, @kannon92.

elieserr avatar Nov 08 '24 10:11 elieserr

The non-kubetest2 jobs for imagefs seem pretty green. It sounds like there is a kubetest2 migration issue.

kannon92 avatar Nov 08 '24 14:11 kannon92

@elieser1101

Where are we with this for presubmits?

kannon92 avatar Nov 21 '24 15:11 kannon92

Still no luck with the jobs mentioned here.

I think it could be related to a kubetest2 issue, for which I opened a PR: at the moment it is not possible to set --container-runtime-endpoint, which always defaults to containerd.

We can see that the command the jobs run includes the flag duplicated:

--container-runtime-endpoint=unix:///run/containerd/containerd.sock --container-runtime-endpoint=unix:///var/run/crio/crio.sock
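
For context, a trimmed sketch of the invocation the jobs are aiming for, where only the CRI-O endpoint should reach the test binary (flags taken from the splitfs command quoted later in this thread):

```bash
# Intended: a single --container-runtime-endpoint pointing at CRI-O, passed via
# the node tester's --test-args. The containerd value above is injected by
# kubetest2's current default, which is what the kubetest2 PR tries to fix.
kubetest2-gce --test=node --down=false -- \
  --test-args='--container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio'
```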

elieserr avatar Nov 25 '24 22:11 elieserr

I see an issue with the DRA tests:

  • https://testgrid.k8s.io/sig-node-cri-o#pr-node-kubelet-crio-cgrpv1-dra-kubetest2
  • https://testgrid.k8s.io/sig-node-cri-o#pr-node-kubelet-crio-cgrpv2-dra-kubetest2

I think there is an issue with the label-filter and it is not selecting the right tests for DRA.

kannon92 avatar Dec 02 '24 16:12 kannon92

@kannon92 @elieser1101 @haircommander Looking at failing splitfs and imagefs pr jobs, I noticed random test failures inside a container with this error message:

sh: error while loading shared libraries: /lib/libc.so.6: cannot apply additional memory protection after relocation: Permission denied"

This seems to be a core issue causing jobs to fail.

Unfortunately I can't reproduce it in my environment. Here is how I run kubetest2 for splitfs tests:

$ GCE_SSH_PUBLIC_KEY_FILE=/home/ed/.ssh/google_compute_engine.pub \
  KUBE_SSH_USER=core \
  IGNITION_INJECT_GCE_SSH_PUBLIC_KEY_FILE=1 \
  JENKINS_GCE_SSH_PRIVATE_KEY_FILE=/home/ed/.ssh/google_compute_engine \
  kubetest2-gce --test=node --down=false -- \
    --parallelism=8 \
    --gcp-zone=us-west1-b \
    --gcp-project=service-mesh-296815 \
    --repo-root=. \
    --image-config-file=/home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/crio/latest/image-config-cgroupv2-splitfs.yaml \
    --delete-instances=false \
    --test-args='--container-runtime-endpoint=unix:///var/run/crio/crio.sock --container-runtime-process-name=/usr/local/bin/crio --container-runtime-pid-file= --kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"crio.log\", \"journalctl\": [\"-u\", \"crio\"]}"' \
    --skip-regex='\[Flaky\]|\[Slow\]|\[Serial\]' \
    --focus-regex='\[NodeConformance\]|\[NodeFeature:.+\]|\[NodeFeature\]' \
    2>&1 | tee /tmp/log

I suspect this could be caused by the host/VM kernel and container resource restrictions, but I don't know how to specify the top-level instance image (gcr.io/k8s-staging-test-infra/kubekins-e2e:v20241128-8df65c072f-master) and container resources (4 CPUs and 6Gi of memory) when running kubetest2 locally.

Any ideas how to proceed further?
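
One way to get closer to the CI environment locally might be to run the same kubekins image with the job's resource limits and invoke kubetest2 from inside it. A rough sketch; the mounts and paths are assumptions and will need adjusting to your setup:

```bash
# Approximate the Prow pod locally: same image, same CPU/memory limits as the
# job, with local checkouts and credentials mounted in (adjust paths as needed).
docker run --rm -it \
  --cpus=4 --memory=6g \
  -v "$HOME/go/src/k8s.io:/home/prow/go/src/k8s.io" \
  -v "$HOME/.ssh:/root/.ssh:ro" \
  -v "$HOME/.config/gcloud:/root/.config/gcloud:ro" \
  gcr.io/k8s-staging-test-infra/kubekins-e2e:v20241128-8df65c072f-master \
  bash
# ...then run the kubetest2-gce command above from inside that shell.
```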

bart0sh avatar Dec 11 '24 09:12 bart0sh

Do you have access to the nodes you've provisioned, @bart0sh? Can I poke around? Basically, we want to be able to run ausearch -m AVC -ts recent after the failure to see what was being blocked; then we can update the SELinux policy we create in ignition to include the new option.
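
For reference, the rough workflow on a failing node would be something like this (a sketch; the module name is made up, and it assumes the audit/policycoreutils tooling is available on the image):

```bash
# List recent SELinux denials on the node, then turn them into a candidate
# policy module that could be folded back into the ignition config.
sudo ausearch -m AVC -ts recent
sudo ausearch -m AVC -ts recent | sudo audit2allow -M crio-e2e-local
# Review crio-e2e-local.te before trusting it; load it temporarily with:
sudo semodule -i crio-e2e-local.pp
```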

haircommander avatar Dec 11 '24 20:12 haircommander

@haircommander yes, I have access to the nodes, but I can't reproduce the error there :(

bart0sh avatar Dec 11 '24 20:12 bart0sh

BTW, decreasing parallelism seems to help a bit: https://testgrid.k8s.io/sig-node-presubmits#pr-crio-cgrpv2-splitfs-e2e-kubetest2&width=90

bart0sh avatar Dec 11 '24 21:12 bart0sh

I triggered the job a couple of times and it seems to be improved (it still failed at some point), though it also takes longer to complete. We could test the imagefs one with the same approach, @bart0sh, but I guess the right fix includes the SELinux change?

elieserr avatar Dec 11 '24 21:12 elieserr

I'm not sure about that. The SELinux configuration is the same for the kubetest2 and old jobs, but only the kubetest2 jobs fail.

bart0sh avatar Dec 11 '24 21:12 bart0sh

@elieser1101

We could test the imagefs one with the same approach

Decreasing parallelism improved the imagefs test. Previously I didn't see any successful job runs; with the change I can see at least one so far.

bart0sh avatar Dec 12 '24 14:12 bart0sh

I'm still wondering why I can't reproduce error while loading shared libraries: /lib/libc.so.6: cannot apply additional memory protection after relocation: Permission denied in my environment. I'm using the same kubetest2 command-line parameters and the same image configs, .ign file, instance type, GCP zone, etc. Even using the --processes=100 command-line option doesn't help to trigger the error.

bart0sh avatar Dec 13 '24 10:12 bart0sh

Unfortunately, using a more powerful instance didn't change much for the imagefs job. I can still see the same error in the logs.

bart0sh avatar Dec 13 '24 23:12 bart0sh

@elieser1101 I can see a lot of green kubetest2 jobs in TestGrid. Is there anything that prevents replacing the kubernetes_e2e.py jobs with them? I did it for the splitfs and imagefs jobs as I was involved in fixing them. I can do it for the rest of the jobs if needed.

bart0sh avatar Dec 18 '24 01:12 bart0sh

@bart0sh thank you very much for the splitfs/imagefs work, that was a great finding.

What would come next is to validate that the kubetest2 jobs are actually working. I noticed that some of the jobs are completing but are skipping all the specs; we would like to make sure we are running the jobs properly before replacing the kubernetes_e2e.py jobs.
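
One quick way to spot the "green but empty" runs is to look at the ginkgo summary line in each build log. A sketch using gsutil; the gs:// path is a placeholder to copy from a run's Artifacts view in Spyglass:

```bash
# A run that skipped everything reports "Ran 0 of N Specs" in its summary.
LOG="gs://kubernetes-ci-logs/<path-to-run>/build-log.txt"  # placeholder path
gsutil cat "$LOG" | grep -E 'Ran [0-9]+ of [0-9]+ Specs'
```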

At the moment I'm looking at the DRA ones, which were missing some kubetest2 features.

elieserr avatar Dec 18 '24 12:12 elieserr

@elieser1101 pull-crio-cgroupv2-node-e2e-eviction-kubetest2 fails with Context was cancelled (cause: suite timeout occurred) after 235.856s., which is quite strange as I don't see this timeout specified anywhere. The corresponding non-kubetest2 test case has a longer timeout and passes, so this seems to be caused by kubetest2. Do you happen to know the reason? Did you see this error in other job logs?
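
For what it's worth, that message looks like ginkgo's suite-timeout wording, so it may be worth checking which timeout the kubetest2 invocation ends up passing to ginkgo. A rough sketch for digging through a failed run's build log; the gs:// path is a placeholder:

```bash
# Look for any timeout-related flags in the failed run's log to see where the
# ~235s budget comes from (job config vs. a kubetest2/ginkgo default).
LOG="gs://kubernetes-ci-logs/<path-to-failed-run>/build-log.txt"  # placeholder
gsutil cat "$LOG" | grep -oE -e '--[a-z.-]*timeout[= ][^ ]+' | sort | uniq -c
```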

bart0sh avatar Jan 02 '25 17:01 bart0sh