
Lingering 3rd party project dependency on prow.k8s.io

Open BenTheElder opened this issue 1 year ago • 20 comments

See also the umbrella issue at https://github.com/kubernetes/k8s.io/issues/7708 and previous discussion at https://github.com/kubernetes/test-infra/issues/12863

Right now this represents an ambiguous policy gap where we only provide CI for Kubernetes subprojects ... except cadvisor and containerd. AFAICT all others are on their own non-kubernetes-provided CI now.

While the cost is likely not high, it is difficult to reason about a consistent policy currently. With these two small exceptions resolved, we could pretty reasonably state that we provide CI for the Kubernetes project and its official subprojects, within reason (with some right reserved to deal with budget-busting usage and/or abuse, e.g. "crypto mining").

Otherwise we need to articulate coherent reasoning that does not open us up to hosting the entire landscape's CI. We already have a huge problem on our hands handling the hundreds of Kubernetes subprojects, scale testing, content distribution, etc.

/sig k8s-infra
/sig testing
cc @kubernetes/sig-k8s-infra-leads @kubernetes/sig-testing-leads

BenTheElder avatar Jan 22 '25 17:01 BenTheElder

@dims confirmed that cadvisor's prow CI has actually been non-functional since the prow.k8s.io control plane migration from the google.com to the kubernetes.io (community) GCP project, so that leaves only github.com/containerd/containerd and Kubernetes' own projects/subprojects.

BenTheElder avatar Jan 22 '25 17:01 BenTheElder

Some discussion in https://kubernetes.slack.com/archives/CCK68P2Q2/p1737568484394929?thread_ts=1737566661.037439&cid=CCK68P2Q2

BenTheElder avatar Jan 22 '25 18:01 BenTheElder

Do we want to include CRI-O, given that some tests use the K8s infrastructure? https://testgrid.k8s.io/sig-node-cri-o

ameukam avatar Jan 22 '25 18:01 ameukam

Do we want to include CRI-O, given that some tests use the K8s infrastructure? https://testgrid.k8s.io/sig-node-cri-o

My understanding is that these jobs are not directly connected to the CRI-O repo: they run kubelet against a stable CRI-O release so that we don't test kubelet with only a single CRI implementation (other jobs mostly use stable containerd).

That's different from jobs aimed at testing the development of a third-party project (or worse, requiring additional over-permissioning of our CI accounts for presubmits, webhooks, etc.), unless I've missed something.

If there are jobs testing cri-o development, I think that would need to be discussed here.

Similarly, to create a cluster to test Kubernetes we ultimately use other projects, but we are not operating CI for those projects.

(So we would not remove any of those jobs, but we would decline, say, a job testing cilium @ HEAD; the difference is in which repo's changes are under test.)

BenTheElder avatar Jan 22 '25 18:01 BenTheElder

Right now this represents an ambiguous policy gap where we only provide CI for Kubernetes subprojects ... except cadvisor and containerd

The containerd prow jobs are specifically testing compatibility with Kubernetes using node e2e tests. We have separate CI for core containerd and for CRI using critest that runs in GitHub Actions via the containerd org (and funded by CNCF). I think if we want to migrate containerd completely off prow jobs, we'd need guidance on how to best run the node e2e tests elsewhere.

samuelkarp avatar Jan 22 '25 18:01 samuelkarp

The containerd prow jobs are specifically testing compatibility with Kubernetes using node e2e tests.

Sure, but every other landscape project, especially CNI/CSI implementations, could argue the same, and yet we're not hosting CI for those repos (unless they are a subproject).

We have separate CI for core containerd and for CRI using critest that runs in GitHub Actions via the containerd org (and funded by CNCF). I think if we want to migrate containerd completely off prow jobs, we'd need guidance on how to best run the node e2e tests elsewhere.

SIG Node should know best how to run the node e2e tests, but I think you could run them against a Vagrant VM in Actions, as suggested by @upodroid in the Slack thread.

BenTheElder avatar Jan 22 '25 19:01 BenTheElder

I've opened https://github.com/containerd/containerd/issues/11486 to track the work on the containerd side.

samuelkarp avatar Mar 05 '25 20:03 samuelkarp

but I think you could run these against a vagrant VM in actions as suggested by @upodroid in the slack thread.

Maybe make that limactl instead though. https://github.com/lima-vm/lima

edit: https://github.com/kubernetes-sigs/kind/blob/main/.github/workflows/vm.yaml is using this

BenTheElder avatar Mar 05 '25 20:03 BenTheElder

Lima

I also created https://github.com/lima-vm/lima-actions to simplify the setup

```yaml
steps:
  - uses: actions/checkout@v4

  - uses: lima-vm/lima-actions/setup@v1
    id: lima-actions-setup

  - uses: actions/cache@v4
    with:
      path: ~/.cache/lima
      key: lima-${{ steps.lima-actions-setup.outputs.version }}

  - run: limactl start --plain --name=default --cpus=1 --memory=1 template://fedora

  - uses: lima-vm/lima-actions/ssh@v1

  - run: rsync -a -e ssh . lima-default:/tmp/repo

  - run: ssh lima-default ls -l /tmp/repo
```

AkihiroSuda avatar Mar 06 '25 05:03 AkihiroSuda
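For the node e2e use case discussed earlier in the thread, the snippet above only syncs the checkout into the guest; an actual job would need a further step that runs the tests inside the VM over the ssh connection the `lima-vm/lima-actions/ssh` step configures. A hypothetical continuation (the `make test` target is a placeholder, not taken from any real job) might look like:

```yaml
  # Hypothetical continuation of the steps above: run the synced project's
  # test target inside the Lima guest over ssh. "make test" is illustrative.
  - run: ssh lima-default "cd /tmp/repo && make test"
```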

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 04 '25 06:06 k8s-triage-robot

/remove-lifecycle stale

xmudrii avatar Jun 04 '25 10:06 xmudrii

I think this is primarily https://github.com/containerd/containerd/issues/11486

TestGrid will be tricky later, as currently we only provide DNS for it.

#7710 was resolved, so this is the main outstanding issue with conflated infra after all of the migrations out of vendor-owned accounts into k8s-infra / CNCF-owned accounts.

BenTheElder avatar Jun 04 '25 15:06 BenTheElder

Opened https://github.com/containerd/containerd/pull/12028 to migrate the node e2e presubmit to GitHub Actions

chrishenzie avatar Jun 27 '25 05:06 chrishenzie

Is the focus of this effort to just remove the containerd presubmit jobs?

From what I can tell, the periodic and postsubmit jobs exist to build cached versions of containerd so it doesn't have to be rebuilt on every CI run.

chrishenzie avatar Jun 27 '25 06:06 chrishenzie
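If those periodic/postsubmit jobs exist only to pre-build containerd, a rough GitHub Actions equivalent could be a cache step keyed on the containerd sources, rebuilding only on a cache miss. A hedged sketch (paths, keys, and the build command are illustrative, not taken from any real job config):

```yaml
steps:
  # Restore a previously built containerd if one exists for these sources;
  # otherwise build it and let the post-job cache save step store the result.
  - uses: actions/cache@v4
    id: containerd-build
    with:
      path: ~/containerd-bin
      key: containerd-bin-${{ hashFiles('containerd/**/*.go') }}

  - if: steps.containerd-build.outputs.cache-hit != 'true'
    run: |
      make -C containerd binaries
      mkdir -p ~/containerd-bin && cp containerd/bin/* ~/containerd-bin/
```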

/lifecycle stale

k8s-triage-robot avatar Sep 25 '25 06:09 k8s-triage-robot

Is the focus of this effort to just remove the containerd presubmit jobs?

IMHO the focus was to totally decouple the infra from these projects, so we no longer have exceptions to "SIG K8s Infra == infra for the Kubernetes orgs", to avoid awkward/difficult carve-outs.

I've also stepped down as a TL though, so this is just my take on the original context.

BenTheElder avatar Oct 17 '25 19:10 BenTheElder

/lifecycle rotten

k8s-triage-robot avatar Nov 16 '25 19:11 k8s-triage-robot

/remove-lifecycle rotten

xmudrii avatar Nov 17 '25 13:11 xmudrii

https://github.com/kubernetes/test-infra/pull/36004 landed recently to decouple cadvisor.

BenTheElder avatar Dec 08 '25 18:12 BenTheElder

I think what's remaining is containerd.

ameukam avatar Dec 08 '25 20:12 ameukam