test-infra pleg: enable tests on file changes

Fix: Enable Automatic Test Triggering for PLEG Component Currently, tests are not triggered automatically as seen in PR #126843.

Added configuration to trigger tests automatically when files in the PLEG component are modified.
Updated run_if_changed to include the PLEG path: pkg/kubelet/pleg.

Sep 12 '24 09:09 oxxenix

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: oxxenix Once this PR has been reviewed and has the lgtm label, please assign priyankasaggu11929 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

config/OWNERS

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Sep 12 '24 09:09 k8s-ci-robot

Welcome @oxxenix!

It looks like this is your first PR to kubernetes/test-infra 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/test-infra has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

Sep 12 '24 09:09 k8s-ci-robot

Hi @oxxenix. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sep 12 '24 09:09 k8s-ci-robot

/cc @bart0sh, @pohly

Sep 12 '24 10:09 oxxenix

Hmm. Why are you doing this?

Sep 12 '24 12:09 kannon92

You can trigger pleg jobs manually and I hope we have them running in the periodic.

Sep 12 '24 12:09 kannon92

/hold

we are not going to add automatic presubmits for alpha features, that does not scale well,

Please follow Kevin suggestions, it is ok to setup periodic jobs, but we need to keep a high bar on presubmits

Thanks

Sep 12 '24 14:09 aojea

@aojea One minor correction. We have a high bar for required presubmits. Please feel free to add presubmits for alpha features but we should not require them to run on changes.

Sep 12 '24 14:09 kannon92

@kannon92

You can trigger pleg jobs manually and I hope we have them running in the periodic.

Does this mean that we're ok not to test PLEG changes automatically because this is an Alpha feature? So, we basically allow to break CI and prefer to investigate breakages later, when periodic job breaks, correct?

Sep 12 '24 22:09 bart0sh

@aojea

we are not going to add automatic presubmits for alpha features, that does not scale well,

There are alpha features that are tested this way as far as I can see.

Sep 12 '24 22:09 bart0sh

@aojea

we are not going to add automatic presubmits for alpha features, that does not scale well,

There are alpha features that are tested this way as far as I can see.

well , that I don't agree with, specially for alphas like EventedPLEG and Pod Vertical Autoscaling that are very brittle, as a developer don't want to see my PR blocked for a job like this.

As an EventedPLEG or Pod Vertical Autoscaling I can run my presubmit manually or locally, no need to make the entire project to run the tests of my feature

Sep 13 '24 06:09 aojea

Does this mean that we're ok not to test PLEG changes automatically because this is an Alpha feature? So, we basically allow to break CI and prefer to investigate breakages later, when periodic job breaks, correct?

That statement is not fair, evented PLEG has been broken since always, and has downgraded in 1.30, there was a bug in 1.27 reporting the problem and nobody checked at it until I put a job that consistenty failing on that problem, there are periodic jobs and manual jobs and users reporting issues ..., let's don't use the automation on presubmits as excuse,

Sep 13 '24 06:09 aojea

as a developer don't want to see my PR blocked for a job like this

But it wouldn't be blocked, would it? The pre-submit jobs for alpha features have to be optional (only run in some cases) and must not be required (don't block merging when they fail).

What can be a problem is the resulting confusion when some optional job gets triggered by someone who happened to make some minor change in a file that is being watched and then the job fails. That developer then typically won't know whether it's really okay to ignore the failure.

I think whoever owns that job is on the hook for ensuring that it runs reliably, despite the alpha feature. This disqualifies pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e which has known issues.

Sep 13 '24 07:09 pohly

@aojea

As an EventedPLEG or Pod Vertical Autoscaling I can run my presubmit manually or locally, no need to make the entire project to run the tests of my feature

To be honest, I'm having a hard time to understand the general idea. So far I was under impression that CI jobs aim to guard the project from breakages by running certain checks when PR changes certain part of the codebase. Can you explain what's the idea of not running for example DRA(???) jobs on DRA code changes? To free some resources on CI infra?

Sep 13 '24 07:09 bart0sh

To be honest, I'm having a hard time to understand the general idea. So far I was under impression that CI jobs aim to guard the project from breakages by running certain checks when PR changes certain part of the codebase. Can you explain what's the idea of not running for example DRA(???) jobs on DRA code changes? To free some resources on CI infra?

Yes, sorry, I didn't explain myself correctly, my apologies.

We need to follow some strategy that is sustainable and incentivize the right behaviors.

Adding everything as presubmit has some undesired consequences:, it affects ALL developers in the project, not only the ones working in that specific feature. It also incentivize the wrong behaviors, as people may think: "as a feature owner I don't need to check my feature state because if something breaks the CI will block it". This second part is the one that we need to avoid, because feature X can break because someone breaks it or because and external dependency, and this makes a problem of one group a problem of everybody.

In this example, if I somehow edit some header in pleg and the job is red, should I merge with the job failing or should I debug an alpha feature that I don't know what is about? is warning about the job is red something will trigger some reaction? we have proof this does not work, jobs and tests red for weeks and months that end , EventedPLEG is the better example

To summarize, a strategy for testing that is sustainable

There are two type of jobs, presubmit and periodic. Presubmit can be: manually triggered, by regex or on every change. Presubmit avoid to break existing code, presubmit are expensive and impact developer experience Periodics can detect code breaks and send notifications, it just requires the feature owners to monitor and attend to the alerts, this is the part I see people prefer to put jobs as presubmit so "community" owns the problem In addition, Authors must ensure the code is tested and Reviewers and Approvers must ensure the code is tested

EDIT

to be clear, I'm scoping this to alpha features ... on enabled by default or other environmental features the topic is more complicated and I agree run_if_changed is appropiate. on same cases

Sep 13 '24 08:09 aojea

@aojea Thank you for the explanations. Much appreciated!

In this example, if I somehow edit some header in pleg and the job is red, should I merge with the job failing or should I debug an alpha feature that I don't know what is about?

It depends on if you care or not. You can either ignore it as these check is not required or let the feature developer know (create issue, ping on slack, etc) about a breakage. However, generally it's a useful info and doesn't block you, I believe.

Presubmit can be: manually triggered, by regex or on every change.

That's the main purpose of this PR as far as I understand - to trigger PLEG CI job when PLEG codebase is changed, isn't it? This is similar to other alpha jobs, e.g. DRA and InPlacePodVerticalScaling.

The alternative you're suggesting is that all people changing certain code area should know which presubmits to trigger manually if they care, correct? How they can find out the list of presubmits to trigger, especially if their changes affect a lot of areas?

presubmit are expensive

Only if they're triggered for unrelated codebase changes, I think. It's not the case for this particular PR.

Sep 13 '24 08:09 bart0sh

The alternative you're suggesting is that all people changing certain code area should know which presubmits to trigger manually if they care, correct? How they can find out the list of presubmits to trigger, especially if their changes affect a lot of areas?

I expect the reviewer to know this, and the approver to double check it, though it is not always what happens :(

presubmit are expensive

Only if they're triggered for unrelated codebase changes, I think. It's not the case for this particular PR.

I recognize I don't have a good answer Ed, if you are saying people working on this feature is going to be accountable and react faster, I really don't have much objection, per example, with DRA we all know things are addressed timely ...

my main concern is that we end again in a situation with hundreds of jobs broken and nobody looking at them, at the end is a question of community and trust

Does these jobs have an alert set to warn when they are failing?

I will remove my hold once I see this jobs alert a group of people and/or slack channel, that will give me clear signal the job will not be left orphan

Sep 13 '24 08:09 aojea

@aojea I don't have a good answer either. It depends on PLEG devs which way to choose. @harche @haircommander WDYT?

Sep 13 '24 11:09 bart0sh

/ok-to-test

Sep 13 '24 11:09 bart0sh

@aojea I don't have a good answer either. It depends on PLEG devs which way to choose. @harche @haircommander WDYT?

LGTM

Sep 13 '24 13:09 harche

@aojea

I will remove my hold once I see this jobs alert a group of people and/or slack channel

There is kubernetes-sig-node-test-failures group, with 53 subscribers. I think @harche and @haircommander are subscribed. Is that enough to remove your hold?

Sep 13 '24 14:09 bart0sh

There is kubernetes-sig-node-test-failures group, with 53 subscribers. I think @harche and @haircommander are subscribed. Is that enough to remove your hold?

/hold cancel

fair enough, thanks for the healthy discussion

Sep 13 '24 15:09 aojea

So my only issue with approving this is I don't really know what the sig-node status is on this feature.

Is anyone working on getting this out of alpha?

Sep 13 '24 17:09 kannon92

@kannon92 I can see that PLEG is proposed for consideration on the 1.32 KEP planning board.. @harche are we going to work on it in 1.32 timeframe?

Sep 13 '24 19:09 bart0sh

EventedPLEG feature is alpha and is disabled on alpha jobs because it break the clusters, that is why I was specially hesitant to add these jobs as presubmit

Sep 14 '24 21:09 aojea

@harche are we going to work on it in 1.32 timeframe?

I am running short on time due to other more critical commitments, just checking with @pacoxu, @oxxenix and @hshiina if they have some bandwidth in 1.32 for this.

Sep 16 '24 15:09 harche

I am running short on time due to other more critical commitments, just checking with @pacoxu, @oxxenix and @hshiina if they have some bandwidth in 1.32 for this.

Some Evented PLEG bugfix PRs(https://github.com/kubernetes/kubernetes/pull/122778, https://github.com/kubernetes/kubernetes/pull/124953, https://github.com/kubernetes/kubernetes/pull/126415, https://github.com/kubernetes/kubernetes/pull/127195) are pending on review or approval. We may need approvers' help in this case.

BTW, after a sidecar regression fix(https://github.com/kubernetes/kubernetes/pull/126543), there is a new issue(https://github.com/kubernetes/kubernetes/issues/127312) there and I think we can give it a try to fix it in v1.32. Again, this may need a confirmation from SIG node leads.

Sep 18 '24 07:09 pacoxu

I still think you can achieve the same with periodics, there is no need to block all the project on this alpha feature, specially this one that has 5 pending fixes and a large record of instability

Sep 18 '24 08:09 aojea

I still think you can achieve the same with periodics, there is no need to block all the project on this alpha feature, specially this one that has 5 pending fixes and a large record of instability

Agree with @aojea. And I prefer to do this after EventedPLEG is promoted to BETA or at least blockers are fixed.

Sep 18 '24 08:09 pacoxu

I agree as well, and I'm ok to close this PR, considering the above mentioned reasons.

Sep 18 '24 11:09 bart0sh