e2e-framework fix: properly handle skip on assessments

What type of PR is this?

/kind bug

What this PR does / why we need it:

t.Skip with --fail-fast was resulting in the parent test being failed, this patch fixes that, however I'm not entirely sure about the sematic of --fail-fast, so I'd appreciate if someone could clarify that.

For sure, independently of --fail-fast a call to t.Skip* in an Assessment should only skip the current Assessment. But in case of calls to t.Fail or t.FailNow, what should be the behaviour in the two cases?

Which issue(s) this PR fixes:

Fixes https://github.com/kubernetes-sigs/e2e-framework/issues/472

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix bug failing tests being skipped when `--fail-fast` is set.

Additional documentation e.g., Usage docs, etc.:

Jan 03 '25 14:01 phisco

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: phisco Once this PR has been reviewed and has the lgtm label, please assign cpanato for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Jan 03 '25 14:01 k8s-ci-robot

Hi @phisco. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jan 03 '25 14:01 k8s-ci-robot

a call to Fail should not attempt to skip any remaining assessment or feature

a call to FailNow should skip all the remaining assessment but not skip the next features

Makes sense, @Fricounet, but what about --fail-fast? Should we skip the remaining assessments independently of which one was called, Fail or FailNow when --fail-fast is set? I'd say so from: https://github.com/kubernetes-sigs/e2e-framework/blob/b69e15831920a55e0c58ece5bf765594be7ecfdc/examples/fail_fast/README.md#L3-L11

Jan 08 '25 14:01 phisco

Makes sense, @Fricounet, but what about --fail-fast? Should we skip the remaining assessments independently of which one was called, Fail or FailNow when --fail-fast is set? I'd say so from:

I'd say so too @phisco. Though, I don't agree with everything written in the README. In particular:

This can aid in getting a better turn around time especially in cases where your assessments are inter-related and a failed assessment in step 1 can mean that running the rest of the assessment is guaranteed to fail. This can be done with the help of a --fail-fast flag provided at the framework level.

To me, this is the job of t.FailNow()

Basically:

as the test writer, I know when it makes sense for my test to keep going (use t.Fail()) or stop early because there is no point going further (use t.FailNow())
as the test executor, I know if I want my whole test suite to exit early at the first sign of failure (use --fail-fast). Hence --fail-fast should skip the remaining assessments regardless of Fail/FailNow

Hope that makes sense :smile:

Jan 08 '25 15:01 Fricounet

@phisco @Fricounet Thanks for this issue. The intent of --fail-fast is properly characterized here. It was a way to abruptly stop all assessments running. This was an early feature of the framework and if you guys think it could work better, let's keep this discussion going.

cc @harshanarayana

Jan 15 '25 14:01 vladimirvivien

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Apr 15 '25 14:04 k8s-triage-robot

/remove-lifecycle stale /ok-to-test

May 06 '25 13:05 vladimirvivien

/ok-to-test

May 06 '25 13:05 vladimirvivien

@phisco @Fricounet @harshanarayana

This PR slipped, but I think it's an important one to get clarity on and merge eventually.

Features such as --fail-fast as a way to unblock the framework's users, not knowing the full implications. Now that the framework as been enjoying wider adoption, I think we can revisit that feature to ensure it's working the best way possible.

So, here is what i am proposing (which aligns with what @Fricounet mentioned):

A test writer should use exit methods (t.Fail/t.FailNow/t.Fatal) for granular control of test steps
Call to exit methods should be applied to associated assessment/feature only, allowing remaining tests to continue
--fail-fast - short-circuits current and all remaining tests (features, and assessments) if an exit method is called
--fail-fast also short-circuits the lifecycle of the entire framework runtime if a structural error occurs (i.e. during initialization, setup step)

Hope this clears thing up. If not, let me know what I am missing or what you want to add.

May 06 '25 14:05 vladimirvivien

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Aug 04 '25 15:08 k8s-triage-robot

@phisco are you still interested in continuing with this change?

Aug 09 '25 16:08 vladimirvivien

/remove-lifecycle stale

Aug 09 '25 16:08 vladimirvivien

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Nov 07 '25 17:11 k8s-triage-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Nov 07 '25 17:11 k8s-ci-robot