cylc-flow
Log job failure even when it does not cause a change in task state.
Closes #6151
Check List
- [x] I have read `CONTRIBUTING.md` and added my name as a Code Contributor.
- [x] Contains logically grouped changes (else tidy your branch by rebase).
- [x] Does not contain off-topic changes (use other PRs for other changes).
- [x] Applied any dependency changes to both `setup.cfg` (and `conda-environment.yml` if present).
- [x] Tests are included (or explain why tests are not needed).
- [x] `CHANGES.md` entry included if this is a change that can affect users.
- [x] Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
- [x] If this is a bug fix, PR should be raised against the relevant `?.?.x` branch.
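For context, here is a minimal sketch of the behaviour being fixed, pieced together from the discussion below; the names (`log_task_message`, `state_change`) are illustrative stand-ins, not the real cylc-flow code. Previously, a job message that did not change the task state was demoted to DEBUG, which hid genuine failures that merely triggered a retry:

```python
import logging

LOG = logging.getLogger('cylc')

# Illustrative only: "level" and "state_change" are made-up stand-ins for
# the real logic in cylc-flow's _process_message_check.
def log_task_message(level, message, state_change):
    if not state_change:
        # old behaviour: the message was demoted to DEBUG, so a genuine
        # job failure that merely triggered a retry was hidden
        LOG.debug(message)
    else:
        LOG.log(level, message)
```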
I think this diff means that polled log messages for task failure will go back to being duplicated.
I think that you have a particular closed issue in mind, but I can't find it... Can you point it out to me?
This only covers failure, but submission failure also has retries so may require similar treatment.
I think that submission failure is already handled correctly - it certainly is in the simple case: if you feed it `platform = garbage` you get a "submission failed" message at CRITICAL level:

```
CRITICAL - [1/bar/01:preparing] submission failed
CRITICAL - [1/bar/02:preparing] submission failed
CRITICAL - [1/bar/03:preparing] submission failed
```
These are logged at critical - and I think they should be?
I think the failures before retries are exhausted will now get logged at CRITICAL level rather than INFO.
This would be consistent with submit failure...
> I think this diff means that polled log messages for task failure will go back to being duplicated.
>
> I think that you have a particular closed issue in mind, but I can't find it... Can you point it out to me?
No, I'm not thinking of the other log message duplication issue.
The change made here bypassed logic that was used for suppressing duplicate log messages (the `f'{FAIL_MESSAGE_PREFIX}ERR'` bit):
https://github.com/cylc/cylc-flow/pull/6169/commits/8f20ab088d2305670667f1beadd31de5611e5ec0#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eL930
However, in your more recent "fix" commit, you have put this back the way it was before:
https://github.com/cylc/cylc-flow/pull/6169/commits/3cedf2facefc58c61b14acc9cf44e0fee5f44710#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eR930
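For illustration, a minimal sketch of the kind of suppression being described; `FAIL_MESSAGE_PREFIX` mirrors the constant mentioned above, but the dictionary and function are made up for this sketch and are not the real cylc-flow implementation:

```python
# Hypothetical sketch of the suppression idea: polling can report the same
# failure more than once, so remember the last failure message logged per
# job and skip exact repeats.
import logging

LOG = logging.getLogger('cylc')

FAIL_MESSAGE_PREFIX = 'failed/'

_last_logged = {}  # job id -> last failure message logged

def log_polled_message(job_id, message):
    if message.startswith(FAIL_MESSAGE_PREFIX):
        if _last_logged.get(job_id) == message:
            return  # a poll repeated a failure we already logged
        _last_logged[job_id] = message
    LOG.critical(message)
```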
This does not apply to submit failure, because submit failure will always log a critical warning through the jobs-submit command:

```
WARNING - platform: mulberry - Could not connect to mymachine.
    * mymachine has been added to the list of unreachable hosts
    * remote-init will retry if another host is available.
ERROR - [jobs-submit cmd] (remote init)
    [jobs-submit ret_code] 1
CRITICAL - [1/bar/02:preparing] submission failed
INFO - [1/bar/02:preparing] => waiting
WARNING - [1/bar:waiting] retrying in PT1S (after 2024-07-22T14:51:37+01:00)
```
Test failures
@oliver-sanders & @MetRonnie : Sorry, but I made a total hash of rebasing this and may have introduced some regressions. :(
I think that I've been through all your comments, but please double check.
The first commit doesn't appear to be associated with your GH account?
> The first commit doesn't appear to be associated with your GH account?
I think that I've messed up the setup of NewSpiceVDI.
You'll either need to add a mailmap entry for the new setup, or use git rebase to amend the commit to change the author, otherwise the check-contributors action will fail when next run.
> You'll either need to add a mailmap entry for the new setup, or use `git rebase` to amend the commit to change the author, otherwise the check-contributors action will fail when next run.
Fixed - I'm not adding a metoffice.gov.uk email address to my history.
@hjoliver, swapping in you as a reviewer as you have also attempted this in #6401 and might want to compare approaches.
@wxtim - I made a duplicate attempt at this without realizing you'd already worked on it, sorry. I thought I'd found a small bug with no associated issue. #6401
My bad, but having done it we might as well compare approaches. Both work, but mine is simpler (a one-liner) and I think the result is more consistent between submission and execution failure - see example below.
So my feeling is, we should use my branch, but cherry-pick your integration test to it. Would you agree?
```
[scheduling]
    [[graph]]
        R1 = """
            a & b
        """
[runtime]
    [[a]]
        script = """
            cylc broadcast -n a -s "script = true" $CYLC_WORKFLOW_ID
            cylc broadcast -n b -s "platform = " $CYLC_WORKFLOW_ID
            false
        """
        execution retry delays = PT5S
    [[b]]
        platform = fake
        submission retry delays = PT5S
```
Log comparison (left me, right you):
@hjoliver The CRITICAL level is probably too much though? Surely WARNING is the right level?
> @hjoliver The CRITICAL level is probably too much though? Surely WARNING is the right level?
Perhaps, but @wxtim's approach still leaves submit-fail (with a retry) at CRITICAL - hence my consistency comment above. Why treat the two differently?
The level is arguable. I think it's OK to log the actual job or job submission failure as critical, but have the workflow then handle it automatically.
I think that the correct level is error.
@hjoliver - does your PR fall victim to any of Oliver's comments from https://github.com/cylc/cylc-flow/pull/6169#pullrequestreview-2141436538?
If execution/submission retry delays are configured, then execution/submission failures (respectively) are expected to occur. Therefore it is not a CRITICAL message to log. Only if the retries are exhausted should it be a CRITICAL level message?
> If execution/submission retry delays are configured, then execution/submission failures (respectively) are expected to occur. Therefore it is not a CRITICAL message to log. Only if the retries are exhausted should it be a CRITICAL level message?
I don't disagree with that, but it was kinda off-topic for this Issue - which is about not hiding the job failure from the log - unless we introduce a jarring inconsistency between the way submission and execution failures are logged.
But OK, if we want to kill two birds with one stone, let's look at unhiding the job failure AND changing the level of both kinds of failure at once, to maintain consistency...
I agree with @wxtim's assertion that the correct level (for both) is ERROR.
> let's look at unhiding the job failure AND changing the level of both kinds of failure at once, to maintain consistency...
In that case, I would go with your approach @wxtim - but with some tweaks:
- don't log failure messages at all in the early `_process_message_check` method (where the demotion to DEBUG is currently done)
- instead log for both retries and not-retries in `_process_message_failed`, where we already have the right code blocks:

```python
if retries:
    LOG.error(message)
    ...
else:
    LOG.critical(message)
    ...
```
- and do exactly the same for job submit failure, in `_process_message_submit_failed`
[UPDATE] Actually, it should be ERROR both ways:

```python
LOG.error(message)
if retries:
    ...
else:
    ...
```
(a job error is always an error, but whether it's critical or not depends on the graph... if it is, it will cause a stall that does get logged as CRITICAL)
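Putting the pieces of the proposal together, a sketch of how the failure path might look; the names here (`process_message_failed`, `itask`, `retries_remaining`, `retry_delay`) are stand-ins for whatever the real cylc-flow methods have in scope, not the actual signatures:

```python
import logging

LOG = logging.getLogger('cylc')

# Stand-in names throughout; not the real cylc-flow method signature.
def process_message_failed(itask, message, retries_remaining, retry_delay):
    # a job error is always an error, retries or not
    LOG.error(f'[{itask}] {message}')
    if retries_remaining:
        # the retry itself is only a warning, and there is no need to
        # repeat the failed/ERR text (see the log comparison further down)
        LOG.warning(f'[{itask}] retrying in {retry_delay}')
    else:
        # retries exhausted: the task stays failed; an unhandled failure
        # surfaces as a stall, which is logged at CRITICAL elsewhere
        pass
```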
@wxtim - is this ready to go again, in light of the above discussion?
> @wxtim - is this ready to go again, in light of the above discussion?
In my head, yes, but I see that there are a load of test failures. Will draft now and undraft when ready.
These test failures were caused by a slight change in the nature of the message that resulted from moving it: by the time the code reaches `_process_message_failed`, some of the info required for the original error messages (the state) is not available, or has changed.
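One way to sidestep that, sketched here with hypothetical names (`check_message`, `handle_failed`, `state_at_receipt`): capture the state at the point where it is still the one the message refers to, and pass it through rather than re-reading it after the task has transitioned:

```python
import logging

LOG = logging.getLogger('cylc')

# Hypothetical names: capture the state while it is still the one the
# message refers to, because by the time the failure handler runs the
# task may already have transitioned (e.g. to waiting, ahead of a retry).
def check_message(itask, message):
    state_at_receipt = itask.state  # pre-transition state
    return message, state_at_receipt

def handle_failed(itask, message, state_at_receipt):
    # use the captured state, not itask.state, which may have moved on
    LOG.error(f'[{itask}:{state_at_receipt}] {message}')
```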
@wxtim this seems to have diverged a bit from what I thought was agreed above, which was:
- log the job (or submit) failure as ERROR whether there are retries or not
- then log the retry message at WARNING level, if there are retries
Now, if there are retries we only get the retry warning. (Which I think is back to the problem we were trying to fix here, although the logging location has changed to the methods that would support the fix).
Here's a screenshot comparing two execution failures (one retry) with two submission failures (one retry):
Contrary to @MetRonnie's suggestion (I think) you just need to get rid of the failed/ERR bit in the execution retry message. That information is already logged as the error. The warning is just about the resulting retry.
Or have "execution retrying in..." and "submission retrying in..."?
Or have "execution retrying in..." and "submission retrying in..."?
Not really necessary because both entail a full "retry" (i.e., submit the job, then execute it).
Still waiting on my suggestion above, I think. In case it got a bit lost in the log lines:
```
# submission retry (good)
ERROR - [1/foo/01:preparing] submission failed
INFO - [1/foo/01:preparing] => waiting
WARNING - [1/foo:waiting] retrying in PT5S (after 2025-01-09T11:08:20+13:00)

# execution retry (needs tweak)
ERROR - [1/foo/01:running] failed/ERR
INFO - [1/foo/01:running] => waiting
WARNING - [1/foo:waiting] failed/ERR - retrying in PT5S (after 2025-01-09T11:05:49+13:00)
```
(Pretty minor, but there's no need to double-log the failed/ERR bit).
> Still waiting on my suggestion https://github.com/cylc/cylc-flow/pull/6169#issuecomment-2576592384, I think. In case it got a bit lost in the log lines:
Done, but not pushed, because I was also looking at Ronnie's comments about the tests.
This PR has stalled, what needs to be done?
> This PR has stalled, what needs to be done?
I thought I'd responded to Hilary's comments. Will double check that I have and poke him. Also rebase.
@wxtim - note I pushed a v small change to your branch, to enable a final review (all the tests were failing due to an extra unused method arg) - https://github.com/cylc/cylc-flow/pull/6169/commits/4cb02e43140b1b7969fb4a3e504ef3d1ca151926
This can go to 8.4.1 IMO - a quick review at this point, in spite of the rather long history!