cylc-flow
Cleaner log messages.
At INFO level, we should just log the timestamped (optionally: #3646) sequence of events in the simplest way possible, unless something goes wrong. There's too much usually-irrelevant crap in there.
E.g. (with two timestamps removed):
INFO - [zz.1] status=running: (received)succeeded for job(01)
This should be something like:
INFO - zz.1 (job 01) succeeded
No need to log current status unless it is not the expected status.
And no need to log "received" either, just modify slightly for the polled case:
INFO - zz.1 (job 01) polled succeeded
Pull requests welcome!
With timestamps, these log lines are really verbose:
2020-05-29T12:24:34+12:00 INFO - [y.1] status=running: (received)succeeded at 2020-05-29T12:24:33+12:00 for job(02)
We could at least put the second timestamp ("received at"), which is normally the least relevant bit of info, at the end of the line:
2020-05-29T12:24:34+12:00 INFO - foo.1 (job 01) succeeded at 2020-05-29T12:24:33+12:00
And maybe even omit it if it is sufficiently similar to the log timestamp (within a few seconds, say):
2020-05-29T12:24:34+12:00 INFO - foo.1 (job 01) succeeded
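For illustration, a minimal sketch (not cylc-flow code) of the "omit the event time when it's close to the log time" idea; the helper name and the five-second tolerance are assumptions:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helper: only append the event time when it differs enough
# from the timestamp the log line will carry anyway.
def event_time_suffix(event_time: datetime, log_time: datetime,
                      tolerance: timedelta = timedelta(seconds=5)) -> str:
    """Return ' at <event_time>' only if it differs enough from log_time."""
    if abs(event_time - log_time) <= tolerance:
        return ''
    return f' at {event_time.isoformat(timespec="seconds")}'

# Example: the event arrived one second before the log call, so no suffix.
now = datetime(2020, 5, 29, 12, 24, 34, tzinfo=timezone(timedelta(hours=12)))
received = now - timedelta(seconds=1)
print(f"foo.1 (job 01) succeeded{event_time_suffix(received, now)}")
# -> "foo.1 (job 01) succeeded"
```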
We should definitely clean up log messages. The only real barrier to doing so has been the functional tests, which make exact comparisons to log messages and often do nasty things like `awk` and `sed`, which makes the tests hard to fix.
Here are my thoughts on cleaning the log files:
1) State change messages
- remove time information
- It's already in the timestamp
- It's also recorded in the database
- Remove previous state
- Unless changed unexpectedly
- Remove task communication method
- Unless state changed due to poll and task communication method != poll
- Use the same `#NN` job number format as we use in the GUI & TUI (these rules are sketched in the example after this item).
- 2020-05-29T12:24:34+12:00 INFO - [foo.1] status=running: (received)succeeded at 2020-05-29T12:24:33+12:00 for job(01)
+ 2020-05-29T12:24:34+12:00 INFO - foo.1 #01 succeeded
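To make the rules above concrete, a hypothetical formatter along these lines (the function name and signature are invented for illustration, not the actual cylc-flow implementation):

```python
# A minimal sketch of the message rules in (1): log "<task> #NN <event>",
# mention the communication method only for polled updates, and include the
# previous state only if it was not the expected one.
def state_change_msg(task_id: str, submit_num: int, event: str,
                     prev_state: str, expected_state: str,
                     polled: bool = False) -> str:
    parts = [f"{task_id} #{submit_num:02d}"]
    if polled:
        parts.append("polled")
    parts.append(event)
    if prev_state != expected_state:
        parts.append(f"(was {prev_state})")
    return " ".join(parts)

print(state_change_msg("foo.1", 1, "succeeded", "running", "running"))
# -> "foo.1 #01 succeeded"
print(state_change_msg("foo.1", 1, "succeeded", "submitted", "running", polled=True))
# -> "foo.1 #01 polled succeeded (was submitted)"
```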
2) Make submitting messages comprehensible
- Remove `<user>` from `user@host`
- We will no longer support this functionality and the user that Cylc reports might not be the same as the remote user anyway (due to SSH config).
- 2020-06-04T09:54:10Z INFO - [foo.1] -submit-num=01, owner@host=<user>@<host>
+ 2020-06-04T09:54:10Z INFO - foo.1 #01 submitting on <host>
3) Only log client commands if they are actually informational
- At the moment we log all client commands; this system is up for a re-write with the move from zmq endpoints to the GraphQL schema anyway.
- Only log mutations (a possible filter is sketched below)? We don't really care about:
- `put_messages`
- GraphQL queries (e.g. from the UI Server or Tui)
- 2020-06-04T10:14:21Z INFO - [client-command] put_messages <user>@<host>:cylc-message
- 2020-06-04T10:14:26Z INFO - [client-command] graphql <user>@<host>.local:cylc-tui
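One way this could be done is with a standard `logging.Filter` on the suite log handler; the set of "noisy" command names, and the `stop` mutation shown, are placeholders, not current cylc-flow behaviour:

```python
import logging

# Sketch only: drop "[client-command]" records for routine traffic.
class ClientCommandFilter(logging.Filter):
    NOISY = ('put_messages', 'graphql')  # queries / routine task messaging

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        if msg.startswith('[client-command]'):
            return not any(cmd in msg for cmd in self.NOISY)
        return True

handler = logging.StreamHandler()
handler.addFilter(ClientCommandFilter())
log = logging.getLogger('suite')
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info('[client-command] put_messages user@host:cylc-message')  # dropped
log.info('[client-command] stop user@host:cylc-stop')             # kept
```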
4) Consider demoting triggering items to DEBUG
- Users don't care about triggering.
- For the most part neither do we!
- 2020-06-04T10:18:15Z INFO - [foot.20191209T1200Z] -triggered off []
+ 2020-06-04T10:18:15Z DEBUG - foot.20191209T1200Z - triggered off []
5) Demote task failure messages from CRITICAL to INFO
- Failed is no more important than succeeded if the failure is handled correctly.
- If the failure is not handled we will get a suite stalled warning anyway.
- Remove the duplicate CRITICAL message whilst we are at it.
- 2020-06-04T09:54:10Z CRITICAL - [foo.1] status=running: (received)failed/EXIT at 2020-06-04T10:18:18Z for job(01)
- 2020-06-04T09:54:10Z CRITICAL - [foo.1] -job(01) failed
+ 2020-06-04T09:54:10Z INFO - foo.1 #01 failed (EXIT)
6) Consider demoting task messages to a new log level
- Create a new logging level between DEBUG and INFO for task messages.
- This will help to separate scheduler and task logging (see the sketch below).
- 2020-06-04T09:54:10Z INFO - [foo.1] status=running: (received)succeeded at 2020-05-29T12:24:33+12:00 for job(01)
+ 2020-06-04T09:54:10Z TASK - foo.1 #01 succeeded
- 2020-06-04T09:54:10Z INFO - [bar.1] status=running: (received)failed/EXIT at 2020-06-04T10:18:18Z for job(01)
+ 2020-06-04T09:54:10Z TASK - bar.1 #01 failed (EXIT)
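A new level like this is easy to add with the stdlib; a minimal sketch, assuming a value of 15 (between DEBUG=10 and INFO=20) and a `logger.task()` convenience method:

```python
import logging

# Sketch of the proposed TASK level; the numeric value and helper name are
# assumptions, not an agreed design.
TASK = 15
logging.addLevelName(TASK, 'TASK')

def _task(self, msg, *args, **kwargs):
    if self.isEnabledFor(TASK):
        self._log(TASK, msg, args, **kwargs)

logging.Logger.task = _task  # convenience method, like logger.info()

logging.basicConfig(level=TASK, format='%(asctime)s %(levelname)s - %(message)s')
logging.getLogger('suite').task('foo.1 #01 succeeded')
# e.g. "2020-06-04 09:54:10,123 TASK - foo.1 #01 succeeded"
```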
7) Consider tidying health check messages
- This information is quite useful for debugging flows when polling issues occur, so I'm cautious about removing it.
- Could the info be written somewhere else, e.g. the job activity log?
- 2020-06-04T10:16:36Z INFO - [foo.1] -health check settings: execution timeout=None
Or demote it:
- 2020-06-04T10:16:36Z INFO - [foo.1] -health check settings: execution timeout=None
+ 2020-06-04T10:16:36Z DEBUG - foo.1 health check settings: execution timeout=None
8) Change `DEBUG` to `DEBG`
- Mostly for OCD reasons - https://xkcd.com/276/
- Also readability (a one-liner for this is sketched below).
- 2020-06-04T10:45:45Z DEBUG - 4 task(s) de-queued
+ 2020-06-04T10:45:45Z DEBG - 4 task(s) de-queued
2020-06-04T10:45:45Z INFO - [foo_cold.20191209T0900Z] -submit-num=01, <owner>@<host>
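The rename itself is essentially free with the stdlib, e.g.:

```python
import logging

# Changing the displayed level name only alters what %(levelname)s renders
# for level 10; the numeric value and behaviour are unchanged.
logging.addLevelName(logging.DEBUG, 'DEBG')

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s - %(message)s')
logging.debug('4 task(s) de-queued')  # ... DEBG - 4 task(s) de-queued
logging.info('foo.1 #01 submitted')   # ... INFO - foo.1 #01 submitted
```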
Example
Before
2020-06-04T10:16:33Z INFO - Run: (re)start=0 log=1
2020-06-04T10:16:33Z INFO - Cylc version: 8.0a1
2020-06-04T10:16:33Z INFO - Run mode: live
2020-06-04T10:16:33Z INFO - Initial point: 20191209T0900Z
2020-06-04T10:16:33Z INFO - Final point: None
2020-06-04T10:16:33Z INFO - Cold Start 20191209T0900Z
2020-06-04T10:16:34Z INFO - [foo_cold.20191209T0900Z] -submit-num=01, owner@host=<host>
2020-06-04T10:16:34Z INFO - [foo.20191209T1200Z] -submit-num=01, owner@host=<host>
2020-06-04T10:16:34Z INFO - [foot.20191209T1200Z] -submit-num=01, owner@host=<host>
2020-06-04T10:16:34Z INFO - [fool.20191209T1200Z] -submit-num=01, owner@host=<host>
2020-06-04T10:16:34Z INFO - [foo_cold.20191209T0900Z] -triggered off []
2020-06-04T10:16:34Z INFO - [foo.20191209T1200Z] -triggered off []
2020-06-04T10:16:34Z INFO - [foot.20191209T1200Z] -triggered off []
2020-06-04T10:16:34Z INFO - [fool.20191209T1200Z] -triggered off []
2020-06-04T10:16:34Z INFO - [client-command] graphql <user>@<host>:cylc-tui
2020-06-04T10:16:35Z INFO - [foo.20191209T1200Z] status=ready: (internal)submitted at 2020-06-04T10:16:35Z for job(01)
2020-06-04T10:16:35Z INFO - [foo.20191209T1200Z] -health check settings: submission timeout=None
2020-06-04T10:16:35Z INFO - [foo_cold.20191209T0900Z] status=ready: (internal)submitted at 2020-06-04T10:16:35Z for job(01)
2020-06-04T10:16:35Z INFO - [foo_cold.20191209T0900Z] -health check settings: submission timeout=None
2020-06-04T10:16:35Z INFO - [fool.20191209T1200Z] status=ready: (internal)submitted at 2020-06-04T10:16:35Z for job(01)
After
2020-06-04T10:16:33Z INFO - Run: (re)start=0 log=1
2020-06-04T10:16:33Z INFO - Cylc version: 8.0a1
2020-06-04T10:16:33Z INFO - Run mode: live
2020-06-04T10:16:33Z INFO - Initial point: 20191209T0900Z
2020-06-04T10:16:33Z INFO - Final point: None
2020-06-04T10:16:33Z INFO - Cold Start 20191209T0900Z
2020-06-04T10:16:34Z TASK - foo_cold.20191209T0900Z #01 submitting on <host>
2020-06-04T10:16:34Z TASK - foo.20191209T1200Z #01 submitting on <host>
2020-06-04T10:16:34Z TASK - foot.20191209T1200Z #01 submitting on <host>
2020-06-04T10:16:34Z TASK - fool.20191209T1200Z #01 submitting on <host>
2020-06-04T10:16:34Z DEBG - foo_cold.20191209T0900Z - triggered off []
2020-06-04T10:16:34Z DEBG - foo.20191209T1200Z - triggered off []
2020-06-04T10:16:34Z DEBG - foot.20191209T1200Z - triggered off []
2020-06-04T10:16:34Z DEBG - fool.20191209T1200Z - triggered off []
2020-06-04T10:16:35Z TASK - foo.20191209T1200Z submitted
2020-06-04T10:16:35Z TASK - foo_cold.20191209T0900Z submitted
2020-06-04T10:16:35Z TASK - fool.20191209T1200Z submitted
@oliver-sanders, I agree with all suggestions above. I like the idea of using a TASK logging level for task messages, so long as they are not "demoted" in the sense that a deliberate action is required to turn that level on, because task state changes are pretty much the whole purpose of the scheduler.
Also, not very difficult to implement. It will break a bunch of functional tests, but they'll be relatively easy to fix. I'll optimistically attach a "small" label :grin:
Agreed, it is a small task, it's just that there are lots of loose ends to tidy up!
I like the idea of using a TASK logging level for task messages, so long as they are not "demoted"
When we create the suite loggers we would set the level to `TASK` so they would be visible by default.
A curious side-effect would be the ability to create pure suite or task logs.
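For example (a sketch only, reusing the assumed TASK=15 level from the earlier example, with made-up file names), per-handler filters would give exactly that split:

```python
import logging

# Sketch of the "pure suite or task logs" side-effect.
TASK = 15
logging.addLevelName(TASK, 'TASK')

log = logging.getLogger('suite')
log.setLevel(TASK)  # TASK and above visible by default

task_only = logging.FileHandler('task.log')
task_only.addFilter(lambda record: record.levelno == TASK)

suite_only = logging.FileHandler('suite.log')
suite_only.addFilter(lambda record: record.levelno != TASK)

log.addHandler(task_only)
log.addHandler(suite_only)

log.log(TASK, 'foo.1 #01 succeeded')  # -> task.log only
log.info('Final point: None')         # -> suite.log only
```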
Oh, and another one:
9) Remove the debug line which accompanies state changes.
- Duplicate information, not needed
- 2020-06-11T15:24:27+01:00 INFO - [foo.1] status=ready: (internal)submitted at 2020-06-11T15:24:26+01:00 for job(01)
- 2020-06-11T15:24:27+01:00 DEBUG - [foo.1] -ready => submitted
+ 2020-06-11T15:24:27+01:00 TASK - foo.1 #01 submitted
Post SoD I'll add another one:
10) Remove flow labels if only one flow is active.
- Reflow is a niche feature which most users won't understand or use.
- When there is only one active flow label there is no need to log it with every task event.
- We should log when flow labels are created, stopped or merged.
- 2020-06-11T15:24:27+01:00 INFO - [foo.1] status=ready: (internal)submitted at 2020-06-11T15:24:26+01:00 for job(01) flow(X)
+ 2020-06-11T15:24:27+01:00 TASK - foo.1 #01 submitted
- Remove flow labels if only one flow is active.
So long as we're careful with the definition of "active" here. If we hold the main flow, then trigger a reflow, then resume the main flow later after the reflow is done, it should not look like (in the log) it was all the same flow because there was only ever one of them "active" at a time.
So long as we're careful with the definition of "active" here
Would have thought that `FlowLabelMgr.inuse` should be sufficient?
Yes, that will be sufficient :+1: (Just saying, "active" has to mean "there are tasks present with this flow label" - but those tasks don't themselves have to be "active"!)
A new logging level between `INFO` and `DEBUG` would be useful for `cylc clean -v`. That would tell you exactly what files are being deleted without any of the other debug level stuff. This level would be more general than `TASK`; not sure what it would be called. Maybe `VERBOSE`, or `NOTE`? It might also be nice to have it in the same grey colour as debug messages. (Edit: actually `NOTE` sounds to me like something that would go in between `INFO` and `WARNING`.)
(In case anyone plans to work on this in the near future, note there are some cosmetic changes to task proxy logging already coming in #4300).
Tagging this against 8.0.0, it would be good to tackle some of this during the 8.0.0 stabilisation period.
I'll assign myself - I'd intended to poke this anyways
Thoughts on updates to @oliver-sanders' proposal in light of looking at the current code:
1) State change messages
- remove time information
- It's already in the timestamp
- It's also recorded in the database
- Remove previous state
- Unless changed unexpectedly
- Remove task communication method
- Unless state changed due to poll and task communication method != poll
- Use the same `#NN` job number format as we use in the GUI & TUI.
- Change the task proxy `__str__` method (sketched after this item):
- To reflect the universal ID if it hasn't been done.
- To reflect the changeability of components in a similar way to ISODatetime, i.e. Flow, ID (multiple tasks in a flow), submit (multiple submits per reflow), state (multiple states per submit).
- [foo.1 job:01 flows:1 preparing] (internal)submitted 2020-05-29T12:24:34+12:00
+ 2020-05-29T12:24:34+12:00 INFO - [flow:1 1/foo #01 submitted]
- This should be accompanied by a slide in the course about interpreting this.
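As a purely illustrative sketch of that `__str__` (hypothetical class and attribute names, not the real TaskProxy), with the components ordered roughly by how often they change:

```python
# Matches the "[flow:1 1/foo #01 submitted]" format proposed above.
class TaskProxySketch:
    def __init__(self, flow, point, name, submit_num, state):
        self.flow = flow
        self.point = point
        self.name = name
        self.submit_num = submit_num
        self.state = state

    def __str__(self):
        return (f"[flow:{self.flow} {self.point}/{self.name}"
                f" #{self.submit_num:02d} {self.state}]")

print(TaskProxySketch(1, 1, 'foo', 1, 'submitted'))
# -> [flow:1 1/foo #01 submitted]
```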
2) Make submitting messages comprehensible
- Remove `<user>` from `user@host`
- We will no longer support this functionality and the user that Cylc reports might not be the same as the remote user anyway (due to SSH config).
- 2020-06-04T09:54:10Z INFO - [foo.1] -submit-num=01, owner@host=<user>@<host>
+ 2020-06-04T09:54:10Z INFO - [flow:1 foo.1 #01] submitting on <host>
5) Demote task failure messages from CRITICAL to INFO
(@wxtim votes for WARNING)
- Failed is no more important than succeeded if the failure is handled correctly.
- If the failure is not handled we will get a suite stalled warning anyway.
- Remove the duplicate CRITICAL message whilst we are at it.
- 2020-06-04T09:54:10Z CRITICAL - [foo.1] status=running: (received)failed/EXIT at 2020-06-04T10:18:18Z for job(01)
- 2020-06-04T09:54:10Z CRITICAL - [foo.1] -job(01) failed
+ 2020-06-04T09:54:10Z INFO - foo.1 #01 failed (EXIT)
Ideas:
- Don't log the flow numbers if no additional flows have been created (useless info).
- Don't log the task state inside the [square brackets] (duplicate info).
- Log task failures to INFO because if any compulsory outputs are not satisfied they will be logged separately.
- Post Universal ID, consider using the full Job ID, e.g. `[cycle/task/job]` rather than `[task.cycle #job]`.
- The timestamp on state changes is only useful in the event of polling (when the event time may differ from the log time).
Updated terminology (for logging of failures):
Failed is no more important than succeeded if the failure is handled correctly.
- ... if failure is expected or success is optional
If the failure is not handled we will get a suite stalled warning anyway
- ... if success is expected we will get an incomplete output warning on failure
Ideas:
- Don't log the flow numbers if no additional flows have been created (useless info).
True if the only flow number in the whole run is `1`, but as soon as another flow gets created we need to start logging it even if only one flow is present, because two flows can cover the same parts of the graph at entirely separate wall-clock times. So I think it's best to log it all the time for consistency, which makes log parsing easier. So let's just make it really concise.
- Don't log the task state inside the [square brackets] (duplicate info).
It's not really duplicate info, because scanning back up the log to find the most recent previous state change is painful. And at state transitions we log `[current state] => new state`. However, I think we can drop it on the basis that the state can generally be inferred from the logged event.
- Log task failures to INFO because if any compulsory outputs are not satisfied they will be logged separately
:+1:
- Post Universal ID consider using the full Job ID e.g. `[cycle/task/job]` rather than `[task.cycle #job]`.
:+1:
- The timestamp on state changes is only useful in the event of polling (when the event time may differ from the log time).
:+1:
flow number logging proposal
On current master, flow numbers are not part of the task/job ID so should probably go outside of the square brackets.
[cycle/task/job] (flows) MESSAGE # e.g. [2025/foo/01] (1) or [2025/foo/01] (1,2,4)
Note `(1)` takes up almost no space, so we might as well log it for consistency even if no other flow was started. (We can document what it means, of course, just like job submit number etc.)
Note https://github.com/cylc/cylc-flow/issues/4419#issuecomment-999181650 may have an impact on flow number logging (and on task ID).
I think a good compromise would be to make the original flow (1) implicit and only log the additional flows that the user has manually created.
(obviously the combined flow "1,2" would need to be logged).
This keeps things simple for simple cases (the overwhelming majority) and displays only the modifications that the user has created.
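Something like this (a sketch of the proposed rule, not existing cylc-flow behaviour; the helper name is made up):

```python
# Omit the flow label when only the original flow (1) exists; otherwise log
# every flow number the task belongs to.
def flow_suffix(flow_nums: set) -> str:
    if flow_nums == {1}:
        return ''
    return ' (' + ','.join(str(n) for n in sorted(flow_nums)) + ')'

print('[2025/foo/01]' + flow_suffix({1}))      # [2025/foo/01]
print('[2025/foo/01]' + flow_suffix({1, 2}))   # [2025/foo/01] (1,2)
print('[2025/foo/01]' + flow_suffix({2}))      # [2025/foo/01] (2)
```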
We may be able to shim the flow number into the UID to make this more standard; note this would move the flow inside the square brackets, something like:
[^flow/cycle/task/job]
[^2/1234/foo/01]
# or
[flow^cycle/task/job]
[2^1234/foo/01]
I think a good compromise would be to make the original flow (1) implicit and only log the additional flows that the user has manually created. This keeps things simple for simple cases (the overwhelming majority) and displays only the modifications that the user has created.
Could do but I'm not convinced. The same argument could (more or less) be made for job submit number, which rarely goes beyond `01`. There are advantages to consistent logging at all times (esp. for log parsing).
But if we do make flow 1 implicit:
(obviously the combined flow "1,2" would need to be logged).
Also, single non-`1` flows would need to be logged: `(2)`
... otherwise there's potential for confusing log analysis, since flows 1 and 2 could overlap in the graph even if they don't exist at the same time.
- Demote task failure messages from CRITICAL to INFO: Failed is no more important than succeeded if the failure is handled correctly.
Can we go for warning? I'm not sure I think that every path is equal, and would like to give this more weight...
I'd say, failure is no more important than success for scheduling (for the graph), if handled.
But for logging purposes failure should still be highlighted. It shows, by definition, that a task failed to do what it is supposed to do, and even if the failure is handled we might still want to know that handling was required. E.g. if an occasional failure becomes a regular event maybe the failing task should be modified rather than "handled".
- [ ] State change messages (#5081)
- [x] Make submitting messages comprehensible
- [x] Remove `<user>` from `user@host`
- [x] Only log client commands if they are actually informational (#4522)
- [x] Consider demoting triggering items to DEBUG #4578
- [ ] Demote task failure messages from CRITICAL to INFO
- [ ] Consider demoting task messages to a new log level
- [ ] Consider tidying health check messages
- [ ] Change `DEBUG` to `DEBG`
[ ] Demote task failure messages from CRITICAL to INFO
In light of my comment above, this needs arguing. IMO it stays CRITICAL or at least WARNING in the log.
If we shorten `DEBUG` I would prefer `DBUG`.
This pattern of log messages is somewhat confusing:
INFO - [1/a:waiting] => waiting(queued)
INFO - [1/a:waiting(queued)] => waiting
INFO - [1/a:waiting] => preparing
This means that the task went through the queue and was then prepared for submission. But the `=> waiting` makes it look like it went back to the unqueued state. This is simply reflecting what the internal implementation does. But we should be able to go straight from `waiting(queued)` to `preparing` without going back to waiting in the meantime.
Agreed, that's annoying, but it might be somewhat tricky to solve because tasks don't necessarily go direct from queued to preparing. A task waiting on other prerequisites, or xtriggers, may not be queued.
It might be easier, and still an improvement, to skip the initial plain waiting message:
INFO - [1/a:waiting(queued)] => waiting
INFO - [1/a:waiting] => preparing
... because (as I recall) all tasks go to queued at first, and are then immediately released to waiting if not currently queue-limited.
[from offline chat] it would be really nice to have at least one "verbose" logging level between INFO and DEBUG.
[This was already suggested somewhere above on this issue by @MetRonnie so here's my take on it].
Note the 3rd party package: https://pypi.org/project/verboselogs and https://github.com/xolox/python-verboselogs. It is a trivial extension of Python logging, which we could easily do ourselves to avoid the dependency.
Then I'd suggest something like this (a minimal sketch follows the list):
- INFO (for users): minimal trace of normal task events and state changes
- VERBOSE (for users): plus additional informative details such as tasks entering and exiting the scheduler task pool, "health" messages detailing next polling times, etc.
- DEBUG (for developers): plus the extreme verbosity, line numbers, and tracebacks, needed for literally debugging the scheduler (which I find pretty painful to read under most circumstances - it shouldn't be needed for just following events in more detail)
- (WARNING and CRITICAL as now)
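To show how little is needed without the dependency, a minimal home-grown sketch: the numeric value 15, the `logger.verbose()` helper name, and the `-v`/`-vv` mapping are all assumptions (and would need reconciling with any TASK level if both were adopted):

```python
import logging

# A VERBOSE level between INFO (20) and DEBUG (10), verboselogs-style.
VERBOSE = 15
logging.addLevelName(VERBOSE, 'VERBOSE')

def _verbose(self, msg, *args, **kwargs):
    if self.isEnabledFor(VERBOSE):
        self._log(VERBOSE, msg, args, **kwargs)

logging.Logger.verbose = _verbose

def level_from_verbosity(verbosity: int) -> int:
    """0 -> INFO, 1 (-v) -> VERBOSE, 2+ (-vv) -> DEBUG."""
    return {0: logging.INFO, 1: VERBOSE}.get(verbosity, logging.DEBUG)

logging.basicConfig(level=level_from_verbosity(1),
                    format='%(levelname)s - %(message)s')
logging.getLogger('scheduler').verbose('removing run directory contents ...')
# shown with -v, hidden at the default INFO level
```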