cylc-flow
Cleaner log messages.
At INFO level, we should just log the timestamped (optionally: #3646) sequence of events in the simplest way possible, unless something goes wrong. There's too much usually-irrelevant crap in there.
E.g. (with two timestamps removed):
INFO - [zz.1] status=running: (received)succeeded for job(01)
This should be something like:
INFO - zz.1 (job 01) succeeded
No need to log current status unless it is not the expected status.
And no need to log "received" either, just modify slightly for the polled case:
INFO - zz.1 (job 01) polled succeeded
Pull requests welcome!
With timestamps, these log lines are really verbose:
2020-05-29T12:24:34+12:00 INFO - [y.1] status=running: (received)succeeded at 2020-05-29T12:24:33+12:00 for job(02)
We could at least put the second timestamp ("received at"), which is normally the least relevant bit of info, at the end of the line:
2020-05-29T12:24:34+12:00 INFO - foo.1 (job 01) succeeded at 2020-05-29T12:24:33+12:00
And maybe even omit it if it is sufficiently similar to the log timestamp (within a few seconds, say):
2020-05-29T12:24:34+12:00 INFO - foo.1 (job 01) succeeded
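For illustration, a minimal sketch (not cylc-flow code) of the "omit the event time when it's close to the log time" idea; the helper name and the five-second tolerance are assumptions:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helper: only append the event time when it differs enough
# from the timestamp the log line will carry anyway.
def event_time_suffix(event_time: datetime, log_time: datetime,
                      tolerance: timedelta = timedelta(seconds=5)) -> str:
    """Return ' at <event_time>' only if it differs enough from log_time."""
    if abs(event_time - log_time) <= tolerance:
        return ''
    return f' at {event_time.isoformat(timespec="seconds")}'

# Example: the event arrived one second before the log call, so no suffix.
now = datetime(2020, 5, 29, 12, 24, 34, tzinfo=timezone(timedelta(hours=12)))
received = now - timedelta(seconds=1)
print(f"foo.1 (job 01) succeeded{event_time_suffix(received, now)}")
# -> "foo.1 (job 01) succeeded"
```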
We should definitely clean up log messages. The only real barrier to doing so has been the functional tests, which make exact comparisons to log messages and often do nasty things like `awk` and `sed`, which makes the tests hard to fix.
Here are my thoughts on cleaning the log files:
1) State change messages
- remove time information
- It's already in the timestamp
- It's also recorded in the database
- Remove previous state
- Unless changed unexpectedly
- Remove task communication method
- Unless state changed due to poll and task communication method != poll
- Use the same `#NN` job number format as we use in the GUI & TUI (these rules are sketched in the example after this item).
- 2020-05-29T12:24:34+12:00 INFO - [foo.1] status=running: (received)succeeded at 2020-05-29T12:24:33+12:00 for job(01)
+ 2020-05-29T12:24:34+12:00 INFO - foo.1 #01 succeeded
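To make the rules above concrete, a hypothetical formatter along these lines (the function name and signature are invented for illustration, not the actual cylc-flow implementation):

```python
# A minimal sketch of the message rules in (1): log "<task> #NN <event>",
# mention the communication method only for polled updates, and include the
# previous state only if it was not the expected one.
def state_change_msg(task_id: str, submit_num: int, event: str,
                     prev_state: str, expected_state: str,
                     polled: bool = False) -> str:
    parts = [f"{task_id} #{submit_num:02d}"]
    if polled:
        parts.append("polled")
    parts.append(event)
    if prev_state != expected_state:
        parts.append(f"(was {prev_state})")
    return " ".join(parts)

print(state_change_msg("foo.1", 1, "succeeded", "running", "running"))
# -> "foo.1 #01 succeeded"
print(state_change_msg("foo.1", 1, "succeeded", "submitted", "running", polled=True))
# -> "foo.1 #01 polled succeeded (was submitted)"
```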
2) Make submitting messages comprehensible
- Remove `<user>` from `user@host`
- We will no longer support this functionality and the user that Cylc reports might not be the same as the remote user anyway (due to SSH config).
- 2020-06-04T09:54:10Z INFO - [foo.1] -submit-num=01, owner@host=<user>@<host>
+ 2020-06-04T09:54:10Z INFO - foo.1 #01 submitting on <host>
3) Only log client commands if they are actually informational
- At the moment we log all client commands; this system is up for a re-write with the move from zmq endpoints to the GraphQL schema anyway.
- Only log mutations (a possible filter is sketched below)? We don't really care about:
- `put_messages`
- GraphQL queries (e.g. from the UI Server or Tui)
- 2020-06-04T10:14:21Z INFO - [client-command] put_messages <user>@<host>:cylc-message
- 2020-06-04T10:14:26Z INFO - [client-command] graphql <user>@<host>.local:cylc-tui
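One way this could be done is with a standard `logging.Filter` on the suite log handler; the set of "noisy" command names, and the `stop` mutation shown, are placeholders, not current cylc-flow behaviour:

```python
import logging

# Sketch only: drop "[client-command]" records for routine traffic.
class ClientCommandFilter(logging.Filter):
    NOISY = ('put_messages', 'graphql')  # queries / routine task messaging

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        if msg.startswith('[client-command]'):
            return not any(cmd in msg for cmd in self.NOISY)
        return True

handler = logging.StreamHandler()
handler.addFilter(ClientCommandFilter())
log = logging.getLogger('suite')
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info('[client-command] put_messages user@host:cylc-message')  # dropped
log.info('[client-command] stop user@host:cylc-stop')             # kept
```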
4) Consider demoting triggering items to DEBUG
- Users don't care about triggering.
- For the most part neither do we!
- 2020-06-04T10:18:15Z INFO - [foot.20191209T1200Z] -triggered off []
+ 2020-06-04T10:18:15Z DEBUG - foot.20191209T1200Z - triggered off []
5) Demote task failure messages from CRITICAL to INFO
- Failed is no more important than succeeded if the failure is handled correctly.
- If the failure is not handled we will get a suite stalled warning anyway.
- Remove the duplicate CRITICAL message whilst we are at it.
- 2020-06-04T09:54:10Z CRITICAL - [foo.1] status=running: (received)failed/EXIT at 2020-06-04T10:18:18Z for job(01)
- 2020-06-04T09:54:10Z CRITICAL - [foo.1] -job(01) failed
+ 2020-06-04T09:54:10Z INFO - foo.1 #01 failed (EXIT)
6) Consider demoting task messages to a new log level
- Create a new logging level between DEBUG and INFO for task messages.
- This will help to separate scheduler and task logging (see the sketch below).
- 2020-06-04T09:54:10Z INFO - [foo.1] status=running: (received)succeeded at 2020-05-29T12:24:33+12:00 for job(01)
+ 2020-06-04T09:54:10Z TASK - foo.1 #01 succeeded
- 2020-06-04T09:54:10Z INFO - [bar.1] status=running: (received)failed/EXIT at 2020-06-04T10:18:18Z for job(01)
+ 2020-06-04T09:54:10Z TASK - bar.1 #01 failed (EXIT)
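A new level like this is easy to add with the stdlib; a minimal sketch, assuming a value of 15 (between DEBUG=10 and INFO=20) and a `logger.task()` convenience method:

```python
import logging

# Sketch of the proposed TASK level; the numeric value and helper name are
# assumptions, not an agreed design.
TASK = 15
logging.addLevelName(TASK, 'TASK')

def _task(self, msg, *args, **kwargs):
    if self.isEnabledFor(TASK):
        self._log(TASK, msg, args, **kwargs)

logging.Logger.task = _task  # convenience method, like logger.info()

logging.basicConfig(level=TASK, format='%(asctime)s %(levelname)s - %(message)s')
logging.getLogger('suite').task('foo.1 #01 succeeded')
# e.g. "2020-06-04 09:54:10,123 TASK - foo.1 #01 succeeded"
```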
7) Consider tidying health check messages
- This information is quite useful for debugging flows when polling issues occur, so I'm cautious about removing it.
- Could the info be written somewhere else, e.g. the job activity log?
- 2020-06-04T10:16:36Z INFO - [foo.1] -health check settings: execution timeout=None
Or demote it:
- 2020-06-04T10:16:36Z INFO - [foo.1] -health check settings: execution timeout=None
+ 2020-06-04T10:16:36Z DEBUG - foo.1 health check settings: execution timeout=None
8) Change `DEBUG` to `DEBG`
- Mostly for OCD reasons - https://xkcd.com/276/
- Also readability (a one-liner for this is sketched below).
- 2020-06-04T10:45:45Z DEBUG - 4 task(s) de-queued
+ 2020-06-04T10:45:45Z DEBG - 4 task(s) de-queued
2020-06-04T10:45:45Z INFO - [foo_cold.20191209T0900Z] -submit-num=01, <owner>@<host>
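The rename itself is essentially free with the stdlib, e.g.:

```python
import logging

# Changing the displayed level name only alters what %(levelname)s renders
# for level 10; the numeric value and behaviour are unchanged.
logging.addLevelName(logging.DEBUG, 'DEBG')

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s - %(message)s')
logging.debug('4 task(s) de-queued')  # ... DEBG - 4 task(s) de-queued
logging.info('foo.1 #01 submitted')   # ... INFO - foo.1 #01 submitted
```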
Example
Before
2020-06-04T10:16:33Z INFO - Run: (re)start=0 log=1
2020-06-04T10:16:33Z INFO - Cylc version: 8.0a1
2020-06-04T10:16:33Z INFO - Run mode: live
2020-06-04T10:16:33Z INFO - Initial point: 20191209T0900Z
2020-06-04T10:16:33Z INFO - Final point: None
2020-06-04T10:16:33Z INFO - Cold Start 20191209T0900Z
2020-06-04T10:16:34Z INFO - [foo_cold.20191209T0900Z] -submit-num=01, owner@host=<host>
2020-06-04T10:16:34Z INFO - [foo.20191209T1200Z] -submit-num=01, owner@host=<host>
2020-06-04T10:16:34Z INFO - [foot.20191209T1200Z] -submit-num=01, owner@host=<host>
2020-06-04T10:16:34Z INFO - [fool.20191209T1200Z] -submit-num=01, owner@host=<host>
2020-06-04T10:16:34Z INFO - [foo_cold.20191209T0900Z] -triggered off []
2020-06-04T10:16:34Z INFO - [foo.20191209T1200Z] -triggered off []
2020-06-04T10:16:34Z INFO - [foot.20191209T1200Z] -triggered off []
2020-06-04T10:16:34Z INFO - [fool.20191209T1200Z] -triggered off []
2020-06-04T10:16:34Z INFO - [client-command] graphql <user>@<host>:cylc-tui
2020-06-04T10:16:35Z INFO - [foo.20191209T1200Z] status=ready: (internal)submitted at 2020-06-04T10:16:35Z for job(01)
2020-06-04T10:16:35Z INFO - [foo.20191209T1200Z] -health check settings: submission timeout=None
2020-06-04T10:16:35Z INFO - [foo_cold.20191209T0900Z] status=ready: (internal)submitted at 2020-06-04T10:16:35Z for job(01)
2020-06-04T10:16:35Z INFO - [foo_cold.20191209T0900Z] -health check settings: submission timeout=None
2020-06-04T10:16:35Z INFO - [fool.20191209T1200Z] status=ready: (internal)submitted at 2020-06-04T10:16:35Z for job(01)
After
2020-06-04T10:16:33Z INFO - Run: (re)start=0 log=1
2020-06-04T10:16:33Z INFO - Cylc version: 8.0a1
2020-06-04T10:16:33Z INFO - Run mode: live
2020-06-04T10:16:33Z INFO - Initial point: 20191209T0900Z
2020-06-04T10:16:33Z INFO - Final point: None
2020-06-04T10:16:33Z INFO - Cold Start 20191209T0900Z
2020-06-04T10:16:34Z TASK - foo_cold.20191209T0900Z #01 submitting on <host>
2020-06-04T10:16:34Z TASK - foo.20191209T1200Z #01 submitting on <host>
2020-06-04T10:16:34Z TASK - foot.20191209T1200Z #01 submitting on <host>
2020-06-04T10:16:34Z TASK - fool.20191209T1200Z #01 submitting on <host>
2020-06-04T10:16:34Z DEBG - foo_cold.20191209T0900Z - triggered off []
2020-06-04T10:16:34Z DEBG - foo.20191209T1200Z - triggered off []
2020-06-04T10:16:34Z DEBG - foot.20191209T1200Z - triggered off []
2020-06-04T10:16:34Z DEBG - fool.20191209T1200Z - triggered off []
2020-06-04T10:16:35Z TASK - foo.20191209T1200Z submitted
2020-06-04T10:16:35Z TASK - foo_cold.20191209T0900Z submitted
2020-06-04T10:16:35Z TASK - fool.20191209T1200Z submitted
@oliver-sanders, I agree with all suggestions above. I like the idea of using a TASK logging level for task messages, so long as they are not "demoted" in the sense that a deliberate action is required to turn that level on, because task state changes are pretty much the whole purpose of the scheduler.
Also, not very difficult to implement. It will break a bunch of functional tests, but they'll be relatively easy to fix. I'll optimistically attach a "small" label :grin:
Agreed, it is a small task, it's just that there are lots of loose ends to tidy up!
I like the idea of using a TASK logging level for task messages, so long as they are not "demoted"
When we create the suite loggers we would set the level to `TASK` so they would be visible by default.
A curious side-effect would be the ability to create pure suite or task logs.
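For example (a sketch only, reusing the assumed TASK=15 level from the earlier example, with made-up file names), per-handler filters would give exactly that split:

```python
import logging

# Sketch of the "pure suite or task logs" side-effect.
TASK = 15
logging.addLevelName(TASK, 'TASK')

log = logging.getLogger('suite')
log.setLevel(TASK)  # TASK and above visible by default

task_only = logging.FileHandler('task.log')
task_only.addFilter(lambda record: record.levelno == TASK)

suite_only = logging.FileHandler('suite.log')
suite_only.addFilter(lambda record: record.levelno != TASK)

log.addHandler(task_only)
log.addHandler(suite_only)

log.log(TASK, 'foo.1 #01 succeeded')  # -> task.log only
log.info('Final point: None')         # -> suite.log only
```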
Oh, and another one:
9) Remove the debug line which accompanies state changes.
- Duplicate information, not needed
- 2020-06-11T15:24:27+01:00 INFO - [foo.1] status=ready: (internal)submitted at 2020-06-11T15:24:26+01:00 for job(01)
- 2020-06-11T15:24:27+01:00 DEBUG - [foo.1] -ready => submitted
+ 2020-06-11T15:24:27+01:00 TASK - foo.1 #01 submitted
Post SoD I'll add another one:
10) Remove flow labels if only one flow is active.
- Reflow is a niche feature which most users won't understand or use.
- When there is only one active flow label there is no need to log it with every task event.
- We should log when flow labels are created, stopped or merged.
- 2020-06-11T15:24:27+01:00 INFO - [foo.1] status=ready: (internal)submitted at 2020-06-11T15:24:26+01:00 for job(01) flow(X)
+ 2020-06-11T15:24:27+01:00 TASK - foo.1 #01 submitted
- Remove flow labels if only one flow is active.
So long as we're careful with the definition of "active" here. If we hold the main flow, then trigger a reflow, then resume the main flow later after the reflow is done, it should not look like (in the log) it was all the same flow because there was only ever one of them "active" at a time.
So long as we're careful with the definition of "active" here
Would have thought that `FlowLabelMgr.inuse` should be sufficient?
Yes, that will be sufficient :+1: (Just saying, "active" has to mean "there are tasks present with this flow label" - but those tasks don't themselves have to be "active"!)
A new logging level between `INFO` and `DEBUG` would be useful for `cylc clean -v`. That would tell you exactly what files are being deleted without any of the other debug level stuff. This level would be more general than `TASK`; not sure what it would be called. Maybe `VERBOSE`, or `NOTE`? It might also be nice to have it in the same grey colour as debug messages. (Edit: actually `NOTE` sounds to me like something that would go in between `INFO` and `WARNING`.)
(In case anyone plans to work on this in the near future, note there are some cosmetic changes to task proxy logging already coming in #4300).
Tagging this against 8.0.0, it would be good to tackle some of this during the 8.0.0 stabilisation period.
I'll assign myself - I'd intended to poke this anyways
Thoughts on updates to @oliver-sanders' proposal in light of looking at the current code:
1) State change messages
- remove time information
- It's already in the timestamp
- It's also recorded in the database
- Remove previous state
- Unless changed unexpectedly
- Remove task communication method
- Unless state changed due to poll and task communication method != poll
- Use the same `#NN` job number format as we use in the GUI & TUI.
- Change the task proxy `__str__` method (sketched after this item):
- To reflect the universal ID if it hasn't been done.
- To reflect the changeability of components in a similar way to ISODatetime, i.e. Flow, ID (multiple tasks in a flow), submit (multiple submits per reflow), state (multiple states per submit).
- [foo.1 job:01 flows:1 preparing] (internal)submitted 2020-05-29T12:24:34+12:00
+ 2020-05-29T12:24:34+12:00 INFO - [flow:1 1/foo #01 submitted]
- This should be accompanied by a slide in the course about interpreting this.
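As a purely illustrative sketch of that `__str__` (hypothetical class and attribute names, not the real TaskProxy), with the components ordered roughly by how often they change:

```python
# Matches the "[flow:1 1/foo #01 submitted]" format proposed above.
class TaskProxySketch:
    def __init__(self, flow, point, name, submit_num, state):
        self.flow = flow
        self.point = point
        self.name = name
        self.submit_num = submit_num
        self.state = state

    def __str__(self):
        return (f"[flow:{self.flow} {self.point}/{self.name}"
                f" #{self.submit_num:02d} {self.state}]")

print(TaskProxySketch(1, 1, 'foo', 1, 'submitted'))
# -> [flow:1 1/foo #01 submitted]
```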
2) Make submitting messages comprehensible
- Remove `<user>` from `user@host`
- We will no longer support this functionality and the user that Cylc reports might not be the same as the remote user anyway (due to SSH config).
- 2020-06-04T09:54:10Z INFO - [foo.1] -submit-num=01, owner@host=<user>@<host>
+ 2020-06-04T09:54:10Z INFO - [flow:1 foo.1 #01] submitting on <host>
5) Demote task failure messages from CRITICAL to INFO
(@wxtim votes for WARNING)
- Failed is no more important than succeeded if the failure is handled correctly.
- If the failure is not handled we will get a suite stalled warning anyway.
- Remove the duplicate CRITICAL message whilst we are at it.
- 2020-06-04T09:54:10Z CRITICAL - [foo.1] status=running: (received)failed/EXIT at 2020-06-04T10:18:18Z for job(01)
- 2020-06-04T09:54:10Z CRITICAL - [foo.1] -job(01) failed
+ 2020-06-04T09:54:10Z INFO - foo.1 #01 failed (EXIT)
Ideas:
- Don't log the flow numbers if no additional flows have been created (useless info).
- Don't log the task state inside the [square brackets] (duplicate info).
- Log task failures to INFO because if any compulsory outputs are not satisfied they will be logged separately.
- Post Universal ID, consider using the full Job ID, e.g. `[cycle/task/job]` rather than `[task.cycle #job]`.
- The timestamp on state changes is only useful in the event of polling (when the event time may differ from the log time).
Updated terminology (for logging of failures):
Failed is no more important than succeeded if the failure is handled correctly.
- ... if failure is expected or success is optional
If the failure is not handled we will get a suite stalled warning anyway
- ... if success is expected we will get an incomplete output warning on failure
Ideas:
- Don't log the flow numbers if no additional flows have been created (useless info).
True if the only flow number in the whole run is `1`, but as soon as another flow gets created we need to start logging it even if only one flow is present, because two flows can cover the same parts of the graph at entirely separate wall-clock times. So I think it's best to log it all the time for consistency, which makes log parsing easier. So let's just make it really concise.
- Don't log the task state inside the [square brackets] (duplicate info).
It's not really duplicate info, because scanning back up the log to find the most recent previous state change is painful. And at state transitions we log `[current state] => new state`. However, I think we can drop it on the basis that the state can generally be inferred from the logged event.
- Log task failures to INFO because if any compulsory outputs are not satisfied they will be logged separately
:+1:
- Post Universal ID consider using the full Job ID e.g. `[cycle/task/job]` rather than `[task.cycle #job]`.
:+1:
- The timestamp on state changes is only useful in the event of polling (when the event time may differ from the log time).
:+1:
flow number logging proposal
On current master, flow numbers are not part of the task/job ID so should probably go outside of the square brackets.
[cycle/task/job] (flows) MESSAGE # e.g. [2025/foo/01] (1) or [2025/foo/01] (1,2,4)
Note `(1)` takes up almost no space, so we might as well log it for consistency even if no other flow was started. (We can document what it means, of course, just like job submit number etc.)
Note https://github.com/cylc/cylc-flow/issues/4419#issuecomment-999181650 may have an impact on flow number logging (and on task ID).
I think a good compromise would be to make the original flow (1) implicit and only log the additional flows that the user has manually created.
(obviously the combined flow "1,2" would need to be logged).
This keeps things simple for simple cases (the overwhelming majority) and displays only the modifications that the user has created.
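Something like this (a sketch of the proposed rule, not existing cylc-flow behaviour; the helper name is made up):

```python
# Omit the flow label when only the original flow (1) exists; otherwise log
# every flow number the task belongs to.
def flow_suffix(flow_nums: set) -> str:
    if flow_nums == {1}:
        return ''
    return ' (' + ','.join(str(n) for n in sorted(flow_nums)) + ')'

print('[2025/foo/01]' + flow_suffix({1}))      # [2025/foo/01]
print('[2025/foo/01]' + flow_suffix({1, 2}))   # [2025/foo/01] (1,2)
print('[2025/foo/01]' + flow_suffix({2}))      # [2025/foo/01] (2)
```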
We may be able to shim the flow number into the UID to make this more standard; note this would move the flow inside the square brackets, something like:
[^flow/cycle/task/job]
[^2/1234/foo/01]
# or
[flow^cycle/task/job]
[2^1234/foo/01]
I think a good compromise would be to make the original flow (1) implicit and only log the additional flows that the user has manually created. This keeps things simple for simple cases (the overwhelming majority) and displays only the modifications that the user has created.
Could do but I'm not convinced. The same argument could (more or less) be made for job submit number, which rarely goes beyond `01`. There are advantages to consistent logging at all times (esp. for log parsing).
But if we do make flow 1 implicit:
(obviously the combined flow "1,2" would need to be logged).
Also, single non-`1` flows would need to be logged: `(2)`
... otherwise there's potential for confusing log analysis, since flows 1 and 2 could overlap in the graph even if they don't exist at the same time.
- Demote task failure messages from CRITICAL to INFO: Failed is no more important than succeeded if the failure is handled correctly.
Can we go for warning? I'm not sure I think that every path is equal, and would like to give this more weight...
I'd say, failure is no more important than success for scheduling (for the graph), if handled.
But for logging purposes failure should still be highlighted. It shows, by definition, that a task failed to do what it is supposed to do, and even if the failure is handled we might still want to know that handling was required. E.g. if an occasional failure becomes a regular event maybe the failing task should be modified rather than "handled".
- [ ] State change messages (#5081)
- [x] Make submitting messages comprehensible
- [x] Remove `<user>` from `user@host`
- [x] Only log client commands if they are actually informational (#4522)
- [x] Consider demoting triggering items to DEBUG #4578
- [ ] Demote task failure messages from CRITICAL to INFO
- [ ] Consider demoting task messages to a new log level
- [ ] Consider tidying health check messages
- [ ] Change `DEBUG` to `DEBG`
[ ] Demote task failure messages from CRITICAL to INFO
In light of my comment above, this needs arguing. IMO it stays CRITICAL or at least WARNING in the log.
If we shorten `DEBUG` I would prefer `DBUG`.
This pattern of log messages is somewhat confusing:
INFO - [1/a:waiting] => waiting(queued)
INFO - [1/a:waiting(queued)] => waiting
INFO - [1/a:waiting] => preparing
This means that the task went through the queue and was then prepared for submission. But the `=> waiting` makes it look like it went back to the unqueued state. This is simply reflecting what the internal implementation does. But we should be able to go straight from `waiting(queued)` to `preparing` without going back to waiting in the meantime.
Agreed, that's annoying, but it might be somewhat tricky to solve because tasks don't necessarily go direct from queued to preparing. A task waiting on other prerequisites, or xtriggers, may not be queued.
It might be easier, and still an improvement, to skip the initial plain waiting message:
INFO - [1/a:waiting(queued)] => waiting
INFO - [1/a:waiting] => preparing
... because (as I recall) all tasks go to queued at first, and are then immediately released to waiting if not currently queue-limited.
[from offline chat] it would be really nice to have at least one "verbose" logging level between INFO and DEBUG.
[This was already suggested somewhere above on this issue by @MetRonnie so here's my take on it].
Note the 3rd party package: https://pypi.org/project/verboselogs and https://github.com/xolox/python-verboselogs. It is a trivial extension of Python logging, which we could easily do ourselves to avoid the dependency.
Then I'd suggest something like this (a minimal sketch follows the list):
- INFO (for users): minimal trace of normal task events and state changes
- VERBOSE (for users): plus additional informative details such as tasks entering and exiting the scheduler task pool, "health" messages detailing next polling times, etc.
- DEBUG (for developers): plus the extreme verbosity, line numbers, and tracebacks, needed for literally debugging the scheduler (which I find pretty painful to read under most circumstances - it shouldn't be needed for just following events in more detail)
- (WARNING and CRITICAL as now)
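To show how little is needed without the dependency, a minimal home-grown sketch: the numeric value 15, the `logger.verbose()` helper name, and the `-v`/`-vv` mapping are all assumptions (and would need reconciling with any TASK level if both were adopted):

```python
import logging

# A VERBOSE level between INFO (20) and DEBUG (10), verboselogs-style.
VERBOSE = 15
logging.addLevelName(VERBOSE, 'VERBOSE')

def _verbose(self, msg, *args, **kwargs):
    if self.isEnabledFor(VERBOSE):
        self._log(VERBOSE, msg, args, **kwargs)

logging.Logger.verbose = _verbose

def level_from_verbosity(verbosity: int) -> int:
    """0 -> INFO, 1 (-v) -> VERBOSE, 2+ (-vv) -> DEBUG."""
    return {0: logging.INFO, 1: VERBOSE}.get(verbosity, logging.DEBUG)

logging.basicConfig(level=level_from_verbosity(1),
                    format='%(levelname)s - %(message)s')
logging.getLogger('scheduler').verbose('removing run directory contents ...')
# shown with -v, hidden at the default INFO level
```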