Log retrieval issues

Open dpmatthews opened this issue 5 years ago • 1 comments

Several users have reported issues associated with log retrieval (using Cylc 7.8.3 and previous versions). These relate to our HPC, which uses PBS. We configure:

    retrieve job logs retry delays = PT10S, PT30S, PT3M

This is to deal with the fact that there can be a considerable delay before the job log files (out and err) appear in the log directory.
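For reference, here is a rough sketch of how long that setting waits in total. It assumes each delay is applied in sequence before a retrieval attempt, which is my reading of the setting rather than something checked against the Cylc source. The helper `duration_seconds` is hypothetical and only handles the simple `PT…H…M…S` forms used here.

```python
import re
from itertools import accumulate

# Minimal ISO 8601 duration parser: hours, minutes and seconds only.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_seconds(text):
    """Convert a duration such as 'PT3M' to a number of seconds."""
    match = _DURATION.match(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


setting = "PT10S, PT30S, PT3M"
delays = [duration_seconds(item) for item in setting.split(",")]

# Assumption: attempts happen after each delay in turn, so the
# cumulative sums are the approximate attempt times after job exit.
attempt_offsets = list(accumulate(delays))

print(delays)           # [10, 30, 180]
print(attempt_offsets)  # [10, 40, 220]
```

Under that assumption the last attempt happens about 3 minutes 40 seconds after the job finishes, so any log file that PBS delivers later than that would never be retrieved.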

The issues are as follows:

  1. Files with the wrong permissions (600) or missing. In the example I saw, the out file had the wrong permissions and the err file was missing. My guess is that this can happen if PBS is partway through writing the log files when the retrieval starts. I think we use the existence of the out file to determine whether the retrieval can start. We need to gather evidence to confirm whether this is the cause, and investigate whether there is a better way to confirm that the logs are ready.

  2. Other missing log files. It is not clear whether issue 1 accounts for all the reports of missing log files. Another possibility is that the file is too big (we set retrieve job logs max size = 32M). It would help if we could record this in the job-activity.log when it happens (probably not easy, since I think the limit is implemented via an rsync option?).

  3. Log files not available from the GUI. This problem happens when a task fails and the user tries to access the out or err files but finds them unavailable and has to retry several times. Presumably, once the task fails, the GUI expects to find the log files locally rather than accessing the remote system. Ideally the GUI would continue to access the log files remotely until the log file retrieval has completed.
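As a starting point for issue 1, here is a minimal sketch of a stricter readiness test than "the out file exists". It is not how Cylc currently decides, and it would have to run on the job host, where the files are written. The function name `logs_ready`, the default file names `job.out` and `job.err`, and the `settle_seconds` parameter are all assumptions for illustration. The idea is to require that every expected file exists and that its size, modification time and permission bits stop changing over a short interval.

```python
import stat
import time
from pathlib import Path


def _snapshot(path):
    """Return (size, mtime, permission bits) for one file."""
    info = path.stat()
    return (info.st_size, info.st_mtime_ns, stat.S_IMODE(info.st_mode))


def logs_ready(job_log_dir, names=("job.out", "job.err"),
               settle_seconds=2.0, sleep=time.sleep):
    """Return True only if all expected logs exist and look settled.

    A file that is missing, still growing, or whose permissions are
    still being changed between the two samples counts as not ready.
    """
    paths = [Path(job_log_dir) / name for name in names]
    try:
        before = [_snapshot(p) for p in paths]
        sleep(settle_seconds)
        after = [_snapshot(p) for p in paths]
    except FileNotFoundError:
        # At least one expected file has not appeared yet.
        return False
    return before == after
```

A check like this cannot prove that PBS has finished, since a writer could simply pause for longer than `settle_seconds`, but it would catch the case described above where the out file is present while the err file is still missing.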

dpmatthews avatar Oct 29 '19 20:10 dpmatthews