WIP: Check size of input files to handle nfs sync issues
After a discussion with @critias, we opted for the following design, handling sync issues both between jobs and between tasks.
- When a task finishes, the `LoggingThread` writes the size and the mtime of all files below `work` and `output` to the `ressources` file (-> `Job._sis_get_file_stats`)
- When a task is set up, it calls `Task._wait_for_input_to_sync`. There the expected sizes are obtained by calling `Job._sis_get_expected_file_sizes(job_dir, task)` ...
  - for all jobs it depends upon, if it is the first task (then only the files listed as inputs are retained)
  - for the preceding task otherwise.
- The expected file sizes are then compared to the actual sizes.
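
To illustrate the bookkeeping described above, here is a minimal sketch of recording file stats and comparing expected to actual sizes. It is not the code from this PR: `collect_file_stats` and `find_size_mismatches` are hypothetical names, and the dictionary layout is an assumption made for the example.

```python
import os


def collect_file_stats(root):
    """Return {relative_path: (size_in_bytes, mtime)} for all files below root.

    Hypothetical helper, only meant to mirror the idea of recording
    size and mtime when a task finishes.
    """
    stats = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            stats[os.path.relpath(path, root)] = (st.st_size, st.st_mtime)
    return stats


def find_size_mismatches(root, expected_sizes):
    """Return {relative_path: (expected, actual)} for files whose size differs.

    A file that is not visible (yet) is reported with an actual size of None.
    """
    mismatches = {}
    for rel_path, expected in expected_sizes.items():
        try:
            actual = os.path.getsize(os.path.join(root, rel_path))
        except OSError:
            actual = None
        if actual != expected:
            mismatches[rel_path] = (expected, actual)
    return mismatches
```

An empty result from `find_size_mismatches` would mean that all expected files are visible with the recorded size on the current host.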
There are two new config keys:
- WAIT_PERIOD_CHECK_FILE_SIZE
- MAX_WAIT_FILE_SYNC
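
As a rough sketch of how these two keys might interact, the loop below polls until all files have their expected size or a time limit is reached. The values, the assumption that both keys are given in seconds, and the function name `wait_for_sizes` are illustrative only and not taken from this PR.

```python
import os
import time

# Hypothetical values, assumed to be in seconds.
WAIT_PERIOD_CHECK_FILE_SIZE = 30
MAX_WAIT_FILE_SYNC = 3600


def _has_size(path, size):
    """True if path is currently visible and has exactly the given size."""
    try:
        return os.path.getsize(path) == size
    except OSError:
        return False


def wait_for_sizes(expected_sizes,
                   wait_period=WAIT_PERIOD_CHECK_FILE_SIZE,
                   max_wait=MAX_WAIT_FILE_SYNC):
    """Poll until every path has its expected size.

    Returns True once all sizes match and False if max_wait elapses first.
    An empty mapping returns True immediately, i.e. no checks are run.
    """
    deadline = time.monotonic() + max_wait
    while True:
        if all(_has_size(p, s) for p, s in expected_sizes.items()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(wait_period)
```

Returning a boolean leaves the decision about what to do on timeout (fail the task, log a warning, continue anyway) to the caller.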
Note that the finished file is also put in the finished.tar.gz file at job cleanup.
Side note: This change does not break old setups. If no size info is available for a given job / task, no checks are run.
@critias: I wanted to get this PR merged before "vacation" (=no kindergarten) starts, which didn't happen because I spent last week in bed, not in front of a computer screen. As I won't be able to work on it until 8/13, I invite you to take over this PR. :-)
@critias I got sidetracked for quite some time ... How about the current situation of the cluster? Is it running so smoothly that this PR has become obsolete? Starting next week I'd have time to implement the changes you suggested.
I continued working on it here: https://github.com/rwth-i6/sisyphus/tree/check-output-size. It seems to be working ok for me, but I also got sidetracked since the overall situation got better. Let's have a call to discuss how to continue from here.