WIP: Check size of input files to handle nfs sync issues

Open mirkovogel opened this issue 3 years ago • 5 comments

After a discussion with @critias, we opted for the following design, which handles sync issues both between jobs and between tasks.

  • When a task finishes, the LoggingThread writes the size and the mtime of all files below work and output to the resources file (-> Job._sis_get_file_stats)
  • When a task is set up, it calls Task._wait_for_input_to_sync. There the expected sizes are obtained by calling Job._sis_get_expected_file_sizes(job_dir, task) ...
    • for all jobs it depends upon, if it is the first task (then only the files listed as inputs are retained)
    • for the preceding task otherwise.
  • The expected file sizes are then compared to the actual sizes.
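
To illustrate the record-and-compare idea, here is a minimal standalone sketch. It is not the code from this PR: `collect_file_stats` and `find_unsynced` are hypothetical helper names, and the stats are kept in a plain dict rather than in the file the LoggingThread writes.

```python
import os


def collect_file_stats(root):
    """Return {relative_path: (size, mtime)} for all regular files below root.

    Hypothetical helper, only meant to mirror what is recorded at task end.
    """
    stats = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            stats[os.path.relpath(path, root)] = (st.st_size, st.st_mtime)
    return stats


def find_unsynced(root, expected_sizes):
    """Return the sorted relative paths whose size on disk differs from the
    expected size, or which are not visible at all (e.g. not yet synced)."""
    unsynced = []
    for rel_path, expected in expected_sizes.items():
        try:
            actual = os.path.getsize(os.path.join(root, rel_path))
        except OSError:
            actual = None  # file missing or not readable yet
        if actual != expected:
            unsynced.append(rel_path)
    return sorted(unsynced)
```

A task would then only start once `find_unsynced` returns an empty list for every directory it reads from.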

There are two new config keys:

  • WAIT_PERIOD_CHECK_FILE_SIZE
  • MAX_WAIT_FILE_SYNC
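
How the two keys might interact is sketched below. This is an assumption-laden illustration, not the PR's implementation: the values and the unit (seconds) are made up, `wait_for_sync` is a hypothetical name, and the sketch assumes WAIT_PERIOD_CHECK_FILE_SIZE is the pause between checks while MAX_WAIT_FILE_SYNC bounds the total waiting time.

```python
import os
import time

# Assumed values and units; the thread only names the keys.
WAIT_PERIOD_CHECK_FILE_SIZE = 30   # assumed: seconds between two checks
MAX_WAIT_FILE_SYNC = 600           # assumed: give up after this many seconds


def _size_or_none(path):
    """Return the file size, or None if the file is not visible yet."""
    try:
        return os.path.getsize(path)
    except OSError:
        return None


def wait_for_sync(expected_sizes,
                  wait_period=WAIT_PERIOD_CHECK_FILE_SIZE,
                  max_wait=MAX_WAIT_FILE_SYNC,
                  sleep=time.sleep):
    """Poll until every path in expected_sizes has its expected size.

    Returns True once all sizes match, False if max_wait is exceeded.
    The sleep function is injectable so the loop can be tested quickly.
    """
    waited = 0
    while True:
        pending = [p for p, size in expected_sizes.items()
                   if _size_or_none(p) != size]
        if not pending:
            return True
        if waited >= max_wait:
            return False
        sleep(wait_period)
        waited += wait_period
```

Whether a timeout should raise an error or only log a warning is a design choice this sketch leaves open.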

mirkovogel avatar Jul 02 '21 09:07 mirkovogel

Note that the finished file is also put in the finished.tar.gz file at job cleanup.

curufinwe avatar Jul 02 '21 12:07 curufinwe

Side note: This change does not break old setups. If no size info is available for a given job / task, no checks are run.

mirkovogel avatar Jul 07 '21 10:07 mirkovogel

@critias: I wanted to get this PR merged before "vacation" (= no kindergarten) starts, which didn't happen because I spent last week in bed, not in front of a computer screen. As I won't be able to do so until 8/13, I invite you to take over this PR. :-)

mirkovogel avatar Jul 15 '21 16:07 mirkovogel

@critias I got sidetracked for quite some time ... How about the current situation of the cluster? Is it running so smoothly that this PR has become obsolete? Starting next week I'd have time to implement the changes you suggested.

mirkovogel avatar Sep 13 '21 19:09 mirkovogel

I continued working on it here: https://github.com/rwth-i6/sisyphus/tree/check-output-size. It seems to be working OK for me, but I also got sidetracked since the overall situation got better. Let's have a call to discuss how to continue from here.

critias avatar Sep 14 '21 08:09 critias