bioscons
Add ensure_exists option to ensure filesystem is in sync before Command action returns
Some of us have noticed that scons occasionally thinks that files have changed and need to be rebuilt, even when this should not be the case. This can be particularly annoying with long-running jobs, or jobs with some degree of randomness, as it can lead to all downstream targets being rebuilt unnecessarily.
After some snooping around, I've discovered that this only seems to happen when running on the cluster, and specifically seems to be related to the parent scons process not seeing the changes to the file (or filesystem), and consequently reading an incorrect (presumably null) MD5 hash.
This problem can be solved by appending an action to the end of the command string that ensures that the file exists before returning. The ideal solution would require that a flag be set on SlurmEnvironment to turn on this behavior if desired, defaulting to the current behavior otherwise. It should also be possible to turn this on or off for a specific Command, as well as to specify the maximum wait time.
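As a rough illustration of the "ensure the file exists, up to a maximum wait time" idea, here is a minimal sketch of a polling helper. The name `wait_for_file` and its `max_wait` and `interval` parameters are hypothetical and not part of bioscons or SCons; how such a check would be attached to a Command action is left open.

```python
import os
import time


def wait_for_file(path, max_wait=30.0, interval=0.5):
    """Poll until `path` exists or `max_wait` seconds have elapsed.

    Hypothetical helper: returns True if the file became visible in time,
    False otherwise. It does not check the file's contents.
    """
    deadline = time.monotonic() + max_wait
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep briefly before checking again so we do not spin on the filesystem.
        time.sleep(interval)
```

A call such as `wait_for_file('output/second.csv', max_wait=60)` would block for at most about 60 seconds. Returning a boolean rather than raising leaves it to the caller to decide whether a missing target should fail the build.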
Minimal reproducible test case:
tgt1 = env.Command(path.join(outdir, 'first.csv'), input,
                   'csvcut -C other $SOURCE > $TARGET', use_cluster=False)
tgt2 = env.Command(path.join(outdir, 'second.csv'), tgt1,
                   'csvsort -c this $SOURCE > $TARGET', use_cluster=True)
tgt3 = env.Command(path.join(outdir, 'third.csv'), [tgt2, input],
                   'csvjoin -c this $SOURCES > $TARGET', use_cluster=False)
Note that running scons on this SConscript, followed by scons --debug explain -n upon completion, leads to a message indicating that the target has changed and needs to be rebuilt.