rail
missing isofrags.tar.gz in parallel mode (v0.2.3b)
Not sure if this is a bug or specific to my use case.
When running rail in parallel mode using ipcluster with Slurm, I get a RuntimeError saying that isofrags.tar.gz does not exist. If I restart from that point, everything finishes cleanly.
If I run rail in parallel on a single node with ipcluster (i.e. local instead of Slurm), everything runs cleanly.
I am guessing it has something to do with using Slurm. It's probably not your problem; I only bring it up because there is a mention of this in a commit log on the parallel branch. Please let me know if you have a known fix or a suggestion as to what might be going on.
Thanks, Justin
Thanks for the bug report! So the error output is exactly "The file isofrags.tar.gz does not exist and thus cannot be cached."?
Sounds like a race condition. It's still somewhat mysterious to me, but in dooplicity/emr_simulator.py, try replacing
    if not os.path.isfile(file_or_archive):
        iface.fail(('The file %s does not exist and thus cannot '
                    'be cached.') % file_or_archive,
                   steps=(job_flow[step_number:]
                          if step_number != 0 else None))
        failed = True
        raise RuntimeError
(lines 1422-1427) with something like
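    # Needs `import time` at the top of the module if it is not already there.
    # Poll for the file about once per second, giving up after a few retries.
    retries = 0
    while not os.path.isfile(file_or_archive):
        time.sleep(1)
        retries += 1
        if retries > 5:
            break
    # Fail exactly as before if the file still has not appeared.
    if not os.path.isfile(file_or_archive):
        iface.fail(('The file %s does not exist and thus cannot '
                    'be cached.') % file_or_archive,
                   steps=(job_flow[step_number:]
                          if step_number != 0 else None))
        failed = True
        raise RuntimeError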
and let me know what happens.
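For anyone who wants to try the same idea outside of emr_simulator.py, the retry loop can be sketched as a small standalone helper. This is only an illustration: `wait_for_file`, `max_retries`, and `delay` are hypothetical names, not part of dooplicity or rail, and the defaults just mirror the numbers suggested above.

```python
import os
import time


def wait_for_file(path, max_retries=5, delay=1.0):
    """Poll until `path` exists as a regular file or the retries run out.

    Hypothetical helper, not part of dooplicity. Returns True if the file
    is present, False if it is still missing after `max_retries` sleeps
    of `delay` seconds each.
    """
    retries = 0
    while not os.path.isfile(path):
        if retries >= max_retries:
            return False
        time.sleep(delay)
        retries += 1
    return True


if __name__ == "__main__":
    # Example: wait up to about 5 seconds before giving up.
    if not wait_for_file("isofrags.tar.gz"):
        print("isofrags.tar.gz still missing after waiting")
```

A helper like this only helps if the file eventually shows up, for example after a delay before it becomes visible to the process doing the check. Nothing in this thread confirms that as the cause, so treat it as a workaround rather than a diagnosis.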
This fixes the problem #37