missing isofrags.tar.gz in parallel mode (v0.2.3b)

Open jfear opened this issue 8 years ago • 2 comments

Not sure if this is a bug or specific to my use case.

When running rail in parallel mode using ipcluster with Slurm, I get a RuntimeError saying that isofrags.tar.gz does not exist. If I restart from that point, everything finishes cleanly.

If I run rail in parallel on a single node with ipcluster (i.e. local instead of Slurm), everything runs cleanly.

I am guessing it has something to do with using Slurm. It's probably not your problem; I only bring it up because there is a mention of this in a commit log on the parallel branch. Please let me know if you have a known fix or a suggestion as to what might be going on.

Thanks, Justin

jfear, Apr 18 '16 19:04

Thanks for the bug report! So the error output is exactly "The file isofrags.tar.gz does not exist and thus cannot be cached."?

Sounds like a race condition. Still somewhat mysterious to me, but in dooplicity/emr_simulator.py try replacing

            if not os.path.isfile(file_or_archive):
                iface.fail(('The file %s does not exist and thus cannot '
                            'be cached.') % file_or_archive,
                            steps=(job_flow[step_number:]
                                        if step_number != 0 else None))
                failed = True
                raise RuntimeError

(lines 1422-1427) with something like

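            # Poll before failing, in case the file shows up late (suspected
            # race condition). Sleeps at most 6 times, 1 second each.
            # Assumes `time` is imported in this module.
            retries = 0
            while not os.path.isfile(file_or_archive):
                time.sleep(1)
                retries += 1
                if retries > 5:
                    break
            # Still missing after waiting: report the failure as before.
            if not os.path.isfile(file_or_archive):
                iface.fail(('The file %s does not exist and thus cannot '
                            'be cached.') % file_or_archive,
                            steps=(job_flow[step_number:]
                                        if step_number != 0 else None))
                failed = True
                raise RuntimeError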

and let me know what happens.
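
For anyone who wants to try the wait-and-retry idea outside of emr_simulator.py, here is a minimal standalone sketch of the same polling pattern. The function name `wait_for_file` and its `retries` and `delay` parameters are made up for illustration and are not part of Rail-RNA or dooplicity; the only real calls are `os.path.isfile` and `time.sleep`.

```python
import os
import time


def wait_for_file(path, retries=5, delay=1.0):
    """Poll until `path` exists as a regular file.

    Hypothetical helper for illustration only. Checks once immediately,
    then up to `retries` more times, sleeping `delay` seconds between
    checks. Returns True if the file was seen, False otherwise.
    """
    for attempt in range(retries + 1):
        if os.path.isfile(path):
            return True
        # Do not sleep after the final check; just give up.
        if attempt < retries:
            time.sleep(delay)
    return False
```

With a helper like this, the check in the snippet above could presumably be written as `if not wait_for_file(file_or_archive):` followed by the existing `iface.fail(...)` and `raise RuntimeError`. Note that this only papers over a file that appears late; it does not explain why the file is late in the first place.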

nellore, Apr 18 '16 20:04

This fixes the problem #37

jfear, May 26 '16 14:05