fireworks
task-level recovery might not work if a FireTask changes dirs
See:
https://github.com/hackingmaterials/atomate/issues/217
I already fixed it so that FW_offline.json is loaded from the correct place. But it looks like the checkpoint object always goes to the "launch_dir" (i.e., the initial launch dir) to run recovery tasks. However, if one of the tasks in a FireWork changed directories (e.g., task 3), then subsequent tasks (e.g., tasks 4-6) should be run in that changed directory. So if you do task-level rerun at task 4, it should be in that changed directory.
It seems to me that _prev_dir in the checkpointing dict should not be created by LaunchPad but rather using os.getcwd() when the rocket does the checkpointing. Or, there should be both _prev_dir (i.e., launch_dir...) and _cur_dir (directory of the most recently executing task) in the checkpoint object. When doing task-level recovery, one should use the latter as the directory.
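A rough sketch of the second option, to make the idea concrete. This is not FireWorks code: make_checkpoint, recovery_dir, and the task_index key are hypothetical names made up for illustration, and _cur_dir is the proposed key, not an existing one. It only shows recording os.getcwd() at checkpoint time and preferring it over _prev_dir on recovery.

```python
import os
import tempfile


def make_checkpoint(launch_dir, task_index):
    """Build a checkpoint dict (hypothetical helper, not the FireWorks API)."""
    return {
        "task_index": task_index,   # hypothetical key: task to resume from
        "_prev_dir": launch_dir,    # the initial launch directory
        "_cur_dir": os.getcwd(),    # proposed key: where the rocket is right now
    }


def recovery_dir(checkpoint):
    """Prefer _cur_dir; fall back to _prev_dir for older checkpoints."""
    return checkpoint.get("_cur_dir") or checkpoint["_prev_dir"]


def demo():
    """Simulate task 3 changing directories before task 4 is checkpointed."""
    original = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        launch_dir = os.path.realpath(tmp)
        try:
            os.chdir(launch_dir)
            work_dir = os.path.join(launch_dir, "subdir")
            os.mkdir(work_dir)
            os.chdir(work_dir)  # "task 3" changes directories
            checkpoint = make_checkpoint(launch_dir, task_index=4)
        finally:
            os.chdir(original)  # leave the temp dir before it is removed
    return checkpoint, launch_dir


if __name__ == "__main__":
    checkpoint, launch_dir = demo()
    print(os.path.relpath(recovery_dir(checkpoint), launch_dir))
```

With this layout a task-level rerun at task 4 would chdir into recovery_dir(checkpoint) rather than the launch directory, and checkpoints written without _cur_dir would still behave as they do now.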
@montoyjh Any thoughts? A lot of this was your code so want to make sure it plays nicely with your knowledge
Is it possible to get the absolute path of the FW_offline.json, rather than the relative path, and use that for the remainder of the firework?
EDIT: Never mind, I didn't understand the issue at first.
Yeah, I think having os.getcwd() define the directory to be changed into on a task level recovery should be good (this assumes that the recovery firetask won't try to create a new subdirectory and copy on a relative path or anything). It's gonna get a bit messy if a user tries to recover on a firetask that makes a directory, cds, and does something though.
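To illustrate the messy case, here is a small standalone sketch (no FireWorks involved; mkdir_and_cd_task and rerun_from are hypothetical stand-ins for a firetask and a task-level rerun). It assumes the recorded directory is whatever os.getcwd() returned after the task had already changed into its subdirectory, so rerunning the same task from there nests the directory one level deeper.

```python
import os
import tempfile


def mkdir_and_cd_task(name="calc"):
    """Hypothetical task: make a subdirectory on a relative path and cd into it."""
    os.makedirs(name, exist_ok=True)
    os.chdir(name)


def rerun_from(start_dir):
    """Hypothetical rerun: chdir to the recorded directory, then rerun the task."""
    os.chdir(start_dir)
    mkdir_and_cd_task()
    return os.getcwd()


def demo():
    original = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        launch_dir = os.path.realpath(tmp)
        try:
            os.chdir(launch_dir)
            mkdir_and_cd_task()
            recorded = os.getcwd()             # directory recorded after the cd
            recovered = rerun_from(recorded)   # same task rerun from that directory
            return (
                os.path.relpath(recorded, launch_dir),
                os.path.relpath(recovered, launch_dir),
            )
        finally:
            os.chdir(original)  # leave the temp dir before it is removed


if __name__ == "__main__":
    print(demo())
```

The first run ends in calc under the launch directory, while the rerun ends in calc/calc, which is the kind of surprise a task that resolves relative paths would hit after recovery.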