Defer evaluation of (dynamic) destinations to allow multiple resubmissions
May fix https://github.com/galaxyproject/galaxy/issues/7118
Currently, resubmit definitions computed in dynamic destinations get lost because they are not persisted: https://github.com/galaxyproject/galaxy/blob/737775f586c5ab4c4cef4f486dd8826a978ed6b1/lib/galaxy/jobs/handler.py#L215
which (I guess) is a good idea, because in most cases (static destinations) they are identical to the info in job_conf (so it would be redundant to store them again). Unfortunately the consequence is that dynamic destinations can't compute resubmit destinations. Or in other words, they seem to be able to resubmit only once.
The idea of the PR is to defer the calculation of the dynamic destination. Previously, galaxy.jobs.runners.state_handlers.resubmit._handle_resubmit_definitions marked the job for resubmission and called the dynamic destination's function (but the resubmit part got lost when the job re-entered the queue).
Now the new destination is calculated (and cached) when the job re-enters the queue.
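To illustrate the idea with a toy, self-contained sketch (this is not the actual Galaxy code, all names and the tiny queue loop are made up):

# Toy sketch: the dynamic rule is evaluated each time the job is picked up
# from the queue, so every evaluation can define the next resubmit destination.
def dynamic_rule(destination_id):
    """Each tier points its resubmit at the next tier."""
    chain = ["tiny", "small", "medium"]
    resubmit = []
    if destination_id in chain and chain.index(destination_id) + 1 < len(chain):
        resubmit = [{"condition": "memory_limit_reached",
                     "destination": chain[chain.index(destination_id) + 1]}]
    return {"id": destination_id, "resubmit": resubmit}

# Eager (old behaviour): the rule runs once when the resubmit is scheduled and the
# computed resubmit list is not persisted, so after the first requeue it is gone.
# Deferred (this PR): only the destination id crosses the queue and the rule runs
# again on pickup, so a chain like tiny -> small -> medium can work.
queue = ["tiny"]
while queue:
    destination = dynamic_rule(queue.pop(0))  # evaluated on (re-)entry to the queue
    print("running on", destination["id"])
    if destination["resubmit"]:  # pretend the job hits the memory limit every time
        queue.append(destination["resubmit"][0]["destination"])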
The PR survived a first test on a dev instance with a single dynamic destination that resubmits to itself. But I guess there are side effects ....
I use dynamic destinations and I do have a working setup where multiple resubmissions happen. It boiled down to some complexity in the dynamic destinations and also some changes in the runner. Would you like to try that?
Hi @pcm32 . Sure. Sounds like a good idea.
Sorry for the delay. Basically I use the LSF cli runner plus dynamic destinations. For resubmissions to work I had to add a little change to the runner:
https://github.com/galaxyproject/galaxy/pull/6866#issuecomment-475547393
On my dynamic destinations I basically do:
from galaxy.jobs import JobDestination

job_destination = JobDestination(runner="cli", params=resource_params, id=settings['dest_id'])
if follow_up_destination is not None:
    job_destination['resubmit'] = [dict(
        condition="memory_limit_reached",
        destination="dynamic-lsf-" + follow_up_destination,
    )]
and then in my job_conf I have a number of dynamic destinations, where on each of them I set which destination to follow up with:
<destination id="dynamic-lsf-micro" runner="dynamic">
    <param id="type">python</param>
    <param id="function">lsf_wrapper_micro</param>
    <resubmit condition="memory_limit_reached" destination="dynamic-lsf-tiny"/>
</destination>
<destination id="dynamic-lsf-tiny" runner="dynamic">
    <param id="type">python</param>
    <param id="function">lsf_wrapper_tiny</param>
    <resubmit condition="memory_limit_reached" destination="dynamic-lsf-small"/>
</destination>
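Put together, the lsf_wrapper_* functions referenced above look roughly like this (a simplified sketch; the helper and the params are placeholders, and the real rules also fold in the memory/CPU values from the tool UI sliders):

# Simplified sketch of how the wrappers chain destinations; params are placeholders.
from galaxy.jobs import JobDestination

def _lsf_destination(dest_id, resource_params, follow_up_destination=None):
    job_destination = JobDestination(runner="cli", params=resource_params, id=dest_id)
    if follow_up_destination is not None:
        job_destination['resubmit'] = [dict(
            condition="memory_limit_reached",
            destination="dynamic-lsf-" + follow_up_destination,
        )]
    return job_destination

def lsf_wrapper_micro(resource_params):
    return _lsf_destination("dynamic-lsf-micro", resource_params, follow_up_destination="tiny")

def lsf_wrapper_tiny(resource_params):
    return _lsf_destination("dynamic-lsf-tiny", resource_params, follow_up_destination="small")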
Now that I look at it with fresh eyes, I think that there is some level of redundancy between the dynamic destination code and the XML (you can probably live with one or the other).
Thanks @pcm32
The runner that I use (univa) already handles OOM (working in production with static destinations in job_conf.xml), so this should be fine.
I'm wondering which destinations and params are used for the subsequent submissions on your installation. In particular, can more than one resubmission happen in your case?
Yes, multiple resubmissions actually happen; we have 7 destinations that get tried with increasing amounts of memory, one after the other if they fail (all automatic).
unfortunately these are in internal infrastructure repos... let me try to make some gists to show the details...
https://gist.github.com/pcm32/a3311ab5eaec6698967dda8ab74296fe
Thanks @pcm32 .. I'm pretty sure that your setup uses the resubmit definitions as defined in the job_conf.xml and not the definitions computed in the python script (I guess they are the same anyway?).
Do you have the possibility to test this? E.g.:
- if you replace destination="dynamic-lsf-"+follow_up_destination with destination="hey_this_does_not_exist", then I'm pretty sure that your setup still works
- if you would replace/remove the resubmit definitions from the job_conf.xml, then it should fail
The point of the PR is that it's currently not possible to compute resubmit definitions in dynamic destinations.
I'm pretty sure that your setup uses the resubmit definitions as defined in the job_conf.xml and not the definitions computed in the python script (I guess they are the same anyway?).
So I know that it uses both, because we add some elements dynamically in the python code. It uses pieces defined in the XML (like the resubmission destination) and then adds the memory and CPU as specified in the python part (we also add the input from the CPU and memory sliders in the tool UI, if activated). If I remove elements from the XML, it fails (so the destinations I'm referring to need to exist in the XML). That is, if I'm understanding your question/assertion correctly.
Maybe I should have emphasised this more, but I suspect that if something similar to the scheme I mentioned (did you try something like this?) doesn't work with your runner, it possibly boils down to the comment that I quoted; the relevant part is not the OOM handling, but marking the job as failed here:
https://github.com/galaxyproject/galaxy/pull/6866/files#diff-ffa83d3b10370ee42672282a77b1fa8286a1005c8296bd7d943ea7833b2e4928R183
Without that, resubmissions never worked for me. Have you checked if UNIVA deals with that in a similar way? In that same PR @mvdbeek observes that the only other runner that (back then) had working resubmissions was slurm, and slurm was doing a very similar thing.
Without that, resubmissions never worked for me. Have you checked if UNIVA deals with that in a similar way?
Basic resubmission with static destinations works fine with UNIVA. So I assume it should also work with dynamic destinations?
@pcm32 do you have the possibility to test this: https://github.com/galaxyproject/galaxy/pull/9747#issuecomment-635874571
This would test the main point of the PR: resubmission destinations cannot be computed in the python script, but are reloaded from the job configuration, i.e. they are static - see also https://github.com/galaxyproject/galaxy/issues/7118
It seems you have a local test case, could you set up an integration test to make sure that at least what you're trying here works?
Will try
I have just tried this:
if you would replace/remove the resubmit definitions from the job_conf.xml then it should fail
and as you say, it doesn't resubmit if it is removed.
I'm pushing off the milestone but it is a bug fix and we can backport if we figure this out. I have a vague sense you're right and there is a problem but I wrote half of this and cannot reason about it - I think an integration test as @mvdbeek suggested is a good idea. There are a bunch of integration tests for dynamic job runners and resubmission respectively - e.g. https://github.com/galaxyproject/galaxy/blob/dev/test/integration/test_job_resubmission.py - it might be worth tweaking one of those?
I agree 100% .. thanks for the tip with the integration test.
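Roughly what I have in mind (just a sketch, modeled on test_job_resubmission.py; the job conf file, the dynamic rules it references, and the tool id are placeholders that would still need to be added):

# Sketch of an integration test for dynamic-destination resubmission.
import os

from galaxy_test.base.populators import DatasetPopulator
from galaxy_test.driver import integration_util

SCRIPT_DIRECTORY = os.path.abspath(os.path.dirname(__file__))
# placeholder: a job conf wiring two dynamic destinations chained via resubmit,
# plus the python rules file defining them
JOB_CONFIG_FILE = os.path.join(SCRIPT_DIRECTORY, "resubmission_dynamic_job_conf.yml")


class DynamicDestinationResubmissionTestCase(integration_util.IntegrationTestCase):
    framework_tool_and_types = True

    @classmethod
    def handle_galaxy_config_kwds(cls, config):
        config["job_config_file"] = JOB_CONFIG_FILE

    def setUp(self):
        super().setUp()
        self.dataset_populator = DatasetPopulator(self.galaxy_interactor)

    def test_job_is_resubmitted_twice(self):
        # run a tool (placeholder id) that fails on the first two dynamic
        # destinations and only succeeds on the third; if dynamically computed
        # resubmit definitions are honoured the history ends up ok, otherwise in error
        with self.dataset_populator.test_history() as history_id:
            self.dataset_populator.run_tool("exit_code_oom", {}, history_id)
            self.dataset_populator.wait_for_history(history_id, assert_ok=True)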
We're talking about the same issue in #12852. I've tested your patch with the tests I made in #12852, but unfortunately it fails :(
Stalled for too long here. I added an integration test (in https://github.com/galaxyproject/galaxy/pull/13284) that should pass with this PR and fail on dev.
Cool! Have you also seen the tests I put in https://github.com/galaxyproject/galaxy/pull/12852? Maybe they can be useful too (or maybe not :D)
So __recover_job_wrapper is called when a handler (re-)starts, and when re-submitting a job
@mvdbeek do you see a way to restart the handler during the test (or wouldn't this be useful)?
The comment in the removed code:
# Cache the destination to prevent rerunning dynamic after
# resubmit
makes me nervous: is there a good reason for not allowing what I'm trying to do? Maybe we should forbid cycles in the re-submission (can we have cycles in statically defined destinations?)?
@abretaud I cherry-picked your commits and suggest a "fix" (we can also move the fix to your branch .. maybe smaller PRs are better .. or we keep aiming at the "big picture").
summarizing some of my thoughts:
The current state seems to be that only id, runner and params of a destination are persisted in the DB, but resubmit and env aren't; the same probably holds for other destination properties: https://github.com/galaxyproject/galaxy/blob/04ef6387c5b0faf2e71e6c75aa70804a48f125c4/lib/galaxy/jobs/__init__.py#L97
In the case of resubmit, the code (comment) suggests that this is a good idea, but the resubmit info that is statically defined in the destination is recovered (in the latest commit, https://github.com/galaxyproject/galaxy/pull/9747/commits/28720386198d3c703efe1609514fc0f57ac42f59, the env that is statically defined in the destination is also restored).
So overall, dynamic re-submission seems to be limited to computing params and runner dynamically. Dynamic computation of everything else (resubmit, env, ...) seems to work only for one re-submission (actually I do not understand why it works for the first, but not for the others).
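For illustration, this is roughly all that can be rebuilt from the DB after a requeue or a handler restart (illustration only, not the actual recovery code):

# Illustration only: after a requeue/handler restart, only these pieces of a
# destination can be rebuilt from what the Job table persists.
from galaxy.jobs import JobDestination

def recover_destination(job):
    return JobDestination(
        id=job.destination_id,          # persisted on the Job table
        params=job.destination_params,  # persisted on the Job table
        runner=job.job_runner_name,     # persisted on the Job table
        # resubmit=..., env=...         # not persisted: they have to be copied
        #                               # back from the static job_conf entry, so
        #                               # anything a dynamic rule computed is gone
    )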
For the case of a handler restart the only solution seems to be to restart the job from scratch if it failed and it is on a dynamic destination .. or let Galaxy persist everything.
For many/most use cases it might be sufficient to define resubmits and envs statically in the job_conf. For my instance I would like to use a fixed number of runtime escalation steps ([10min, 1day, 1week]) and define that certain tools enter at the 2nd or 3rd step. Then a fixed resubmission definition does not work and I see no workaround.
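To make the use case concrete, this is the kind of rule I have in mind (a sketch only; the tool ids, walltimes, runner and params are just examples):

# Sketch of a fixed walltime escalation chain where some tools enter at a later
# step; this only works if the resubmit definition computed here survives more
# than one resubmission.
from galaxy.jobs import JobDestination

WALLTIME_STEPS = ["10min", "1day", "1week"]
TOOL_ENTRY_STEP = {"trinity": 1, "unicycler": 2}  # example tool ids

def escalating_destination(tool, step=None):
    if step is None:
        step = TOOL_ENTRY_STEP.get(tool.id, 0)
    destination = JobDestination(
        id="walltime_%s" % WALLTIME_STEPS[step],
        runner="slurm",
        params={"walltime": WALLTIME_STEPS[step]},  # placeholder params
    )
    if step + 1 < len(WALLTIME_STEPS):
        destination["resubmit"] = [dict(
            condition="walltime_reached",
            destination="walltime_%s" % WALLTIME_STEPS[step + 1],
        )]
    return destination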
With the current state some job properties seemingly get lost when resubmitted, e.g. Job.destination_params. Without this change the destination_params of the previous submission are available.