
Defer evaluation of (dynamic) destinations to allow multiple resubmissions

Open bernt-matthias opened this issue 5 years ago • 23 comments

May fix https://github.com/galaxyproject/galaxy/issues/7118

Currently, resubmit definitions computed in dynamic destinations get lost because they are not persisted: https://github.com/galaxyproject/galaxy/blob/737775f586c5ab4c4cef4f486dd8826a978ed6b1/lib/galaxy/jobs/handler.py#L215

which is (I guess) a good idea, because in most cases (static destinations) they are identical to the info in job_conf (so it would be redundant to store them again). Unfortunately, the consequence is that dynamic destinations can't compute resubmit destinations. In other words, they seem to be able to resubmit only once.

The idea of the PR is to defer the calculation of the dynamic destination. Previously, galaxy.jobs.runners.state_handlers.resubmit._handle_resubmit_definitions marked the job for resubmission and called the dynamic destination's function (but the resubmit part got lost when the job reentered the queue).

Now the new destination is calculated (and cached) when the job reenters the queue.
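
Roughly, the mechanism (a toy sketch with hypothetical names, not Galaxy's actual classes or the actual diff):

class DeferredDestinationWrapper:
    def __init__(self, dynamic_rule):
        self._dynamic_rule = dynamic_rule   # callable returning a JobDestination
        self._cached_destination = None

    @property
    def job_destination(self):
        # lazy: the first access after (re-)entering the queue runs the
        # dynamic rule, so a resubmit definition computed in Python is
        # available again; the result is cached for this queue entry only
        if self._cached_destination is None:
            self._cached_destination = self._dynamic_rule()
        return self._cached_destination

    def requeue(self):
        # on resubmission the cache is dropped, deferring re-evaluation
        self._cached_destination = None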

The PR survived a first test on a dev instance with a single dynamic destination that resubmits to itself. But I guess there are side effects ...

bernt-matthias avatar May 08 '20 21:05 bernt-matthias

I use dynamic destinations and I have a working setup where multiple resubmissions happen. It boiled down to some complexity in the dynamic destinations and also some changes in the runner. Would you like to try that?

pcm32 avatar May 23 '20 00:05 pcm32

Hi @pcm32 . Sure. Sounds like a good idea.

bernt-matthias avatar May 26 '20 09:05 bernt-matthias

Sorry for the delay. Basically I use the LSF cli runner plus dynamic destinations. For resubmissions to work I had to add a little change to the runner:

https://github.com/galaxyproject/galaxy/pull/6866#issuecomment-475547393

On my dynamic destinations I basically do:

from galaxy.jobs import JobDestination

job_destination = JobDestination(runner="cli", params=resource_params, id=settings['dest_id'])
if follow_up_destination is not None:
    # compute the resubmit definition dynamically, pointing at the next
    # destination in the chain
    job_destination['resubmit'] = [dict(
        condition="memory_limit_reached",
        destination="dynamic-lsf-" + follow_up_destination,
    )]

and then in my job_conf I have a number of dynamic destinations; on each of them I set which single destination to follow up with:

<destination id="dynamic-lsf-micro" runner="dynamic">
      <param id="type">python</param>
      <param id="function">lsf_wrapper_micro</param>
      <resubmit condition="memory_limit_reached" destination="dynamic-lsf-tiny"/>
    </destination>
    <destination id="dynamic-lsf-tiny" runner="dynamic">
      <param id="type">python</param>
      <param id="function">lsf_wrapper_tiny</param>
      <resubmit condition="memory_limit_reached" destination="dynamic-lsf-small"/>
    </destination>
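
To make the shape of this concrete: a stripped-down version of such a rule could look like the following (just a sketch, not our actual production code; the function bodies, memory values and the param name are illustrative):

from galaxy.jobs import JobDestination

def _lsf_destination(dest_id, mem_mb, follow_up=None):
    # id, runner and params are what Galaxy persists; the resubmit list
    # computed below is exactly the part that gets lost on re-queueing
    job_destination = JobDestination(
        id=dest_id,
        runner="cli",
        params={"nativeSpecification": "-M %d" % mem_mb},  # assumed; real params depend on the cli/LSF plugin
    )
    if follow_up is not None:
        job_destination["resubmit"] = [dict(
            condition="memory_limit_reached",
            destination="dynamic-lsf-" + follow_up,
        )]
    return job_destination

def lsf_wrapper_micro():
    return _lsf_destination("dynamic-lsf-micro", mem_mb=2048, follow_up="tiny")

def lsf_wrapper_tiny():
    return _lsf_destination("dynamic-lsf-tiny", mem_mb=4096, follow_up="small")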

now that I look at it with fresh eyes, I think there is some level of redundancy between the dynamic destination code and the XML (you can probably live with one or the other).

pcm32 avatar May 28 '20 16:05 pcm32

Thanks @pcm32

The runner that I use (univa) already handles OOM (working in production with static destinations in job_conf.xml), so this should be fine.

I'm wondering which destinations and params are used subsequently on your installation. In particular, I'm wondering if in your case more than one resubmission can happen?

bernt-matthias avatar May 28 '20 16:05 bernt-matthias

Yes, multiple resubmissions actually happen; we have 7 destinations that get tried with increasing amounts of memory, one after the other if they fail (all automatic).

pcm32 avatar May 28 '20 16:05 pcm32

unfortunately these are in internal infrastructure repos... let me try to make some gists to show the details...

pcm32 avatar May 28 '20 16:05 pcm32

https://gist.github.com/pcm32/a3311ab5eaec6698967dda8ab74296fe

pcm32 avatar May 28 '20 16:05 pcm32

Thanks @pcm32 .. I'm pretty sure that your setup uses the resubmit definitions as defined in the job_conf.xml and not the definitions computed in the python script (I guess they are the same anyway?).

Do you have the possibility to test this? E.g. by

  • replacing destination="dynamic-lsf-"+follow_up_destination with destination="hey_this_does not_exist": then I'm pretty sure that your setup still works
  • replacing/removing the resubmit definitions from the job_conf.xml: then it should fail

The point of the PR is that it's not possible to compute resubmit definitions in dynamic destinations.

bernt-matthias avatar May 29 '20 09:05 bernt-matthias

I'm pretty sure that your setup uses the resubmit definitions as defined in the job_conf.xml and not the definitions computed in the python script (I guess they are the same anyway?).

So I know that it uses both, because we add some elements dynamically in the Python code. It uses pieces defined in the XML (like the resubmission destination) and then adds the memory and CPU as specified in the Python part (we also add the input from the CPU and memory sliders in the tool UI, if activated). If I remove elements from the XML, it fails (so the destinations I'm referring to need to exist in the XML). This is if I'm understanding your question/assertion correctly.

pcm32 avatar Jan 12 '21 12:01 pcm32

Maybe I should have put more emphasis on this, but I suspect that if something similar to the scheme I mentioned (did you try something like this?) doesn't work with your runner, it possibly boils down to the comment that I quoted; the emphasis is not on the OOM handling, but on marking the job as failed here:

https://github.com/galaxyproject/galaxy/pull/6866/files#diff-ffa83d3b10370ee42672282a77b1fa8286a1005c8296bd7d943ea7833b2e4928R183

Without that, resubmissions never worked for me. Have you checked if UNIVA deals with that in a similar way? In that same PR @mvdbeek observes that the only other runner that (back then) had working resubmissions was slurm, and slurm was doing a very similar thing.
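
Sketched, the relevant part of the runner change has this shape (not the literal diff from #6866; mark_as_failed/mark_as_finished are the generic hooks of Galaxy's asynchronous runners, the MEMLIMIT check and variable names are illustrative):

# inside the runner's monitoring loop (sketch): a job that left the queue
# in a failed state must be routed through the failure path, otherwise the
# resubmit state handlers never run and no resubmission happens
if external_status == "MEMLIMIT":  # illustrative LSF status for OOM kills
    async_job_state.fail_message = "Tool failed due to insufficient memory."
    async_job_state.runner_state = JobState.runner_states.MEMORY_LIMIT_REACHED
    self.mark_as_failed(async_job_state)   # -> fail_job -> resubmit handlers
else:
    self.mark_as_finished(async_job_state)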

pcm32 avatar Jan 12 '21 12:01 pcm32

Without that, resubmissions never worked for me. Have you checked if UNIVA deals with that in a similar way?

Basic resubmission with static destinations works fine with UNIVA. So I assume it should also work with dynamic destinations?

bernt-matthias avatar Jan 12 '21 13:01 bernt-matthias

@pcm32 do you have the possibility to test this: https://github.com/galaxyproject/galaxy/pull/9747#issuecomment-635874571

this would test the main point of the PR: resubmission destinations cannot be computed in the python script but are reloaded from the job configuration, i.e. they are static - see also https://github.com/galaxyproject/galaxy/issues/7118

bernt-matthias avatar Jan 12 '21 13:01 bernt-matthias

It seems you have a local test case, could you set up an integration test to make sure that at least what you're trying here works?

Will try

bernt-matthias avatar Jan 12 '21 13:01 bernt-matthias

I have just tried this:

replacing/removing the resubmit definitions from the job_conf.xml: then it should fail

and as you say, it doesn't resubmit if it is removed.

pcm32 avatar May 12 '21 14:05 pcm32

I'm pushing off the milestone, but it is a bug fix and we can backport if we figure this out. I have a vague sense you're right and there is a problem, but I wrote half of this and cannot reason about it - I think an integration test as @mvdbeek suggested is a good idea. There are a bunch of integration tests for dynamic job runners and resubmission respectively - e.g. https://github.com/galaxyproject/galaxy/blob/dev/test/integration/test_job_resubmission.py - it might be worth tweaking one of those?
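
A sketch of what that could look like, modeled loosely on test_job_resubmission.py (the job conf file and tool id below are placeholders):

import os

from galaxy_test.driver import integration_util

# placeholder: a job conf whose dynamic rule computes the resubmit
# definition in Python instead of declaring it in the XML
JOB_CONFIG_FILE = os.path.join(os.path.dirname(__file__), "dynamic_resubmission_job_conf.xml")

class DynamicResubmissionIntegrationTestCase(integration_util.IntegrationTestCase):
    framework_tool_and_types = True

    @classmethod
    def handle_galaxy_config_kwds(cls, config):
        config["job_config_file"] = JOB_CONFIG_FILE

    def test_dynamically_computed_resubmission(self):
        # the tool fails on the first destination; the test passes only if
        # the dynamically computed resubmit definition survives re-queueing
        self._run_tool_test("exit_code_oom")  # placeholder tool id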

jmchilton avatar Sep 08 '21 12:09 jmchilton

I agree 100% .. thanks for the tip with the integration test.

bernt-matthias avatar Sep 08 '21 13:09 bernt-matthias

Talking about the same issue in #12852: I've tested your patch with the tests I've made there, but unfortunately it fails :(

abretaud avatar Nov 08 '21 08:11 abretaud

Stalled for too long here. I added an integration test that should succeed here and fail on dev (in https://github.com/galaxyproject/galaxy/pull/13284).

bernt-matthias avatar Feb 01 '22 14:02 bernt-matthias

Cool! Have you seen the tests I've put in https://github.com/galaxyproject/galaxy/pull/12852 too? Maybe they can be useful as well (or maybe not :D)

abretaud avatar Feb 01 '22 14:02 abretaud

So __recover_job_wrapper is called when a handler (re-)starts, and when re-submitting a job

@mvdbeek do you see a way to restart the handler during the test (or wouldn't this be useful)?

bernt-matthias avatar Feb 01 '22 16:02 bernt-matthias

The comment in the removed code:

# Cache the destination to prevent rerunning dynamic after
# resubmit

makes me nervous: is there a good reason for not allowing what I'm trying to do? Maybe forbid cycles in the re-submission (can we have cycles in statically defined destinations?)?
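
E.g. cycles in statically defined resubmit chains could be detected with something like this (a sketch; the input format is an assumption):

def find_resubmit_cycle(resubmit_targets):
    # resubmit_targets: dict mapping a destination id to the list of
    # destination ids it may resubmit to (as parsed from job_conf)
    for start in resubmit_targets:
        seen = {start}
        frontier = list(resubmit_targets.get(start, []))
        while frontier:
            dest = frontier.pop()
            if dest == start:
                return start  # resubmission chain leads back to `start`
            if dest not in seen:
                seen.add(dest)
                frontier.extend(resubmit_targets.get(dest, []))
    return None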

bernt-matthias avatar Feb 02 '22 09:02 bernt-matthias

@abretaud I cherry-picked your commits and suggest a "fix" (we can also move the fix to your branch .. maybe smaller PRs are better .. or we keep aiming at the "big picture").

summarizing some of my thoughts:

The current state seems to be that only id, runner and params of a destination are persisted in the DB, but resubmit and env aren't; the same probably holds for other destination properties: https://github.com/galaxyproject/galaxy/blob/04ef6387c5b0faf2e71e6c75aa70804a48f125c4/lib/galaxy/jobs/__init__.py#L97

In the case of resubmit, the code (comment) suggests that this is a good idea, but the resubmit info that is statically defined in the destination is recovered (in the latest commit, https://github.com/galaxyproject/galaxy/pull/9747/commits/28720386198d3c703efe1609514fc0f57ac42f59, the env that is statically defined in the destination is also restored).

So overall, dynamic re-submission seems to be limited to computing params and runner dynamically. Dynamic computation of everything else (resubmit, env, ...) seems to work only for one re-submission (actually I do not understand why it works for the first, but not the others). For the case of a handler restart, the only solution seems to be to restart the job from scratch if it failed and is on a dynamic destination .. or let Galaxy persist everything.

For many/most use cases it might be sufficient to define resubmits and envs statically in the job_conf. For my instance I would like to use a fixed number of runtime escalation steps ([10min, 1day, 1week]) and define for certain tools to enter at the 2nd or 3rd step. Then a fixed resubmission definition does not work, and I have no idea of a workaround.
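
For illustration, the rule I have in mind would look roughly like this (a sketch; the tool names, times and the walltime param are assumptions, not working code for any particular cluster):

from galaxy.jobs import JobDestination

WALLTIMES = ["0:10:00", "24:00:00", "168:00:00"]  # 10min, 1day, 1week
ENTRY_STEP = {"trinity": 2, "hisat2": 1}  # assumed per-tool entry points

def escalating_destination(tool_id):
    step = ENTRY_STEP.get(tool_id, 0)
    return _walltime_destination(step)

def _walltime_destination(step):
    dest = JobDestination(
        id="walltime_step_%d" % step,
        runner="univa",
        params={"walltime": WALLTIMES[step]},  # assumed runner param
    )
    if step + 1 < len(WALLTIMES):
        # this is the part that currently gets lost when the job
        # re-enters the queue
        dest["resubmit"] = [dict(
            condition="walltime_reached",
            destination="walltime_step_%d" % (step + 1),
        )]
    return dest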

bernt-matthias avatar Feb 02 '22 12:02 bernt-matthias

With the current state, some job properties seemingly get lost when resubmitting, e.g. Job.destination_params. Without this change, the destination_params of the previous submission are available.

bernt-matthias avatar Mar 21 '22 18:03 bernt-matthias