[WIP] Add new job states for better feedback to users

Open natefoo opened this issue 6 years ago • 30 comments

This adds 7 new job and dataset states:

  • waiting: The dataset has been picked up by a handler (only used by the db-skip-locked and db-transaction-isolation handler assignment methods) but its input datasets are not ready yet.
  • limited: The job is ready to run but is waiting due to concurrency limits.
  • dispatched: The job has been dispatched to a runner plugin.
  • submitted: The job has been submitted to the cluster (or Pulsar) but is not yet queued.
  • stagein: The job's inputs are staging in (Pulsar runner only).
  • stageout: The job's outputs are staging out (Pulsar runner only).
  • finishing: Job execution is complete and the job finish method is running.
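
To put them all in one place, here is a rough sketch of the state names as an enum; this is illustrative only, the real declarations live alongside the existing job/dataset states in lib/galaxy/model and look somewhat different:

```python
from enum import Enum


class JobState(str, Enum):
    """Illustrative sketch: the pre-existing states (abbreviated) plus the seven added here."""

    NEW = "new"                # not yet picked up by a handler
    WAITING = "waiting"        # picked up by a handler, inputs not ready (new assignment methods only)
    LIMITED = "limited"        # ready to run, held back by concurrency limits
    DISPATCHED = "dispatched"  # handed to a runner plugin
    SUBMITTED = "submitted"    # submitted to the cluster/Pulsar, not yet queued
    QUEUED = "queued"          # reported as queued by the DRM
    RUNNING = "running"
    STAGEIN = "stagein"        # Pulsar only: inputs staging in
    STAGEOUT = "stageout"      # Pulsar only: outputs staging out
    FINISHING = "finishing"    # job finish method running
    OK = "ok"
    ERROR = "error"
    DELETED = "deleted"
```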

The meaning of the new state has changed. If using the old handler assignment methods, it retains two of its old meanings: either the job hasn't been picked up by a handler or its inputs are not ready. If using one of the new methods, it only means that the job hasn't been picked up by a handler. In all assignment configs, limited jobs now get their own state (previously they were left in new).

With this change, the queued state is only set once the job has been reported as queued by the DRM.

In a healthy Galaxy you should rarely see some of these states: jobs lingering in new (unless using the old assignment methods) indicate handler job "grabber" problems, and jobs lingering in dispatched or submitted indicate handler runner plugin or DRM problems (submitted only if the destination is not Pulsar). This should help with using e.g. Grafana to detect problems with Galaxy handlers without having to run your own test jobs.
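
As a rough illustration of the monitoring idea (not part of this PR), something like the following could feed a Grafana panel. Grafana would normally run the SQL directly; the Python wrapper and connection URL here are placeholders just to keep the sketch self-contained:

```python
import sqlalchemy as sa

# Placeholder connection URL; point this at the Galaxy database.
engine = sa.create_engine("postgresql:///galaxy")

# States that should normally be transient. Jobs piling up in new, dispatched,
# or submitted suggest handler, runner plugin, or DRM problems respectively.
# A real panel would probably also filter on update_time to catch lingering jobs.
WATCHED_STATES = ["new", "waiting", "limited", "dispatched", "submitted", "queued"]

stmt = sa.text(
    "SELECT state, count(*) AS n FROM job WHERE state IN :states GROUP BY state"
).bindparams(sa.bindparam("states", expanding=True))

with engine.connect() as conn:
    for state, n in conn.execute(stmt, {"states": WATCHED_STATES}):
        print(f"{state}: {n}")
```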

I've also made the messages more informative, but am very open to ideas on better language for these. I want it to be clear to the user what they should do (mostly: wait).

Thanks for the client work, @dannon!

TODO (maybe in another PR):

  • [x] Add finishing state
  • [ ] Make Pulsar differentiate between submitted and queued
  • [x] Handle submitted vs. queued states in the Pulsar runner
  • [x] Add staging-in and staging-out states for Pulsar

natefoo avatar Apr 22 '19 18:04 natefoo

Moved to WIP - I forgot I wanted to add a finishing state.

natefoo avatar Apr 22 '19 20:04 natefoo

@dannon I added a few more states. Currently stagein uses the same style as submitted, and both stageout and finishing use the same style as running. I am kinda ambivalent on staging, maybe we could give them a file-transfery icon, or maybe they are fine as-is. For finishing can we increase the speed of the spinner maybe?

natefoo avatar Apr 22 '19 22:04 natefoo

If we're just adopting Pulsar states outright in Galaxy, I'm not sure of the benefit of renaming them. I trust you though, @natefoo 👍

jmchilton avatar Apr 24 '19 13:04 jmchilton

@jmchilton I did consider it...

natefoo avatar Apr 24 '19 14:04 natefoo

@dannon thanks!

natefoo avatar Apr 24 '19 14:04 natefoo

The latest API failure didn't start until my most recent commit, and it doesn't fail for me locally. Few of the tests I've been fixing have ever failed for me locally, so I'm not sure whether I broke something or not.

Also I guess I broke the scratchbook selenium test but I can't figure out how? This one does at least fail for me locally but looking at the error files I can't see why the scratchbook modal isn't appearing. The dataset is green and the eye is clicked.

natefoo avatar Apr 24 '19 18:04 natefoo

I've triggered a re-run for the failed API test. I haven't seen this test fail in a while, though this is a tricky one. It runs a workflow, and then runs it again with the job cache turned on. It tests that an equivalent job is found and that no new jobs are actually run, and that's the part that failed.

You may need to add all new non-error states to https://github.com/galaxyproject/galaxy/blob/fe2c14d3299c863e8d623a23f972944eebb93210/lib/galaxy/managers/jobs.py#L112 for this test to pass reliably

mvdbeek avatar Apr 24 '19 20:04 mvdbeek

... which you already did, but I think we also need a couple more, basically all that are not error or deleted
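
Roughly something like this, in other words (illustrative only; the actual query in lib/galaxy/managers/jobs.py is shaped a bit differently, and the constant names for the new states are assumptions):

```python
from galaxy import model

# Every state the job cache can treat as a potential match, i.e. basically
# everything that is not error or deleted. Constant names for the new states
# are assumed here.
REUSABLE_STATES = [
    model.Job.states.NEW,
    model.Job.states.WAITING,
    model.Job.states.LIMITED,
    model.Job.states.DISPATCHED,
    model.Job.states.SUBMITTED,
    model.Job.states.QUEUED,
    model.Job.states.RUNNING,
    model.Job.states.STAGEIN,
    model.Job.states.STAGEOUT,
    model.Job.states.FINISHING,
    model.Job.states.OK,
]


def filter_reusable_jobs(query):
    """Restrict a Job query to states the job cache should consider."""
    return query.filter(model.Job.state.in_(REUSABLE_STATES))
```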

mvdbeek avatar Apr 24 '19 20:04 mvdbeek

Is it possible to make APIs for tool wrappers to insert a progress bar just for the running state?

qiagu avatar Apr 25 '19 17:04 qiagu

@qiagu can you elaborate? (maybe in a different issue)

martenson avatar Apr 25 '19 17:04 martenson

I just opened a new ticket.

qiagu avatar Apr 25 '19 17:04 qiagu

#7822 is the new issue for @qiagu's question, just for future reference.

natefoo avatar Apr 25 '19 18:04 natefoo

@natefoo Fix one selenium test error and another one pops up. I'll take a look at that one now.

dannon avatar Apr 25 '19 18:04 dannon

A thought occurred to me that we should probably consider before adding all these new states: should we really be adding these states as both dataset and job states, or should we have essentially one "non-terminal" dataset state (or keep "new", "queued", and "running" just for backwards compatibility) and add the job state to the history and history update APIs? The client could then use the job state instead of the dataset state (or at least use the job state while the dataset state is non-terminal). Using the job state here directly is more correct than simply passing the job state down as the dataset state all the time. It might also cause less breakage in the future for things that check the dataset state against an expected list of states (as most of these bugs have been) - only the client really cares about these new states, since they are mainly for user feedback.

I vaguely recall a conversation that maybe putting job states in the history and history update APIs might be too expensive. So maybe correctness isn't worth it.

Thoughts?

natefoo avatar Apr 25 '19 18:04 natefoo

Regarding 'one non-terminal' dataset state -- these additional states may or may not be the right vector for conveying it, but the additional info we're presenting to the user, at least in terms of the state 'blurb' ('this dataset has gone to the upside down' or whatever), is definitely a plus here. And I don't know of a better way off the top of my head to get that to the user.

dannon avatar Apr 25 '19 18:04 dannon

I might've been confusing, so just in case: we'd still have all these fancy new states, we'd just set them only on the Job, not on both the Job and Dataset as we do now. This would require a fair amount of client and API changes to allow the client to use the job state.

Sticking with Dataset state is certainly the easy path since that’s how we already do it. Just wanted to get some opinions before I started polluting everyone’s databases.

natefoo avatar Apr 25 '19 18:04 natefoo

@natefoo Ahh! Yeah, I misunderstood what you meant; I do like the idea of just using the job state directly. I don't know if it's going to be best to add the job state directly to the history/contents API or simply progressively load the details via a job_id handle on the dataset that we then use to query the jobs API, but that's something we could experiment with to see how expensive it is.

dannon avatar Apr 25 '19 18:04 dannon

The selenium test that's failing is completely unrelated to the changes here, and is also failing in a separate PR (https://github.com/galaxyproject/galaxy/pull/7749). The one related selenium error was fixed in 59e4371. I don't know why it's failing yet, and it doesn't fail locally, but it shouldn't prevent merging this.

That said, @natefoo, think we want to bump this to 19.09 given the conversation above?

dannon avatar Apr 25 '19 22:04 dannon

Yep, definitely want to bump this. Thanks!

natefoo avatar Apr 25 '19 23:04 natefoo

Ugh... I'm sorry to throw you under the bus, but change_state is so expensive, and this adds new calls to it for states that will only exist for a very brief period of time. My intuition is this will significantly reduce Galaxy's already crummy throughput for large numbers of jobs (e.g. collection operations, batch workflows, etc.). I'm -0 on this PR in this form - if we could refactor change_state into some sort of compound SQL statement, or eliminate 90% of state tracking on datasets and offload it to job state tracking (or ideally both), I'd be much more eager to see this in.

If you want me to produce some throughput data to justify my position I will, but I'd rather spend the time coming up with a more optimal solution. Any idea how to implement delegation of dataset states to job states?

jmchilton avatar Sep 05 '19 13:09 jmchilton

Any idea how to implement delegation of dataset states to job states?

An SQL trigger maybe?

nsoranzo avatar Sep 05 '19 13:09 nsoranzo

An SQL trigger maybe?

Better than a Python loop (assuming we don't need to reload all the dataset association data into SQLAlchemy anyway...), but we would still have to update potentially thousands of rows in the transaction. It'd be better if there was just some link in the dataset that said "here is the job that has my state, unless I have a metadata error".

As a stopgap though - a trigger might really be a huge help and could probably be implemented independently of this PR.
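
For PostgreSQL, a very rough sketch of what such a trigger could look like (the table, column, and function names below are assumptions, not Galaxy's actual schema, and a real version would need the metadata-error caveat above):

```python
import sqlalchemy as sa

# Assumed schema: job.state is the source of truth, job_to_output_dataset links
# a job to its output HDAs, and the HDA points at the dataset row holding state.
TRIGGER_DDL = """
CREATE OR REPLACE FUNCTION propagate_job_state() RETURNS trigger AS $$
BEGIN
    -- A real version would skip datasets with e.g. failed metadata, per above.
    UPDATE dataset SET state = NEW.state
    WHERE id IN (
        SELECT hda.dataset_id
        FROM job_to_output_dataset jtod
        JOIN history_dataset_association hda ON hda.id = jtod.dataset_id
        WHERE jtod.job_id = NEW.id
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER job_state_propagate
AFTER UPDATE OF state ON job
FOR EACH ROW
WHEN (OLD.state IS DISTINCT FROM NEW.state)
EXECUTE PROCEDURE propagate_job_state();
"""

# Applied once, e.g. from a migration script; the connection URL is a placeholder.
engine = sa.create_engine("postgresql:///galaxy")
with engine.begin() as conn:
    conn.execute(sa.text(TRIGGER_DDL))
```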

jmchilton avatar Sep 05 '19 13:09 jmchilton

So, the question of moving state tracking to the job is exactly why this PR stalled (albeit not for performance reasons, but yeah, I see how this is potentially problematic). That said, I am not sure that the reverse (including job state in the query used to display history contents) will perform well either.

natefoo avatar Sep 06 '19 15:09 natefoo

What if every dataset association had a direct link to the job that specifies its state? Our in-python creating_job logic is definitely not going to cut it.

jmchilton avatar Sep 06 '19 15:09 jmchilton

Yes please, that would be great!

mvdbeek avatar Sep 06 '19 15:09 mvdbeek

Yeah, I think that was what I was waffling over - whether HDAs, LDDAs, and all other dataset instances should get a creating_job_id column, although this duplicates the job_to_output_* tables. IIRC there is already a dataset-referencing table that does this, so it didn't seem like a terrible change? I forget which one, though.

natefoo avatar Sep 06 '19 15:09 natefoo

Our in-python creating_job logic is definitely not going to cut it.

We could probably use a column property to move this into the SQLAlchemy layer. That should probably be more performant?
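
Something like this, maybe (the declarative classes and names below are purely illustrative, not Galaxy's actual mapping):

```python
from sqlalchemy import Column, ForeignKey, Integer, String, select
from sqlalchemy.orm import column_property, declarative_base

Base = declarative_base()


class Job(Base):
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    state = Column(String(64), index=True)


class JobToOutputDatasetAssociation(Base):
    __tablename__ = "job_to_output_dataset"
    id = Column(Integer, primary_key=True)
    job_id = Column(Integer, ForeignKey("job.id"), index=True)
    dataset_id = Column(Integer, ForeignKey("history_dataset_association.id"), index=True)


class HistoryDatasetAssociation(Base):
    __tablename__ = "history_dataset_association"
    id = Column(Integer, primary_key=True)
    # SQL-level replacement for the in-Python creating_job lookup: the id of the
    # job that produced this dataset, resolved as a correlated scalar subquery,
    # so the client/API can look up the job (and its state) without a Python loop.
    creating_job_id = column_property(
        select(JobToOutputDatasetAssociation.job_id)
        .where(JobToOutputDatasetAssociation.dataset_id == id)
        .correlate_except(JobToOutputDatasetAssociation)
        .limit(1)
        .scalar_subquery()
    )
```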

mvdbeek avatar Sep 06 '19 15:09 mvdbeek

although this duplicates the job_to_output_* tables

It is encoding slightly different information (in light of stuff like copied datasets, etc.) - so I wouldn't really call it duplicating those at all. It would also bring collections and datasets more in line - since collections get their state this way. I might not call it creating_job_id though - but something like that.

jmchilton avatar Sep 06 '19 15:09 jmchilton

since collections get their state this way

Ah yeah, that was what I was thinking of.

natefoo avatar Sep 06 '19 20:09 natefoo

https://github.com/galaxyproject/galaxy/pull/10222 being merged might mean we can make some progress here ... but pushing to 21.09 for now

mvdbeek avatar Apr 07 '21 15:04 mvdbeek