
Chaining is hard to track

Open adamnovak opened this issue 1 year ago • 2 comments

Right now, Job ~has both a chainedJobs field, which is a list of str() on each job in the chain in order, used by the stats and logging system to dump the worker's log to per-job files, and also a merged_jobs field which holds job store IDs for jobs in chains, to allow deleting files that belonged to them. This is duplicative; we should have one collection with all the information we need about jobs that chained together.~ (Fixed in #4737)

Also, we report some reduced job name information in the message bus, but that information can change between when a job is issued and when it completes, because of chaining. The job might issue with one name, but chain to a different job with a different name under the same job store ID, and come back with the name of the last job in the chain. This makes the message bus hard to interpret.
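To make the mismatch concrete, here is a minimal sketch of a consumer that pairs issue and completion events by job store ID. The `BusEvent` type, its fields, and the `"issued"`/`"completed"` kinds are hypothetical stand-ins for illustration, not Toil's actual message bus types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusEvent:
    """Hypothetical, simplified bus event (not one of Toil's real message types)."""
    kind: str          # "issued" or "completed"
    job_store_id: str  # stays the same across a chain
    job_name: str      # may change when the job chains to another job


def find_renamed_jobs(events):
    """Return {job_store_id: (issued_name, completed_name)} where the names differ."""
    issued = {}
    renamed = {}
    for event in events:
        if event.kind == "issued":
            issued[event.job_store_id] = event.job_name
        elif event.kind == "completed":
            start_name = issued.get(event.job_store_id)
            if start_name is not None and start_name != event.job_name:
                renamed[event.job_store_id] = (start_name, event.job_name)
    return renamed


events = [
    BusEvent("issued", "job-1", "CWLWrapperJob"),
    BusEvent("completed", "job-1", "CWLJob"),      # chained: same ID, new name
    BusEvent("issued", "job-2", "Standalone"),
    BusEvent("completed", "job-2", "Standalone"),  # no chaining
]
renamed = find_renamed_jobs(events)
print(renamed)
```

Anything that groups events by name instead of by ID would see `CWLWrapperJob` start and never finish, and `CWLJob` finish without ever starting.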

Also also, jobs can delete themselves after chaining to other jobs, and never come back to the leader to report the jobs that were chained. So some jobs can go through their whole life cycle without the leader noticing they ever existed. This is good for efficiency, but bad if someone wants a coherent log of the history of all their WDL or CWL tasks. It seems like a CWLWrapperJob could create a CWLJob, chain to it, delete it, delete itself, and never tell the leader.

We should think through this system better. We should have one place to record all the jobs a job chained to, and that information should survive the job's deletion, so the leader can publish a full inventory of every job that ran to the message bus.
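One possible shape for that single record, as a rough sketch only: `ChainEntry`, `ChainHistory`, and `ToyLeader` are hypothetical names invented here, not existing Toil classes, and the real design might look quite different. The idea is that the worker appends to one ordered list as it chains, and hands the list to the leader before any job in the chain is deleted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ChainEntry:
    """One job that ran as part of a chain (hypothetical record)."""
    job_store_id: str
    display_name: str


@dataclass
class ChainHistory:
    """Single ordered collection describing every job in one chain."""
    entries: List[ChainEntry] = field(default_factory=list)

    def record(self, job_store_id: str, display_name: str) -> None:
        self.entries.append(ChainEntry(job_store_id, display_name))

    def names(self) -> List[str]:
        # What a logging/stats consumer would want.
        return [entry.display_name for entry in self.entries]

    def ids(self) -> List[str]:
        # What a file-cleanup consumer would want.
        return [entry.job_store_id for entry in self.entries]


class ToyLeader:
    """Toy stand-in for the leader; keeps histories after the jobs are gone."""

    def __init__(self) -> None:
        self.inventory: Dict[str, ChainHistory] = {}

    def report(self, head_id: str, history: ChainHistory) -> None:
        self.inventory[head_id] = history


history = ChainHistory()
history.record("job-1", "CWLWrapperJob")
history.record("job-2", "CWLJob")

leader = ToyLeader()
leader.report("job-1", history)  # sent before the chained jobs delete themselves
print(leader.inventory["job-1"].names())
```

Keeping names and IDs in the same entries means one collection can serve both the per-job logging use and the file-deletion use.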

Alternately, once we unify the representations of chaining, we could do more work to hide the Toil jobs from CWL and WDL users, and send some kind of structured information about the CWL and WDL tasks themselves back to the leader in real time.

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-1474

adamnovak • Jan 04 '24 21:01

➤ Adam Novak commented:

We do now send WDL task logs back to the leader when tasks run, so you can see your tasks' individual logs as evidence that they ran.

unito-bot • May 07 '24 17:05

➤ Adam Novak commented:

Maybe we want to replace the overwrites in chaining with a pointer field on the JobDescription?

Or maybe we could change job store IDs from referring to jobs to referring to some kind of job slot.

Or maybe we could come up with a hierarchical way to refer to e.g. the second job in a chain, for toil debug-job to use.
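As a rough illustration of the hierarchical option, the sketch below parses a reference made of a job store ID plus a position in the chain. The `#` separator, the 0-based position, and the `ChainRef`/`parse_chain_ref` names are all assumptions made up for this example; this is not syntax that `toil debug-job` accepts today, and it assumes job store IDs never contain `#`.

```python
from typing import NamedTuple


class ChainRef(NamedTuple):
    """Hypothetical reference to one job within a chain."""
    job_store_id: str
    position: int  # 0-based; 0 is the job that was originally issued


SEPARATOR = "#"  # assumption: never appears inside a job store ID


def parse_chain_ref(text: str) -> ChainRef:
    """Parse 'ID' or 'ID#N' into a ChainRef; a bare ID means position 0."""
    job_id, sep, index = text.rpartition(SEPARATOR)
    if not sep:
        # No separator at all: the whole string is the job store ID.
        return ChainRef(text, 0)
    if not job_id or not index.isdecimal():
        raise ValueError(f"Malformed chain reference: {text!r}")
    return ChainRef(job_id, int(index))


# The second job in a chain is position 1 under the 0-based assumption.
second = parse_chain_ref("kind-CWLJob/instance-abc#1")
print(second)
```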

unito-bot • Jul 23 '24 17:07