
Workflow task timings view

Open JAllen42 opened this issue 2 years ago • 9 comments

This would aim to give users the information needed to understand the durations of different tasks, including how they vary. This would hopefully help when developing workflows, such as when trying to identify bottlenecks and other places to make changes in order to improve the throughput or resource usage of the workflow if it is run again.

The basics of this view would be to display the durations of tasks in workflows after running them. There are different ways you could display these timings (using individual values or their statistics) and different ways you could plot these. Which display you want would probably depend on what you are trying to find out from the data. Therefore it might be worthwhile to give users a choice within this one view, keeping in mind the balance between complexity and flexibility.

This is very similar to #262.

Additional context

Ultimately it would be good to help users understand when a task might run, how long for and how these depend on both the task prerequisites and individual timings. This may be best achieved by something similar to a Gantt view at some point in the future.

Initial thoughts

Here are a couple of quick outlines of the sorts of plots that could be included:

BoxPlot

To look at the statistics you could display the distribution of timings using a box plot. The kind of things that the user could have control over include:

  • Filtering which tasks to display
  • What time is shown (ie run time, queue time, total)
  • What order to display results (longest or shortest duration first, means, medians, maxima etc)
  • Some way to display the values of the statistics (eg when hovering over/clicking on the plots)
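
As a rough sketch of the statistics one box would need, assuming the durations for a task are available as plain numbers of seconds (the function name `box_stats` and the returned keys are illustrative, not part of cylc-ui):

```python
from statistics import median, quantiles


def box_stats(durations):
    """Return the five-number summary used to draw one box."""
    values = sorted(durations)
    if not values:
        raise ValueError("no durations to summarise")
    if len(values) == 1:
        # quantiles() needs at least two data points
        q1 = q3 = values[0]
    else:
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": values[0],
        "q1": q1,
        "median": median(values),
        "q3": q3,
        "max": values[-1],
    }
```

The same dictionary could feed both the plot and whatever hover or click display shows the values, and any of its keys could serve as the sort order.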

Table

It might be simpler to first implement a table of the different statistics, though I'm not sure it is as useful as a boxplot, or necessary if a way to display the values is implemented.

TimeVsCyclePoint

To plot the individual values of the task durations, I can't think of anything better than doing it against the cycle point. This is potentially a lot of information to plot, especially for workflows with many tasks - it might make sense to give users more control over which tasks they plot.

We have time to work on implementing this view, which would probably mean adding different ways of displaying the timing data one by one, aiming to have the first one finished this financial year. There will be details I have overlooked for now, so all comments and questions are welcome.

JAllen42 avatar Dec 13 '22 19:12 JAllen42

Just to note a possible extension before I forget - as well as cycle point, ensemble members are a "nice to have" but not essential for the initial version, it's a future work thing.

scwhitehouse avatar Dec 14 '22 11:12 scwhitehouse

Looks good, cheers!

To look at the statistics you could display the distribution of timings using a box plot

I think the pagination is a good way to control the amount of data being visualised :+1:.

What time is shown (ie run time, queue time, total)

I wonder if there is a way of combining this for a box chart without making a mess of the error bars for a more at-a-glance view?

  • Filtering which tasks to display
  • What order to display results (longest or shortest duration first, means, medians, maxima etc)

I think it might be reasonable to implement the first two components using a table (the box chart could fit into a table if generated row-by-row). If so that would allow you to use the table's built-in sorting and might allow you to share the filtering code used in the existing table view.

What order to display results (longest or shortest duration first, means, medians, maxima etc)

Sounds good to me. You might want to constrain the cycle window a bit, i.e. allow the user to specify the range of cycles graphed, perhaps with a double-headed slider.


Questions:

  1. Which jobs will you consider?
    • If a task fails twice, then succeeds, do you want all three jobs, or just the last one?
    • If a task succeeds multiple times i.e. if it is re-triggered after succeeding (e.g. user may change a variable and re-run), do you want all the successful jobs, or the first successful job or the last successful job?
  2. Will you require any information about the tasks themselves?
    • Note: the submitted, started, finished times come from the jobs rather than the tasks.
    • Other than the name and state of the task will you be wanting any more information?
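
To make the point about job timestamps concrete, here is a minimal sketch of how the three times mentioned earlier (queue, run, total) could be derived from one job. It assumes each job record carries ISO 8601 timestamps under the keys `submitted`, `started` and `finished`; those key names are assumptions for illustration and may not match the real schema.

```python
from datetime import datetime


def job_timings(job):
    """Derive queue, run and total time (in seconds) from one job's timestamps."""
    submitted = datetime.fromisoformat(job["submitted"])
    started = datetime.fromisoformat(job["started"])
    finished = datetime.fromisoformat(job["finished"])
    return {
        "queue": (started - submitted).total_seconds(),
        "run": (finished - started).total_seconds(),
        "total": (finished - submitted).total_seconds(),
    }
```

A job that never started or finished would need handling separately, which this sketch does not attempt.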

oliver-sanders avatar Dec 14 '22 11:12 oliver-sanders

Just to note a possible extension before I forget - as well as cycle point, ensemble members are a "nice to have" but not essential for the initial version, it's a future work thing.

An interesting idea. We can use families to group tasks which might be useful for other purposes? At present there's nothing that Cylc can use to tell whether a family represents an ensemble or not but there may be options.

oliver-sanders avatar Dec 14 '22 12:12 oliver-sanders

@JAllen42 & @scwhitehouse

I'll send you invites to join one of our teams which will give you more powers in this repository to make development easier. Could you get anyone else who will be working on the repo to comment here and I'll invite them too (will check email addresses first!).

oliver-sanders avatar Dec 14 '22 12:12 oliver-sanders

Don't forget me!

ChrisPaulBennett avatar Dec 15 '22 08:12 ChrisPaulBennett

Thanks for the comments Oliver!

Which jobs will you consider?

My initial thought was that we wouldn't need failed jobs, but including the number of failures would be a nice thing to do and should work quite well for some of the options above at least, so all of them sounds preferable to me.

If there are multiple successes then I think it depends on the situation when deciding which one is more important (though it might usually be the last one?), so again all of them sounds better to me. How do current tools work in this situation, eg cylc report timings?

But for both situations if one of the options is significantly harder then choosing a different option sounds fine to me.

Will you require any information about the tasks themselves?

I don't think so, at least initially. Having number of cores/nodes used so that total resource usage could be displayed would be good, but from our discussions that sounds like quite a bit of work to make it work for different platforms, so would need more discussions about how that is going to be implemented.

JAllen42 avatar Dec 15 '22 17:12 JAllen42

all of them sounds better to me

Ok, the POC interface I put in on the draft PR does just that.

eg cylc report timings?

No idea TBH!

Having number of cores/nodes used so that total resource usage could be displayed

We have now got the full task configuration into the schema which means you can request this information BUT:

  • The info you'll receive will be what's written in the config which is likely but not necessarily the config the job was submitted with.
    • For most cases this is good enough but be aware it may differ from truth:
    • E.G. the workflow could have been restarted with an updated config, so jobs which ran before the restart will show the new settings not the ones they submitted with.
    • E.G. the job config could be broadcast at run time, overriding the defined config.
  • The directives will be in their native format (e.g. PBS / Slurm / whatever) so you'll need to provide your own abstraction.
    • We could locate this abstraction server-side and do it in Python.
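
A very rough sketch of what such an abstraction might look like in Python. Everything here is an assumption for illustration: the function name, the idea that directives arrive as a dict of name to value, the directive keys, and the choice to handle only Slurm's `--nodes` / `--ntasks` and a PBS `select=N:ncpus=M` style request. Real configurations vary a lot more than this.

```python
import re


def _to_int(value):
    """Return value as an int, or None if it is not a plain non-negative integer."""
    text = str(value).strip()
    return int(text) if text.isdigit() else None


def requested_resources(job_runner, directives):
    """Best-effort extraction of node and core counts from native directives.

    Returns None for any value that cannot be worked out.
    """
    nodes = cores = None
    if job_runner == "slurm":
        # Only plain integers are handled; ranges such as "2-4" give None.
        nodes = _to_int(directives.get("--nodes", ""))
        cores = _to_int(directives.get("--ntasks", ""))
    elif job_runner == "pbs":
        # Assumed form: {"-l select": "2:ncpus=36"}.
        # PBS chunks are treated as nodes here, which is a simplification.
        select = str(directives.get("-l select", ""))
        chunks = re.match(r"\s*(\d+)", select)
        ncpus = re.search(r"ncpus=(\d+)", select)
        if chunks:
            nodes = int(chunks.group(1))
            if ncpus:
                cores = nodes * int(ncpus.group(1))
    return {"nodes": nodes, "cores": cores}
```

Returning `None` rather than guessing keeps the caveat above visible: the UI can show "unknown" where the config cannot be interpreted.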

oliver-sanders avatar Dec 16 '22 11:12 oliver-sanders

I'll invite all three of you to one of our internal teams (once I've got access myself!).

In the meantime here's a link to an Element chat room which we use to talk about cylc-ui development which you're welcome to join.

https://matrix.to/#/#cylc-web-gui:matrix.org

oliver-sanders avatar Dec 16 '22 11:12 oliver-sanders

Data structure

Right, next thing to work out is what structure you'll need the data in for these displays.

For the three displays outlined:

  1. Box chart (averages)
  2. Timings table (averages)
  3. Timings graph (point-data)

The data requirements of (1) & (2) are the same, they require averages which could be computed server-side to reduce load. The requirements of (3) are a little different as it actually requires data on individual jobs.

Because of this it might be logical to split this work into two views, as it's easier to keep the data model view-centric because of the way subscriptions and housekeeping work in cylc-ui.

See what you think...

(1) & (2) - task averages

Although it could be done client-side it would be more efficient to compute these averages server-side. It might be worth including a failure rate metric with these averages to help identify flaky tasks.
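
For illustration, a minimal sketch of the per-task summary, including the failure rate. It assumes each job is a dict with a `state` of `succeeded` or `failed` and a `run_time` in seconds; these names and the returned keys are hypothetical, and averaging only the successful jobs is one possible choice rather than a decision made in this thread.

```python
from statistics import mean


def summarise_task(jobs):
    """Compute the average run time and failure rate for one task's jobs."""
    if not jobs:
        return {"mean_run_time": None, "failure_rate": None, "job_count": 0}
    failed = sum(1 for job in jobs if job["state"] == "failed")
    # Average over successful jobs only, so early failures do not skew it.
    run_times = [job["run_time"] for job in jobs if job["state"] == "succeeded"]
    return {
        "mean_run_time": mean(run_times) if run_times else None,
        "failure_rate": failed / len(jobs),
        "job_count": len(jobs),
    }
```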

Because the timings information is quite small we can happily load a fairly large number of tasks in the UI. However, it is not uncommon for workflows to contain tens of thousands of tasks (or even more :open_mouth:) so we will need to cut this off at a configured limit. E.G. we could request the data paginated with up to 1000 tasks per page. Note that if there are more than 1000 tasks and the user sorts the table, we would need to re-send the request with the new sort order, because the client only holds one page of the results.

I think this data structure would make sense:

tasks = [
  {
    'name': '<task-name>',
    'configuration': {
      // the [runtime] configuration of the task as defined in the flow.cylc file
      '<key>': '<value>',
      '<section-name>': {
        '<key>': '<value>',
      },
    },
    'timings': {
      // computed averages
      '<key>': '<value>',
    }
  }
]

Takeaways:

  • Paginated request to reduce the number of requested tasks.
  • Paginated table to reduce the number of displayed items.
  • Compute averages server-side.

(3) - job data

For (3) you need access to individual jobs (currently this is what #1171 does), possibly filtered by:

  • Only show successful jobs.
  • Only show the last job.

This view would benefit more from being updated live as the data changes. Luckily this is fairly straightforward: #1171 already does this. It might be worth finding a plotting library with Vue integration as this will likely handle live updates better.

tasks = {
  '<task-name>': {
    'configuration': {
      // the [runtime] configuration of the task as defined in the flow.cylc file
      '<key>': '<value>',
      '<section-name>': {
        '<key>': '<value>',
      },
    },
    'jobs': [
      // list of all jobs - filter client-side as needed
      {
        '<key>': '<value>',
      },
    ]
  },
}

Takeaways:

  • UI to receive all jobs, filtering to be done client-side.
  • Only request data on a task-by-task basis.
  • Try to make this view reactive (i.e. update to live data).
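
The two filters listed above could look something like the sketch below. It is written in Python for consistency with the server-side suggestion, though the UI would do the equivalent in JavaScript. The keys `state`, `cycle_point` and `submit_num` are assumed names, and "last job" is interpreted here as the highest submit number per cycle point.

```python
def select_jobs(jobs, only_succeeded=False, only_last=False):
    """Filter one task's job list according to the two view options."""
    selected = list(jobs)
    if only_succeeded:
        selected = [job for job in selected if job["state"] == "succeeded"]
    if only_last:
        # Keep the job with the highest submit number for each cycle point.
        latest = {}
        for job in selected:
            current = latest.get(job["cycle_point"])
            if current is None or job["submit_num"] > current["submit_num"]:
                latest[job["cycle_point"]] = job
        selected = list(latest.values())
    return selected
```

With both options on, this gives the last successful job per cycle, which covers the "fails twice, then succeeds" and "re-triggered after succeeding" cases raised earlier.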

Server-Side Interfaces

  • Timings information:
    • Need to compute averages server-side and put these into the schema.
    • Could consider opening this up as a subscription to provide live data, but I don't think that's so important to start with, we can give users a refresh button.
  • Task information:
    • Needs pagination controls (i.e. give me the first N tasks).
    • Needs to provide sorting controls (e.g. sort by average job run time).
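
A minimal sketch of the sorting and pagination controls, using the task structure proposed above (`name`, `timings`). The function name, its parameters and the `mean_run_time` key used in the example are hypothetical, not an existing interface.

```python
def task_page(tasks, sort_key, page=0, page_size=1000, descending=True):
    """Return one page of tasks, sorted by one of the computed timings."""
    if page < 0 or page_size < 1:
        raise ValueError("page must be >= 0 and page_size must be >= 1")
    ordered = sorted(
        tasks,
        key=lambda task: task["timings"][sort_key],
        reverse=descending,
    )
    start = page * page_size
    return ordered[start:start + page_size]
```

For example, `task_page(tasks, "mean_run_time", page=0, page_size=1000)` would return the 1000 slowest tasks. Because the sort happens before the slice, a different sort order can change which tasks land on a page, which is why the UI has to re-send the request rather than re-sort what it already holds.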

Howz that sound?

oliver-sanders avatar Dec 16 '22 13:12 oliver-sanders

Closed by #1254, #1510, thanks all!

oliver-sanders avatar May 17 '24 12:05 oliver-sanders