Dashboard support in Kueue

Open kerthcet opened this issue 2 years ago • 20 comments

What would you like to be added:

It would be great to have insight into what our queueing system looks like in real time:

  • for administrators, it helps to understand the total resource usage across cluster queues and whether we should group them into a cohort
  • for batch users, it gives an overview of job queueing: how many jobs are pending scheduling and how long jobs have been waiting.

Overall, it's a great enhancement, especially for production environments.

Why is this needed:

It is a big enhancement that gives great insight into the kueue system.

Completion requirements:

This enhancement requires the following artifacts:

  • [x] Design doc
  • [ ] API change
  • [x] Docs update

The artifacts should be linked in subsequent comments.

Some suggestions:

  • For administrators:
    • clusterQueue resource groups
    • clusterQueue resource utilizations
    • localQueues
    • cohorts
  • For users:
    • jobs with their status

kerthcet avatar Jul 03 '23 09:07 kerthcet

Do you have a high level idea of how to get there?

For example, would metrics + grafana yamls be enough for the administrative side?

For end users, grafana certainly wouldn't be viable. But what could be the MVP that would keep kueue largely non-opinionated and reusable (so you can integrate it with your own UI, if you already have one)? Could we offer a CLI instead?

alculquicondor avatar Jul 04 '23 14:07 alculquicondor

Kueue already spits out prometheus metrics. Building a UI based on that can be useful and the UI should be optional to deploy.

I do wonder if it would be more useful for us to provide a general-purpose grafana dashboard and make it available at https://grafana.com/grafana/dashboards/
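To illustrate what building on the existing metrics could look like, here is a minimal sketch that sums pending workloads per cluster queue from a scrape in the Prometheus text exposition format. The metric name `kueue_pending_workloads`, its `cluster_queue` and `status` labels, and the sample values are assumptions for illustration, so check them against the metrics your Kueue version actually exports. The parser is deliberately simplified and does not handle escaped quotes or braces inside label values.

```python
import re
from collections import defaultdict

# Sample scrape in the Prometheus text exposition format.
# Metric name, label names, and values are assumed for illustration only.
SAMPLE_SCRAPE = """\
# HELP kueue_pending_workloads Number of pending workloads per cluster queue.
# TYPE kueue_pending_workloads gauge
kueue_pending_workloads{cluster_queue="team-a",status="active"} 4
kueue_pending_workloads{cluster_queue="team-a",status="inadmissible"} 1
kueue_pending_workloads{cluster_queue="team-b",status="active"} 2
"""

# Matches lines of the form: name{label="value",...} number
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def pending_by_cluster_queue(scrape, metric="kueue_pending_workloads"):
    """Sum the samples of `metric` per `cluster_queue` label value."""
    totals = defaultdict(float)
    for line in scrape.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        totals[labels.get("cluster_queue", "")] += float(match.group("value"))
    return dict(totals)


if __name__ == "__main__":
    for queue, count in sorted(pending_by_cluster_queue(SAMPLE_SCRAPE).items()):
        print(f"{queue}: {count:g} pending")
```

A UI or a Grafana panel would normally query Prometheus rather than parse scrapes by hand; the point here is only that a per-queue pending count is already derivable from metrics.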

moficodes avatar Jul 07 '23 17:07 moficodes

Building on metrics is helpful, I think, but the dashboard is more than that. It would display basic information about the system: how many queues there are, what their names are, and how many jobs are in each queue. It could also be interactive. We can get the information directly via the APIs, or we can keep a lightweight embedded database, such as sqlite, as a cache.
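As a rough sketch of the sqlite cache idea, the snippet below keeps the latest known state of each workload in an in-memory database and answers a typical dashboard question from it. The table layout, the function names, and the `Pending`/`Admitted` status strings are hypothetical, not part of Kueue; a real implementation would fill the cache from API watch events.

```python
import sqlite3


def open_cache():
    """Create an in-memory cache with one row per workload (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE workloads (
            namespace   TEXT NOT NULL,
            name        TEXT NOT NULL,
            local_queue TEXT NOT NULL,
            status      TEXT NOT NULL,
            PRIMARY KEY (namespace, name)
        )
        """
    )
    return conn


def upsert_workload(conn, namespace, name, local_queue, status):
    """Insert a workload, or overwrite the cached row if it already exists."""
    conn.execute(
        "INSERT OR REPLACE INTO workloads (namespace, name, local_queue, status) "
        "VALUES (?, ?, ?, ?)",
        (namespace, name, local_queue, status),
    )


def pending_per_queue(conn):
    """Return {local_queue: number of workloads whose status is 'Pending'}."""
    rows = conn.execute(
        "SELECT local_queue, COUNT(*) FROM workloads "
        "WHERE status = 'Pending' GROUP BY local_queue ORDER BY local_queue"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    cache = open_cache()
    upsert_workload(cache, "ml", "train-1", "ml-queue", "Pending")
    upsert_workload(cache, "ml", "train-2", "ml-queue", "Pending")
    upsert_workload(cache, "ml", "train-1", "ml-queue", "Admitted")  # state change
    print(pending_per_queue(cache))
```

The trade-off against calling the APIs directly is the usual one for a cache: queries become cheap and flexible, but the cache has to be kept in sync with the cluster.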

We may need some frontend volunteers if we want to finish this work. As an MVP, IMHO, it should include:

  • Most of the API objects (clusterQueue, localQueue, Job, resourceFlavor, workload) at least, including their relationships
  • Some exported metrics
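The relationships mentioned in the list above could be modelled roughly as follows. This is a simplified sketch, not the Kueue API: the dataclasses and field names are hypothetical stand-ins that only capture the links a dashboard would need (a workload points at a localQueue, a localQueue points at a clusterQueue, and a clusterQueue may belong to a cohort).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterQueue:
    name: str
    cohort: str  # empty string when the queue belongs to no cohort


@dataclass(frozen=True)
class LocalQueue:
    namespace: str
    name: str
    cluster_queue: str  # name of the ClusterQueue this queue points at


@dataclass(frozen=True)
class Workload:
    namespace: str
    name: str
    local_queue: str  # name of a LocalQueue in the same namespace


def workloads_per_cluster_queue(local_queues, workloads):
    """Count workloads per cluster queue by following workload -> localQueue -> clusterQueue."""
    lq_to_cq = {(lq.namespace, lq.name): lq.cluster_queue for lq in local_queues}
    counts = defaultdict(int)
    for wl in workloads:
        cq = lq_to_cq.get((wl.namespace, wl.local_queue))
        if cq is not None:  # ignore workloads that reference an unknown queue
            counts[cq] += 1
    return dict(counts)


def cluster_queues_per_cohort(cluster_queues):
    """Group cluster queue names by cohort, skipping queues without one."""
    cohorts = defaultdict(list)
    for cq in cluster_queues:
        if cq.cohort:
            cohorts[cq.cohort].append(cq.name)
    return {cohort: sorted(names) for cohort, names in cohorts.items()}


if __name__ == "__main__":
    cqs = [ClusterQueue("team-a", "research"), ClusterQueue("team-b", "research")]
    lqs = [LocalQueue("ml", "ml-queue", "team-a"), LocalQueue("etl", "etl-queue", "team-b")]
    wls = [Workload("ml", "train-1", "ml-queue"), Workload("ml", "train-2", "ml-queue")]
    print(cluster_queues_per_cohort(cqs))
    print(workloads_per_cluster_queue(lqs, wls))
```

An administrator view would mostly use the cohort and clusterQueue grouping, while a user view would start from the workloads in their own namespace.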

kerthcet avatar Jul 10 '23 07:07 kerthcet

+1000, this fills a real experience gap, and I would be happy to review proposals.

@kerthcet maybe we start by looking at similar batch schedulers and see what "screens" they offer to inform and help seed what we need to build?

ahg-g avatar Jul 11 '23 17:07 ahg-g

@kerthcet maybe we start by looking at similar batch schedulers and see what "screens" they offer to inform and help seed what we need to build?

Yes, that's a good approach. Let me do some research first and then I'll share it with you all. I know YuniKorn has a dashboard. Also cc @BinL233

kerthcet avatar Jul 18 '23 07:07 kerthcet

@ahg-g it's a core feature which will make kueue easy to use. An alternative would be Airflow, whose UI looks like the image below. It contains task status, code, and audit logs, which are useful for users inspecting their jobs. Besides, Airflow integrates with identity providers such as LDAP or OIDC and provides access control and permission management features.

However, Airflow doesn't provide an API to submit one-time jobs such as ML training jobs, which are the core application for kueue.

zeddit avatar Nov 03 '23 11:11 zeddit

cc @B1F030 we also did some research on the popular queueing systems. I think we can provide a summary of the essential elements of a dashboard, or even a prototype. Can you help with this @B1F030?

kerthcet avatar Nov 06 '23 02:11 kerthcet

Hi guys, I'd like to get involved in the prototyping part of the dashboard design and provide a prototype in something like Figma.

samzong avatar Nov 07 '23 03:11 samzong

Thanks @samzong. We can provide a base design, share it with the community for feedback, and then get into development. Any concerns? @alculquicondor

kerthcet avatar Nov 07 '23 03:11 kerthcet

Maybe we can start with a list of views you would like to have and do priority sorting

alculquicondor avatar Nov 07 '23 13:11 alculquicondor

Maybe we can start with a list of views you would like to have and do priority sorting

@B1F030 is doing this.

kerthcet avatar Nov 08 '23 01:11 kerthcet

Hey folks, https://github.com/armadaproject/armada has a UI in the form of lookout.

Our demo UI is here: https://ui.demo.armadaproject.io/

Let us know what you think of it - I think many parts of it could be suitable for lookout and we would be interested in contributing.

Thanks!

Sharpz7 avatar Dec 11 '23 02:12 Sharpz7

Thanks @Sharpz7, that's helpful, and we have a general idea now. @samzong is doing the prototyping. Once we're done, we'll share a Google Doc/Figma with you all. Hope to work together.

kerthcet avatar Dec 11 '23 02:12 kerthcet

Great, Thank you :))

Sharpz7 avatar Dec 11 '23 03:12 Sharpz7

@Sharpz7 what is a lookout in this context?

alculquicondor avatar Dec 11 '23 15:12 alculquicondor

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 10 '24 16:03 k8s-triage-robot

/remove-lifecycle stale

tenzen-y avatar Mar 10 '24 23:03 tenzen-y