kueue
Dashboard support in Kueue
What would you like to be added:
It would be great to have real-time insight into the state of our queueing system:
- for administrators, it helps to understand the total resource usage across cluster queues and whether they should be grouped into a cohort
- for batch users, it gives an overview of job queueing: how many jobs are pending scheduling and how long jobs have been waiting.
Overall, it's a great enhancement, especially for production environments.
Why is this needed:
A big enhancement that gives great insight into the Kueue system.
Completion requirements:
This enhancement requires the following artifacts:
- [x] Design doc
- [ ] API change
- [x] Docs update
The artifacts should be linked in subsequent comments.
Some suggestions here:
- For administrators:
  - clusterQueue resource groups
  - clusterQueue resource utilizations
  - localQueues
  - cohorts
- For users:
  - jobs with their status
Do you have a high level idea of how to get there?
For example, would metrics + grafana yamls be enough for the administrative side?
For end users, certainly grafana wouldn't be viable. But what could be the MVP that would keep kueue largely non-opinionated and reusable (so you can integrate it with your own UI, if you already have one). Could we offer a CLI instead?
Kueue already spits out prometheus metrics. Building a UI based on that can be useful and the UI should be optional to deploy.
I do wonder if it would be more useful for us to provide a general-purpose Grafana dashboard and make it available at https://grafana.com/grafana/dashboards/
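To make the metrics-based option concrete, here is a minimal sketch of how a dashboard backend might total pending workloads per cluster queue from a Prometheus text scrape. The metric name `kueue_pending_workloads`, its `cluster_queue` and `status` labels, and the sample values are assumptions for illustration; check the metrics your Kueue version actually exports. The parser is deliberately simple and does not handle label values that contain `}` or escaped quotes.

```python
import re
from collections import defaultdict

# Hypothetical scrape output. The metric name, label names, and values
# are assumptions for illustration, not taken from a real cluster.
SAMPLE = """\
# HELP kueue_pending_workloads Number of pending workloads
# TYPE kueue_pending_workloads gauge
kueue_pending_workloads{cluster_queue="team-a",status="active"} 3
kueue_pending_workloads{cluster_queue="team-a",status="inadmissible"} 2
kueue_pending_workloads{cluster_queue="team-b",status="active"} 1
"""

# Matches lines of the form: metric_name{label="value",...} number
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def pending_per_queue(text, metric="kueue_pending_workloads"):
    """Sum the samples of `metric` per cluster_queue label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get("cluster_queue", "")] += float(match.group("value"))
    return dict(totals)


if __name__ == "__main__":
    for queue, count in sorted(pending_per_queue(SAMPLE).items()):
        print(f"{queue}: {count:g} pending")
```

In a Grafana-only setup the same aggregation would normally be a query on the Prometheus side; the sketch only shows that the administrative view needs nothing beyond the exported metrics.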
Building on metrics is helpful, I think, but the dashboard is more than that: it would display basic information about the system (how many queues there are, what their names are, how many jobs are in each queue), and it could be interactive. We can get the information via the APIs directly, or we can keep a lightweight database such as SQLite inside as a cache.
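As a rough sketch of the "lightweight database as a cache" idea, the snippet below stores workload records in an in-memory SQLite database and answers a typical dashboard question (how many workloads are in each queue with a given status). The table layout, the status strings, and the sample rows are hypothetical; a real implementation would fill the cache from the Kubernetes API rather than from a hard-coded list.

```python
import sqlite3


def build_cache(workloads):
    """Create an in-memory cache from (name, local_queue, status) tuples.

    The schema is a hypothetical simplification for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workloads ("
        "name TEXT PRIMARY KEY, "
        "local_queue TEXT NOT NULL, "
        "status TEXT NOT NULL)"
    )
    # INSERT OR REPLACE lets a later record for the same workload overwrite
    # the earlier one, which is what a cache refresh needs.
    conn.executemany("INSERT OR REPLACE INTO workloads VALUES (?, ?, ?)", workloads)
    conn.commit()
    return conn


def count_by_queue(conn, status):
    """Return {local_queue: number of workloads with the given status}."""
    rows = conn.execute(
        "SELECT local_queue, COUNT(*) FROM workloads "
        "WHERE status = ? GROUP BY local_queue ORDER BY local_queue",
        (status,),
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    cache = build_cache([
        ("train-1", "ml-queue", "Pending"),
        ("train-2", "ml-queue", "Admitted"),
        ("etl-1", "data-queue", "Pending"),
        ("train-1", "ml-queue", "Admitted"),  # status update for train-1
    ])
    print(count_by_queue(cache, "Pending"))
    print(count_by_queue(cache, "Admitted"))
```

The trade-off versus querying the APIs directly is freshness: a cache makes interactive views cheap, but it has to be kept in sync with the cluster.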
We may need some frontend volunteers if we want to finish this work. As an MVP, IMHO, it should include
- Most of the API objects (clusterQueue, localQueue, Job, resourceFlavor, workload) at least, including their relationships
- Some exported metrics
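To illustrate what "including their relationships" could mean for a first view, here is a small sketch that models three of the objects and groups workloads under the cluster queue they ultimately feed into. The classes are simplified stand-ins with only the linking fields (a workload names a local queue in its namespace, a local queue points at a cluster queue, a cluster queue may belong to a cohort); they are not the real Kueue API types, and all object names are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ClusterQueue:
    name: str
    cohort: Optional[str] = None  # cluster queues may share a cohort


@dataclass(frozen=True)
class LocalQueue:
    name: str
    namespace: str
    cluster_queue: str  # name of the ClusterQueue this queue points to


@dataclass(frozen=True)
class Workload:
    name: str
    namespace: str
    queue_name: str  # name of a LocalQueue in the same namespace


def workloads_by_cluster_queue(
    local_queues: List[LocalQueue], workloads: List[Workload]
) -> Dict[str, List[str]]:
    """Resolve workload -> local queue -> cluster queue and group by the latter."""
    lq_to_cq = {(lq.namespace, lq.name): lq.cluster_queue for lq in local_queues}
    grouped: Dict[str, List[str]] = {}
    for wl in workloads:
        cq = lq_to_cq.get((wl.namespace, wl.queue_name))
        if cq is not None:  # skip workloads whose local queue is unknown
            grouped.setdefault(cq, []).append(wl.name)
    return grouped


if __name__ == "__main__":
    lqs = [
        LocalQueue("ml-queue", "team-a", "gpu-cq"),
        LocalQueue("batch-queue", "team-b", "gpu-cq"),
        LocalQueue("etl-queue", "team-b", "cpu-cq"),
    ]
    wls = [
        Workload("train-1", "team-a", "ml-queue"),
        Workload("sim-1", "team-b", "batch-queue"),
        Workload("etl-1", "team-b", "etl-queue"),
        Workload("orphan", "team-c", "missing-queue"),
    ]
    print(workloads_by_cluster_queue(lqs, wls))
```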
+1000, this is a much needed experience gap, I would be happy to review proposals.
@kerthcet maybe we start by looking at similar batch schedulers and see what "screens" they offer to inform and help seed what we need to build?
Yes, that's a good approach. Let me do some research first and then I'll share it with you guys. I know YuniKorn has a dashboard. Also cc @BinL233
@ahg-g
it's a core feature which will make kueue easy to use.
An alternative would be Airflow, and its UI looks like below
It contains task status, code, and audit logs, which would be useful information for users inspecting their jobs.
Besides, Airflow integrates with IdPs such as LDAP or OIDC and provides access control and permission management features.
However, Airflow doesn't provide an API to submit one-time run jobs like ML training jobs, which is the core application for kueue.
cc @B1F030 we also did some research around the popular queueing systems, I think we can provide a summary about the essential elements in dashboard, or even a prototype. Can you help with this @B1F030 ?
Hi guys, I'd like to get involved in the prototyping part of the dashboard design and provide a prototype in something like Figma.
Thanks @samzong. We can provide a base design, share it with the community for feedback, and then move on to development. Any concerns? @alculquicondor
Maybe we can start with a list of views you would like to have and do priority sorting
@B1F030 is doing this.
Hey folks, https://github.com/armadaproject/armada has a UI in the form of lookout.
Our demo UI is here: https://ui.demo.armadaproject.io/
Let us know what you think of it - I think many parts of it could be suitable for lookout and we would be interested in contributing.
Thanks!
Thanks @Sharpz7, that's helpful, and we have a general idea now. @samzong is doing the prototyping; once it's done, we'll share a Google Doc/Figma with you guys. Hope to work together.
Great, Thank you :))
@Sharpz7 what is a lookout in this context?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale