Albin Severinson
I think it's best we return an error immediately on this `if auth fails because creds are invalid, error immediately`. (If I've said differently previously, I've changed my mind :)...
This is a fairly tricky ticket that'll require careful investigation. Gogo is allegedly significantly faster than Google proto. There are also other proto compilers out there that may be viable....
This isn't a priority. I'm closing the ticket. If it becomes important in the future, we can re-open it.
I've made a prototype PR for this: https://github.com/G-Research/armada/pull/1342/files
Here's an overview of the various job types I've seen. **batch/v1/job** - [Standard Kubernetes job](https://pkg.go.dev/k8s.io/api/batch/v1#Job) **Volcano** - [PodGroup](https://volcano.sh/en/docs/podgroup/) - [Job](https://volcano.sh/en/docs/vcjob/) and on [GitHub](https://github.com/volcano-sh/volcano/blob/master/docs/design/job-api.md). **PodGroup standardisation** There's also work...
Dominant resource fairness (DRF) is likely a better policy than max-min fairness. DRF is introduced in this paper: https://cs.stanford.edu/~matei/papers/2011/nsdi_drf.pdf Both Volcano and a similar scheduler used by Nvidia use DRF. Volcano: https://github.com/volcano-sh/volcano/blob/master/docs/design/drf.md...
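To make the DRF idea concrete, here is a minimal Python sketch of progressive filling: repeatedly give one more task to whichever queue currently has the smallest dominant share (its largest fractional use of any single resource). The names `dominant_share` and `drf_allocate` and the dict-based resource vectors are hypothetical and only for illustration; this is not how Armada or Volcano implement it. The numbers are the worked example from the DRF paper (9 CPUs, 18 GB of memory).

```python
def dominant_share(allocated, capacity):
    """Largest fraction of any single resource held by one queue."""
    return max(allocated[r] / capacity[r] for r in capacity)


def drf_allocate(capacity, demands):
    """Progressive filling under DRF.

    capacity: resource name -> total amount in the cluster.
    demands:  queue name -> per-task demand (resource name -> amount).
    Returns queue name -> number of tasks scheduled.
    """
    used = {r: 0.0 for r in capacity}
    alloc = {q: {r: 0.0 for r in capacity} for q in demands}
    tasks = {q: 0 for q in demands}
    # Ignore queues whose tasks request nothing, otherwise the loop never ends.
    active = {q for q, d in demands.items()
              if any(d.get(r, 0) > 0 for r in capacity)}

    while active:
        # Pick the queue with the lowest dominant share (ties broken by name).
        q = min(sorted(active), key=lambda n: dominant_share(alloc[n], capacity))
        d = demands[q]
        # If its next task does not fit, this queue is done.
        if any(used[r] + d.get(r, 0) > capacity[r] for r in capacity):
            active.discard(q)
            continue
        for r in capacity:
            used[r] += d.get(r, 0)
            alloc[q][r] += d.get(r, 0)
        tasks[q] += 1
    return tasks


# Example from the DRF paper: A's tasks need <1 CPU, 4 GB>, B's need <3 CPU, 1 GB>.
result = drf_allocate(
    {"cpu": 9, "memory": 18},
    {"A": {"cpu": 1, "memory": 4}, "B": {"cpu": 3, "memory": 1}},
)
# result == {"A": 3, "B": 2}: both queues end up with a dominant share of 2/3.
```

One deliberate simplification: when a queue's next task does not fit, the sketch drops that queue and keeps filling the others, whereas the algorithm in the paper is stated for a single pass. For the example above both give the same allocation.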
We could consider using hierarchical queues to more effectively balance resource usage between teams. There's a related notion in network engineering we could draw inspiration from, the token bucket: https://en.wikipedia.org/wiki/Token_bucket
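As a rough illustration of how token buckets could map onto hierarchical queues, the Python sketch below gives each level (for example a team and one of its queues) its own bucket and only admits work if every level on the path has enough tokens. `TokenBucket`, `try_consume_all`, and the rate and capacity values are hypothetical names and numbers chosen for the example; time is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Bucket that refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.last = 0.0         # time of the last refill

    def refill(self, now):
        # Add tokens for the elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)


def try_consume_all(buckets, amount, now):
    """Consume `amount` from every bucket, or from none if any is short."""
    for b in buckets:
        b.refill(now)
    if all(amount <= b.tokens for b in buckets):
        for b in buckets:
            b.tokens -= amount
        return True
    return False


# A team-level bucket shared by the team's queues, and one queue-level bucket.
team = TokenBucket(rate=1.0, capacity=10.0)
queue = TokenBucket(rate=1.0, capacity=4.0)

first = try_consume_all([queue, team], 4, now=0.0)   # True: both have enough
second = try_consume_all([queue, team], 1, now=0.0)  # False: queue bucket is empty
third = try_consume_all([queue, team], 1, now=1.0)   # True: one token refilled
```

Checking all levels before consuming from any of them keeps the buckets consistent: a rejected request leaves the team bucket untouched.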
Consider automatically reducing the weight of queues that are inactive for extended periods of time and ramping up the priority again once they become active.
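One possible shape for this, sketched in Python under assumptions that are not part of the proposal: the weight decays exponentially while the queue is idle, never drops below a floor, and ramps back linearly once the queue is active. The function name `next_weight` and the parameters `half_life`, `ramp_time`, and `floor` are hypothetical.

```python
def next_weight(weight, base_weight, active, dt,
                half_life=3600.0, ramp_time=600.0, floor=0.1):
    """Return the queue weight after `dt` seconds.

    While inactive, the weight halves every `half_life` seconds but never
    falls below `floor * base_weight`. While active, it climbs linearly and
    would go from zero to `base_weight` in `ramp_time` seconds.
    """
    if active:
        return min(base_weight, weight + base_weight * dt / ramp_time)
    return max(floor * base_weight, weight * 0.5 ** (dt / half_life))


idle_hour = next_weight(1.0, 1.0, active=False, dt=3600.0)   # 0.5
recovered = next_weight(0.5, 1.0, active=True, dt=300.0)     # 1.0
at_floor = next_weight(0.1, 1.0, active=False, dt=3600.0)    # 0.1
```

The floor keeps a long-idle queue from being starved entirely when it first becomes active again; whether decay and ramp should be exponential, linear, or something else is an open design choice.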
Consider earmarking nodes for jobs and allowing lower-priority jobs to backfill onto other nodes.
We should have a notion of job urgency in the spec. This could be implemented as follows. For each job, the user specifies one of the following priority groups: 1. Run...