ACL: add a capability to allow coarse topology insights across namespaces without actual namespace/read-job access

Open shoeffner opened this issue 1 year ago • 0 comments

Proposal

Add a policy capability that allows listing a summary of allocations, or a topology-tailored resource, so that the topology view can be filled with relevant data without exposing all job properties. To be meaningful, the topology view needs only basic node information and a subset of the allocation information (RAM, CPU, node, and perhaps job name and namespace, which ideally could even be "redacted").

Use-cases

Our users want to know the current resource allocation across nodes (ideally they would even see which GPUs are in use, but that's a different issue). However, they only have access to a limited number of namespaces, which results in an incomplete view. For some users, the cluster looks almost empty when they navigate to http://localhost:4646/ui/topology, making the view useless at best and potentially even misleading.

The main reason is that they want an overview of all resources, both to understand why their allocations could not be scheduled and to see how much is currently in use, so that they can make better decisions (e.g., attempt to schedule on a different node type or request less RAM).
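As a rough illustration of the kind of decision this would enable, the sketch below takes per-node capacity and a list of redacted allocation summaries and reports which nodes could still fit a given request. The record shapes (`node_id`, `cpu_mhz`, `memory_mb`) and the function names are hypothetical and not an existing Nomad API; real scheduling also takes constraints, reserved resources, and devices into account.

```python
from collections import defaultdict


def free_capacity(nodes, allocations):
    """Return remaining CPU (MHz) and memory (MB) per node.

    nodes: {node_id: {"cpu_mhz": int, "memory_mb": int}}  (hypothetical shape)
    allocations: iterable of {"node_id", "cpu_mhz", "memory_mb"} summaries
    """
    used = defaultdict(lambda: {"cpu_mhz": 0, "memory_mb": 0})
    for alloc in allocations:
        used[alloc["node_id"]]["cpu_mhz"] += alloc["cpu_mhz"]
        used[alloc["node_id"]]["memory_mb"] += alloc["memory_mb"]
    return {
        node_id: {
            "cpu_mhz": total["cpu_mhz"] - used[node_id]["cpu_mhz"],
            "memory_mb": total["memory_mb"] - used[node_id]["memory_mb"],
        }
        for node_id, total in nodes.items()
    }


def nodes_that_fit(nodes, allocations, cpu_mhz, memory_mb):
    """List node IDs whose remaining capacity covers the request."""
    free = free_capacity(nodes, allocations)
    return sorted(
        node_id
        for node_id, cap in free.items()
        if cap["cpu_mhz"] >= cpu_mhz and cap["memory_mb"] >= memory_mb
    )
```

With only a partial view of the allocations, the same calculation would overestimate free capacity, which is the misleading effect described above.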

Attempted Solutions

To view http://localhost:4646/ui/topology, one needs to have the following minimal permissions (topology-read.policy.hcl in the example below):

namespace "*" {
  capabilities = ["read-job"]
}
node {
  policy = "read"
}

To test this, run an ACL-enabled server:

nomad agent -dev -acl-enabled

Then bootstrap ACLs, apply the policy, create a token, and open the UI with it:

nomad acl bootstrap
export NOMAD_TOKEN=<bootstrapped token>
nomad acl policy apply -description "allows to read topology" topology-read topology-read.policy.hcl 
nomad acl token create -name topology-reader -policy topology-read
NOMAD_TOKEN=<new token> nomad ui -authenticate

You can change topology-read.policy.hcl and re-apply it to try out other combinations, but in my testing these were the minimal permissions required to view the topology.

However, granting the read-job capability across all namespaces is exactly what we do not want to do: some projects are confidential and should not leak job specs, which might contain secrets or other confidential data (despite our best efforts to educate users about and promote the use of Vault secrets, template stanzas, etc.).

Alternative Solution

We are also considering setting up a service that is granted the topology-read policy and filters the JSON output, so that we can provide our own visualization to our users. We already run an additional service that simplifies scheduling of common jobs and fetches Nomad job logs via OpenSearch, so this could be an option. It would, however, require yet another service to be set up and maintained, because the existing one works with the Nomad token for API access and does not have any "service" permissions of its own in the background.
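A minimal sketch of the filtering step such a service could perform. It assumes the allocation list is fetched from `/v1/allocations?namespace=*&resources=true` with the `X-Nomad-Token` header, and that each entry carries `NodeID`, `Namespace`, `JobID`, and an `AllocatedResources.Tasks` map with `Cpu.CpuShares` and `Memory.MemoryMB`. These field names are assumptions that should be checked against the Nomad version in use; the function names and the output shape are hypothetical.

```python
import json
import urllib.request


def redact_allocation(alloc, visible_namespaces=()):
    """Reduce one allocation to the fields a topology view needs."""
    # Assumed layout: AllocatedResources.Tasks.<task>.{Cpu,Memory}
    tasks = (alloc.get("AllocatedResources") or {}).get("Tasks") or {}
    cpu = sum((t.get("Cpu") or {}).get("CpuShares", 0) for t in tasks.values())
    mem = sum((t.get("Memory") or {}).get("MemoryMB", 0) for t in tasks.values())
    visible = alloc.get("Namespace") in visible_namespaces
    return {
        "node_id": alloc.get("NodeID"),
        "cpu_mhz": cpu,
        "memory_mb": mem,
        # Hide job identity unless the caller may read that namespace.
        "namespace": alloc.get("Namespace") if visible else "redacted",
        "job_id": alloc.get("JobID") if visible else "redacted",
    }


def fetch_redacted(addr, token, visible_namespaces=()):
    """Fetch all allocations with a privileged token and redact them."""
    req = urllib.request.Request(
        f"{addr}/v1/allocations?namespace=*&resources=true",
        headers={"X-Nomad-Token": token},
    )
    with urllib.request.urlopen(req) as resp:
        allocs = json.load(resp)
    return [redact_allocation(a, visible_namespaces) for a in allocs]
```

In this sketch the service would call `fetch_redacted("http://localhost:4646", token)` with a token that holds the topology-read policy, and only the redacted records would ever reach end users. Everything not copied explicitly into the result, including the rest of the job information, is dropped.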

shoeffner, Oct 30 '24 11:10