Nextflow uses the default service account to launch the pod in a k8s cluster instead of the nextflow service account
Bug report
We are trying to run a workflow on an EKS cluster. The workflow works well on minikube, but on the EKS cluster it doesn't use the configured context to launch the Nextflow pod; it uses the system:serviceaccount:nextflow:default service account instead.
We are running the workflow like this:
nextflow kuberun https://<gitea-private-repository>
nextflow.config file:
process.executor = "k8s"
k8s {
    namespace = "nextflow"
    serviceAccount = "nextflow"
    storageClaimName = "nextflow-pv-claim"
    storageMountPath = "/nextflow/"
    context = "<current-context>"
}
Expected behavior and actual behavior
On running nextflow kuberun <gitea-private-repository>, it throws an access-denied error:
Request GET /api/v1/namespaces/nextflow/persistentvolumeclaims/nextflow-pv-claim returned an error code=403
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
},
"status": "Failure",
"message": "persistentvolumeclaims \"nextflow-pv-claim\" is forbidden: User \"system:serviceaccount:nextflow:default\" cannot get resource \"persistentvolumeclaims\" in API group \"\" in the namespace \"nextflow\"",
"reason": "Forbidden",
"details": {
"name": "nextflow-pv-claim",
"kind": "persistentvolumeclaims"
},
"code": 403
}
Ideally, it should use the configured context to launch the Nextflow pod. Moreover, our nextflow service account config looks as follows:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nextflow-role
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - pods/status
  verbs:
  - get
  - list
  - watch
  - create
  - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nextflow-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: nextflow-role
subjects:
- kind: ServiceAccount
  name: nextflow
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nextflow
The problem can be worked around by giving the default service account in the namespace the appropriate permissions, as sketched below.
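For illustration, a minimal sketch of such a binding, assuming the nextflow-role shown above already exists in the nextflow namespace (the binding name is a placeholder):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  # hypothetical name; any name works
  name: default-rolebinding
  namespace: nextflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: nextflow-role
subjects:
  # grant the role to the default service account in this namespace
  - kind: ServiceAccount
    name: default
    namespace: nextflow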
Steps to reproduce the problem
- Create the nextflow namespace
- Create the service account along with the role and rolebinding
- Create a main.nf with any simple process (see the sketch after this list)
- Use the nextflow.config shared above
- Run nextflow kuberun
- It should give the error.
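For step 3, a minimal sketch of such a main.nf (assuming DSL2; the process name and script are placeholders, any trivial process will do):

nextflow.enable.dsl=2

// hypothetical placeholder process: just echoes a message
process sayHello {
    output:
    stdout

    script:
    """
    echo 'Hello from Kubernetes'
    """
}

workflow {
    sayHello | view
}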
Program output
Aug-13 11:28:04.101 [main] ERROR nextflow.cli.Launcher - @unknown
nextflow.k8s.client.K8sResponseException: Request GET /api/v1/namespaces/nextflow/persistentvolumeclaims/efs-claim returned an error code=403
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
},
"status": "Failure",
"message": "persistentvolumeclaims \"efs-claim\" is forbidden: User \"system:serviceaccount:nextflow:default\" cannot get resource \"persistentvolumeclaims\" in API group \"\" in the namespace \"nextflow\"",
"reason": "Forbidden",
"details": {
"name": "efs-claim",
"kind": "persistentvolumeclaims"
},
"code": 403
}
at nextflow.k8s.client.K8sClient.makeRequest(K8sClient.groovy:370)
at nextflow.k8s.client.K8sClient.makeRequest(K8sClient.groovy)
at nextflow.k8s.client.K8sClient.get(K8sClient.groovy:379)
at nextflow.k8s.client.K8sClient.volumeClaimRead(K8sClient.groovy:429)
at nextflow.k8s.K8sConfig.checkStorageAndPaths(K8sConfig.groovy:216)
at nextflow.k8s.K8sConfig$checkStorageAndPaths$0.call(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:139)
at nextflow.k8s.K8sDriverLauncher.run(K8sDriverLauncher.groovy:123)
at nextflow.cli.CmdKubeRun.run(CmdKubeRun.groovy:75)
at nextflow.cli.Launcher.run(Launcher.groovy:475)
at nextflow.cli.Launcher.main(Launcher.groovy:657)
Environment
- Nextflow version: 21.04.3.5560
- Java version: openjdk 11.0.11
- Operating system: Ubuntu 20.04
- Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
@MrHassanMurtaza I am facing the same issue. Have you found an alternate way to launch the job with a custom serviceAccount?
Could be related to #1049
Getting the same error on GKE.
Question for anyone on this thread -- does this happen to you only when using kuberun? What if you provision your own head pod and do nextflow run instead? You can use this script to help you provision a head pod.
Just wanted to add that we are having this same problem with a k8s cluster hosted on-prem.
We are just starting out and trying to run:
nextflow kuberun login
With a nextflow.config of:
k8s {
    namespace = "nextflow"
    storageClaimName = "tommy-pvc"
    storageMountPath = "/workspace"
}
We get a similar error:
❯ nextflow kuberun login
Request GET /api/v1/namespaces/nextflow/persistentvolumeclaims/tommy-pvc returned an error code=403
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
},
"status": "Failure",
"message": "persistentvolumeclaims \"tommy-pvc\" is forbidden: User \"system:serviceaccount:default:default\" cannot get resource \"persistentvolumeclaims\" in API group \"\" in the namespace \"nextflow\"",
"reason": "Forbidden",
"details": {
"name": "tommy-pvc",
"kind": "persistentvolumeclaims"
},
"code": 403
}
Nextflow Version:
N E X T F L O W
version 21.10.6 build 5660
created 21-12-2021 16:55 UTC (11:55 EDT)
cite doi:10.1038/nbt.3820
http://nextflow.io
k8s version: 1.22.4
Happy to add any information that is needed.
@tjdurant Are you actually using a non-default service account in your nextflow config? I would expect you to have something like:
k8s {
    namespace = "nextflow"
    serviceAccount = "..."
    storageClaimName = "tommy-pvc"
    storageMountPath = "/workspace"
}
@bentsherman Thanks for the reply. Good point, so we created a 'nextflow' SA.
❯ kubectl get serviceaccounts | grep nextflow
nextflow 1 5m11s
Updated the nextflow.config in the k8s scope:
k8s {
    namespace = "nextflow"
    serviceAccount = "nextflow"
    storageClaimName = "tommy-pvc"
    storageMountPath = "/workspace"
}
re-ran nextflow kuberun login:
Request GET /api/v1/namespaces/nextflow/persistentvolumeclaims/tommy-pvc returned an error code=403
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
},
"status": "Failure",
"message": "persistentvolumeclaims \"tommy-pvc\" is forbidden: User \"system:serviceaccount:default:default\" cannot get resource \"persistentvolumeclaims\" in API group \"\" in the namespace \"nextflow\"",
"reason": "Forbidden",
"details": {
"name": "tommy-pvc",
"kind": "persistentvolumeclaims"
},
"code": 403
}
Looks like it's using the default SA and namespace?
Okay just making sure. If you're okay with using the default service account then you can just add the role binding as shown in the OP. If you're still having issues you might need to add PVCs to the role as well:
resources:
- pods
- pods/status
- persistentvolumeclaims
By the way @MrHassanMurtaza I am looking into this issue but for some reason it doesn't happen on my local k8s cluster (I see it worked for you on minikube as well). I will try to reproduce it on an EKS cluster.
@bentsherman did the head node approach work for you when encountering this error?
@tjdurant You mean creating your own submitter pod instead of kuberun? I don't think it did.
Finally able to reproduce this problem on EKS. I found a related problem: if my kube config is set to the default namespace, I get a similar error when specifying a non-default namespace to Nextflow, but the error doesn't happen if I change my kube config to the nextflow namespace. Makes me wonder if Nextflow is trying to perform some k8s operations before properly switching to the specified namespace / service account.
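To check whether this applies to your setup, the namespace recorded in the kube config can be inspected and switched with standard kubectl commands; the nextflow namespace below is just the one used in this thread:

# print the namespace set for the current kube context (empty means "default")
kubectl config view --minify --output 'jsonpath={..namespace}'

# point the current context at the nextflow namespace instead
kubectl config set-context --current --namespace=nextflow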
Hi folks. I ran into this same issue with the serviceAccount specified in the k8s scope in the nextflow.config file not being recognized. I saw the same error message 403. I'm running an EKS cluster with a PVC on AWS EFS storage. This is a workaround and not a proper solution, but I was able to get this working as @bentsherman suggests by creating a role (called nextflow-role) and binding the role to the default service account.
I found that I needed to include both persistentvolumeclaims and configmaps in the resources section of the role definition to get this to work.
For reference, the role definition I created (nextflow-role) is as follows:
ubuntu@ip-172-31-48-161:~$ kubectl get roles/nextflow-role -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"annotations":{},"name":"nextflow-role","namespace":"default"},"rules":[{"apiGroups":[""],"resources":["pods","pods/status","pods/log","pods/exec","jobs","jobs/status","jobs/log"],"verbs":["get","list","watch","create","delete"]},{"apiGroups":["apps"],"resources":["deployments"],"verbs":["get","list","watch","create","delete"]}]}
  creationTimestamp: "2022-08-20T14:54:09Z"
  name: nextflow-role
  namespace: default
  resourceVersion: "4095041"
  uid: 1e427633-1710-4f00-b2c6-37cafdaf9e8e
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - pods/status
  - pods/log
  - pods/exec
  - jobs
  - jobs/status
  - jobs/log
  - persistentvolumeclaims
  - configmaps
  verbs:
  - get
  - list
  - watch
  - create
  - delete
- apiGroups:
  - apps
  resources:
  - deployments
  verbs:
  - get
  - list
  - watch
  - create
  - delete
I then bound this role to the default service account using the following rolebinding:
ubuntu@ip-172-31-48-161:~$ kubectl get rolebinding/default-rolebind -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"RoleBinding","metadata":{"annotations":{},"name":"default-rolebind","namespace":"default"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"nextflow-role"},"subjects":[{"kind":"ServiceAccount","name":"default"}]}
  creationTimestamp: "2022-08-20T16:15:47Z"
  name: default-rolebind
  namespace: default
  resourceVersion: "4092303"
  uid: 5dd585f6-f82a-4c9f-93e6-b62e0ef5cdf5
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: nextflow-role
subjects:
- kind: ServiceAccount
  name: default
After applying these changes, pods running under the default service account had the permissions needed:
ubuntu@ip-172-31-48-161:~$ nextflow kuberun login -v efs-storage-claim:/mnt/scratch
Pod started: jolly-bhaskara
bash-4.2#
Okay, I think I figured out the problem.
Nextflow has a K8sClient that makes HTTP requests to the K8s cluster. The HTTP requests use a bearer token that is provided by ClientConfig, which is in turn initialized from the $KUBECONFIG file (see ClientConfig.fromUserAndCluster()). In other words, kuberun doesn't use the namespace or service account from the nextflow config to authenticate HTTP requests, only the kube config. This problem doesn't exist when Nextflow is already running in the K8s cluster, I think because the pod is already authenticated with the cluster.
To make kuberun use the desired namespace and service account, I had to update my kube config as follows:
# set current namespace
export NAMESPACE=tower-nf
kubectl config set-context --current --namespace ${NAMESPACE}
# set current user (i.e. service account)
export SERVICE_ACCOUNT=tower-launcher-sa
export SA_TOKEN_NAME=`kubectl -n ${NAMESPACE} get serviceaccount ${SERVICE_ACCOUNT} -o jsonpath='{.secrets[0].name}'`
export SA_TOKEN=`kubectl -n ${NAMESPACE} get secret ${SA_TOKEN_NAME} -o jsonpath='{.data.token}' | base64 --decode`
kubectl config set-credentials ${SERVICE_ACCOUNT} --token=${SA_TOKEN}
kubectl config set-context --current --user=${SERVICE_ACCOUNT}
Source: Oracle docs
I'm not sure yet how to make Nextflow do this part automatically. Presumably the ClientConfig should just check the nextflow config first, but I don't know if the SA secret token needs to be in the kube config like in the above example. For now, I will just update the nf-k8s-best-practices guide, which will eventually make it into the main Nextflow docs.
Found the culprit I think: https://github.com/nextflow-io/nextflow/blob/ce7fa651c5fd2c0d7164ad0ecf0e2ef4e13ba8fd/modules/nextflow/src/main/groovy/nextflow/k8s/client/ConfigDiscovery.groovy#L154-L155
This method is essentially doing what I outlined in the previous comment. So we need to parameterize this method for the namespace and service account name. I will draft a PR.
Basically, the namespace needs to be specified on this command, doesn't it?
kubectl describe secret $(kubectl get secrets | grep default | cut -f1 -d ' ') | grep -E '^token' | cut -f2 -d':' | tr -d '\t'
I'm having the same problems with GKE. Please note that Workload Identity will be enabled by default soon, so this will affect all GKE users once that happens.
This has been solved in recent Nextflow versions (23.02.0-edge or later). There's a getting started guide here:
https://github.com/seqeralabs/wave-showcase/tree/master/example-gke
Thanks @pditommaso. I followed the recipe and it worked on GKE.