azure-workload-identity
azure.workload.identity/inject-proxy-sidecar blocks jobs in Kubernetes
Describe the bug
We have some jobs and cronjobs running in AKS that connect to an Azure SQL database using ODBC. We are planning to use managed identity and workload identity for the authentication in the ODBC driver, and for this we need the injected proxy sidecar (for some reason the default flow does not work for us).
But by doing this, the job won't end after the job container has completed successfully, since the sidecar proxy is still alive after our container is done.
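For context, the ODBC authentication the job does is along these lines (a sketch only; server and database names are placeholders). The ActiveDirectoryMsi keyword is what makes the driver call the managed identity endpoint, which is why the injected proxy is needed:

```python
import pyodbc

# Sketch of the managed identity flow the job uses (placeholder server/database).
# Authentication=ActiveDirectoryMsi makes the ODBC driver request a token from the
# managed identity endpoint, which is the request the azwi proxy sidecar intercepts.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<server>.database.windows.net,1433;"
    "Database=<database>;"
    "Authentication=ActiveDirectoryMsi"
)
```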
The job pod is in state NotReady as the proxy container is still running.
NAME READY STATUS RESTARTS AGE
pod/job-onetime-42mx8 1/2 NotReady 0 26m
Here is the dump of the pod:
Name: job-onetime-42mx8
Namespace: job-jobs
Priority: 0
Service Account: job-serviceaccount
Node: aks-systempool-85002938-vmss000002/10.1.0.4
Start Time: Wed, 01 Mar 2023 16:30:24 +0100
Labels: azure.workload.identity/use=true
controller-uid=9f307dd8-31d6-4c03-b482-563d7fece75e
job-name=job-onetime
Annotations: azure.workload.identity/inject-proxy-sidecar: true
Status: Running
IP: 10.2.0.19
IPs:
IP: 10.2.0.19
Controlled By: Job/job-onetime
Init Containers:
azwi-proxy-init:
Container ID: containerd://148eba569dabca978fb85499303ec4b8de859b6300957962ad5ba9fbf2773008
Image: mcr.microsoft.com/oss/azure/workload-identity/proxy-init:v0.15.0
Image ID: mcr.microsoft.com/oss/azure/workload-identity/proxy-init@sha256:e8064cf26147bb98efe33c5bc823eb3b32c6b0cbf93619fa6b5d72f4f7a7c068
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 01 Mar 2023 16:30:25 +0100
Finished: Wed, 01 Mar 2023 16:30:25 +0100
Ready: True
Restart Count: 0
Environment:
PROXY_PORT: 8000
AZURE_CLIENT_ID:
AZURE_TENANT_ID:
AZURE_FEDERATED_TOKEN_FILE: /var/run/secrets/azure/tokens/azure-identity-token
AZURE_AUTHORITY_HOST: https://login.microsoftonline.com/
Mounts:
/var/run/secrets/azure/tokens from azure-identity-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nwvjg (ro)
Containers:
job-onetime:
Container ID: containerd://0b3d851e1618602fb09b0d5ca0d49ae265fea720d19d7ef45608face3495fd85
Image: registry/job-onetime:local-20230301.01
Image ID: registry/job-onetime@sha256:6da19c0f646ae6d245c9f2c342d9da961540f92af3968bfcf5d58fb4015da501
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 01 Mar 2023 16:30:26 +0100
Finished: Wed, 01 Mar 2023 16:30:29 +0100
Ready: False
Restart Count: 0
Environment:
CONNECTION_STRING: <set to the key 'connection-string' in secret 'job-secrets'> Optional: false
AZURE_CLIENT_ID:
AZURE_TENANT_ID:
AZURE_FEDERATED_TOKEN_FILE: /var/run/secrets/azure/tokens/azure-identity-token
AZURE_AUTHORITY_HOST: https://login.microsoftonline.com/
Mounts:
/mnt/secrets-store from job-secrets-store (ro)
/var/run/secrets/azure/tokens from azure-identity-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nwvjg (ro)
azwi-proxy:
Container ID: containerd://d444c6294a2cf71bb7564e81f3b2d4677e29a6cd59f54fb5edc415f55890512c
Image: mcr.microsoft.com/oss/azure/workload-identity/proxy:v0.15.0
Image ID: mcr.microsoft.com/oss/azure/workload-identity/proxy@sha256:809dea7d3099c640a7d0b87f63092c97177992cb47abb141b6a6203feb32d071
Port: 8000/TCP
Host Port: 0/TCP
Args:
--proxy-port=8000
State: Running
Started: Wed, 01 Mar 2023 16:30:27 +0100
Ready: True
Restart Count: 0
Environment:
AZURE_CLIENT_ID:
AZURE_TENANT_ID:
AZURE_FEDERATED_TOKEN_FILE: /var/run/secrets/azure/tokens/azure-identity-token
AZURE_AUTHORITY_HOST: https://login.microsoftonline.com/
Mounts:
/var/run/secrets/azure/tokens from azure-identity-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nwvjg (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
job-secrets-store:
Type: CSI (a Container Storage Interface (CSI) volume source)
Driver: secrets-store.csi.k8s.io
FSType:
ReadOnly: true
VolumeAttributes: secretProviderClass=job-kv-secrets
kube-api-access-nwvjg:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
azure-identity-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3600
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m12s default-scheduler Successfully assigned job-jobs/job-onetime-42mx8 to aks-systempool-85002938-vmss000002
Normal Pulled 3m12s kubelet Container image "mcr.microsoft.com/oss/azure/workload-identity/proxy-init:v0.15.0" already present on machine
Normal Created 3m12s kubelet Created container azwi-proxy-init
Normal Started 3m12s kubelet Started container azwi-proxy-init
Normal Pulling 3m11s kubelet Pulling image "registry/job-onetime:local-20230301.01"
Normal Pulled 3m11s kubelet Successfully pulled image "registry/job-onetime:local-20230301.01" in 228.98641ms
Normal Created 3m11s kubelet Created container job-init
Normal Started 3m11s kubelet Started container job-init
Normal Pulled 3m11s kubelet Container image "mcr.microsoft.com/oss/azure/workload-identity/proxy:v0.15.0" already present on machine
Normal Created 3m10s kubelet Created container azwi-proxy
Normal Started 3m10s kubelet Started container azwi-proxy
Steps to reproduce
- Create a workload-identity-enabled job that sets the azure.workload.identity/inject-proxy-sidecar annotation to true (a minimal manifest sketch follows this list).
- Wait for the job to finish.
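A minimal Job manifest along these lines should trigger the injection (names taken from the pod dump above; the image is whatever your own workload is):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-onetime
  namespace: job-jobs
spec:
  template:
    metadata:
      labels:
        # required for the webhook to mutate the pod
        azure.workload.identity/use: "true"
      annotations:
        # asks the webhook to inject the azwi-proxy sidecar and proxy-init container
        azure.workload.identity/inject-proxy-sidecar: "true"
    spec:
      serviceAccountName: job-serviceaccount
      restartPolicy: Never
      containers:
        - name: job-onetime
          image: registry/job-onetime:local-20230301.01
```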
Expected behavior
The best would of course be that ODBC works with the default flow, but somehow it doesn't, so we need to use the sidecar.
The sidecar should be stopped once the other container(s) in the pod have completed, enabling the job to complete.
Logs
Environment
- Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"b969368e201e1f09440892d03007c62e791091f8", GitTreeState:"clean", BuildDate:"2022-12-16T19:44:08Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: Azure Kubernetes Service
- OS (e.g: cat /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:
Additional context
This also happens if you use Argo Workflows and inject the proxy into the workflow pods: the proxy keeps running after all other containers have exited. The Argo documentation describes how it handles injected sidecars; the issue seems to be that Argo tries to send a kill signal using kubectl exec, and that fails. There is a way to customize the kill command, but the azwi-proxy is based on a very thin Linux distroless base image that has no shell.
One option to resolve this, at least from my point of view, would be to compile the proxy with the option to terminate itself.
Took me days to figure this one out. As @san7hos mentions, if you could issue a pkill to the sidecar proxy, that would do it. But the base image uses barebones distroless. The azwi webhook Helm chart also doesn't really give you many options to change the proxy image, even if you decided to build your own.
I considered other options to gracefully kill the sidecar. One possibility was the OpenKruise Job Sidecar Terminator, but for that to work the proxy container needs an environment variable injected, and again, the azwi webhook Helm chart doesn't really give you any option to do so.
As always, I had to scour the depths of the internet for bits and pieces of poorly written Azure documentation sprawled here and there and put them together to figure out a solution. The solution was actually to use azwi properly, as intended: rather than using the proxy sidecar to intercept the IMDS endpoint when ODBC tries to authenticate via the Msi method, just use the projected service account token to authenticate.
I used msal to get the token, followed this half-baked solution, and authenticated to the database via an access token.
Here is sample code I tested with on a pod without a sidecar. Disclaimer: I only tested on pyodbc.
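Something along these lines (a sketch of that approach: the msal client-assertion flow with the projected token, plus pyodbc's SQL_COPT_SS_ACCESS_TOKEN pre-connect attribute; server and database names are placeholders):

```python
import os
import struct

import msal
import pyodbc

# The workload identity webhook injects these (visible in the pod spec above).
client_id = os.environ["AZURE_CLIENT_ID"]
tenant_id = os.environ["AZURE_TENANT_ID"]
token_file = os.environ["AZURE_FEDERATED_TOKEN_FILE"]
authority = os.environ["AZURE_AUTHORITY_HOST"].rstrip("/")

# Exchange the projected service account token (the federated credential)
# for an AAD access token scoped to Azure SQL.
with open(token_file) as f:
    federated_token = f.read()

app = msal.ConfidentialClientApplication(
    client_id,
    authority=f"{authority}/{tenant_id}",
    client_credential={"client_assertion": federated_token},
)
result = app.acquire_token_for_client(scopes=["https://database.windows.net/.default"])
access_token = result["access_token"]

# The ODBC driver expects the token as a length-prefixed UTF-16-LE byte string,
# passed via the SQL_COPT_SS_ACCESS_TOKEN (1256) connection attribute.
token_bytes = access_token.encode("utf-16-le")
token_struct = struct.pack(f"<I{len(token_bytes)}s", len(token_bytes), token_bytes)
SQL_COPT_SS_ACCESS_TOKEN = 1256

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<server>.database.windows.net,1433;"
    "Database=<database>",
    attrs_before={SQL_COPT_SS_ACCESS_TOKEN: token_struct},
)
print(conn.execute("SELECT 1").fetchval())
```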
If the mutating webhook controller used native sidecar containers, then I think it would resolve this (plus some annoying issues with the proxy being the default container). I might raise a PR.
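For reference, a rough sketch of the native sidecar shape (Kubernetes 1.28+): the proxy would be injected as an init container with restartPolicy: Always, which the kubelet starts before the main containers and shuts down automatically once they have all finished, so the Job can complete. This is not what the webhook currently injects, just an illustration of the pattern:

```yaml
spec:
  initContainers:
    - name: azwi-proxy
      image: mcr.microsoft.com/oss/azure/workload-identity/proxy:v0.15.0
      args:
        - --proxy-port=8000
      ports:
        - containerPort: 8000
      # restartPolicy: Always on an init container marks it as a native
      # sidecar: it runs alongside the main containers and is terminated
      # automatically once they complete.
      restartPolicy: Always
```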