Failed artifact streaming pull due to AKS node running out of disk
**Describe the bug**
When artifact streaming is enabled on a Linux node in our Kubernetes cluster, image pulls fail. Specifically, deployments hit "Failed to pull image" errors. In addition, the node's disk space gradually fills up over time, eventually causing all pods on the node to be evicted.
**Observations**

With artifact streaming enabled on the node:
- Image pulls fail during deployments.
- Disk space gradually fills up over time.
- All pods are eventually evicted due to lack of available disk space.

With artifact streaming disabled on the node:
- Deployments function as expected.
- Images are pulled correctly without errors.
- No significant disk space issues observed.
Error:

```
Failed to pull image ".azurecr.io/products/api:master": rpc error: code = Canceled desc = failed to pull and unpack image ".azurecr.io/products/api:master": failed to resolve reference "**.azurecr.io/products/api:master": failed to do request: Head "https://localhost:8578/v2/products/api/manifests/master?ns=.azurecr.io": context cancel
```
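A quick way to confirm the disk-pressure symptom on an affected node is to inspect disk usage from a debug pod. This is a hedged sketch: the node name is a placeholder, and the assumption that the overlaybd snapshotter keeps its data under `/var/lib/overlaybd` may not match your node image.

```shell
# Open a debug shell on the affected node (replace <node-name>).
# chroot /host gives access to the node's real filesystem.
kubectl debug node/<node-name> -it --image=busybox -- chroot /host sh -c '
  df -h /var/lib                          # overall disk usage on the node
  du -sh /var/lib/containerd 2>/dev/null  # containerd image/snapshot data
  du -sh /var/lib/overlaybd 2>/dev/null   # assumed overlaybd cache location
'
```

Comparing these numbers over time on a streaming-enabled node versus a regular node should show whether the overlaybd cache is what is filling the disk.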
**To Reproduce**
Steps to reproduce the behavior:
- Enable artifact streaming on the ACR repository
- Enable artifact streaming on the node pool
- Deploy pods with an image from the ACR
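The enablement steps above can be sketched with the az CLI. This is a sketch, not a verified sequence: registry, resource group, cluster, and node pool names are placeholders, and the artifact-streaming commands were in preview at the time, so check them against your CLI version.

```shell
# Enable artifact streaming for a repository in the registry
# (names are placeholders).
az acr artifact-streaming update \
  --name myregistry \
  --repository products/api \
  --enable-streaming true

# Add a node pool with artifact streaming enabled on the nodes.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name streampool \
  --enable-artifact-streaming
```

Deploying a pod that references an image from that repository onto the streaming-enabled pool should then reproduce the pull path described above.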
**Expected behavior**
Pods should deploy successfully, with images pulled without errors.
**Environment**
- OS: Ubuntu 20.04
- AKS / node pool version: 1.28.3
@maneeshcdls are you able to provide more detailed repro steps? It's been a while, but we can't determine which parts of overlaybd's garbage collection on nodes need improvement unless you can share the exact sequence of steps that reproduces the artifact streaming pull failures on the node.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This issue was closed because it has been stalled for 30 days with no activity.