
Symlink issue on k8s GPU node with /lib/firmware/nvidia/525.xx

Open maltegrosse opened this issue 1 year ago • 4 comments

Actual behavior After updating and upgrading to the latest nvidia drivers, kaniko runs into issues: error building image: error building stage: failed to get filesystem from image: error removing lib to make way for new symlink: unlinkat //lib/firmware/nvidia/525.147.05/gsp_ad10x.bin: device or resource busy

  • Passing --ignore-path=/lib/firmware/nvidia/525.147.05 has no effect, and using /lib as the ignore path obviously breaks other things during the build process
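
For anyone debugging this: one possible cause, not confirmed in this thread, is that the firmware file is a mount point inside the build container, since unlinkat typically returns "device or resource busy" when asked to remove a path that is mounted. The Go sketch below lists mount points under a given prefix by reading /proc/self/mountinfo. The helper name mountPointsUnder is hypothetical and is not part of kaniko or envbuilder.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// mountPointsUnder is a hypothetical helper (not kaniko code). It returns the
// mount points found in mountinfo-formatted text that are equal to prefix or
// located below it. The mount point is the fifth whitespace-separated field
// of each mountinfo line; paths containing escaped spaces are not decoded.
func mountPointsUnder(mountinfo, prefix string) []string {
	var found []string
	prefix = strings.TrimSuffix(prefix, "/")
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 5 {
			continue // not a well-formed mountinfo line
		}
		mp := fields[4]
		if mp == prefix || strings.HasPrefix(mp, prefix+"/") {
			found = append(found, mp)
		}
	}
	return found
}

func main() {
	// Assumption: this runs inside the same container as the failing build.
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read mountinfo:", err)
		os.Exit(1)
	}
	for _, mp := range mountPointsUnder(string(data), "/lib/firmware/nvidia") {
		fmt.Println(mp)
	}
}
```

If the gsp_*.bin file appears in the output, it is mounted into the container from outside, which would be consistent with the unlinkat error; if nothing is printed, the cause lies elsewhere.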

Expected behavior A week ago, before the nvidia update, the container build ran without issues.

To Reproduce Steps to reproduce the behavior:

  1. Use a k8s node with the nvidia GPU driver installed
  2. Use woodpecker-ci with the kaniko plugin

Additional Information

  • Dockerfile
ARG LAB_IMAGE=quay.io/jupyter/scipy-notebook:lab-4.0.12
FROM ${LAB_IMAGE}
RUN pip install ipywebrtc==0.6.0 
  • Build Context No additional ADD/COPY commands are used. Even with ignore-path pointing to the exact firmware folder (/lib/firmware/nvidia/525.147.05), kaniko still shows DEBU[0001] Ignore list: .... {/lib/firmware/nvidia/525.147.05/gsp_ad10x.bin false}

  • Kaniko Image (fully qualified with digest) gcr.io/kaniko-project/executor:v1.19.2-debug

https://github.com/woodpecker-ci/plugin-kaniko/blob/main/Dockerfile

Triage Notes for the Maintainers

Description Yes/No
Please check if this a new feature you are proposing
  • - [x]
Please check if the build works in docker but not in kaniko
  • - [x]
Please check if this error is seen when you use --cache flag
  • - [x]
Please check if your dockerfile is a multistage dockerfile
  • - [x]

maltegrosse avatar Feb 13 '24 07:02 maltegrosse

Same problem, newer version: //lib/firmware/nvidia/535.54.03/gsp_ga10x.bin

Coder envbuilder has a workaround for ignore-paths https://github.com/coder/envbuilder/blob/8d3cfdffc3ab221a5d224418259128c46dd51a86/envbuilder.go#L544

But, after ignoring nvidia, I have to ignore /var/run and then this happens:

Failed to build: error building stage: failed to execute command: starting command: fork/exec /bin/sh: no such file or directory

marrotte avatar Mar 12 '24 20:03 marrotte

Same problem.

It works with docker build, but it does not work with kaniko.

@maltegrosse or @marrotte, did either of you find a workaround?

I was thinking of going back to DIND for nvidia/cuda image builds, but I am not keen on doing it if there is a better way. Something changed in the nvidia/cuda image retroactively, and it has been failing since then.

bvidovic1 avatar Mar 26 '24 09:03 bvidovic1

I couldn't get it running anymore, which is why I switched to buildah.

maltegrosse avatar Mar 27 '24 03:03 maltegrosse

Yeah, I added additional CI templates based on DIND to build images with this problem (in my case nvidia/cuda) hoping it will get fixed soon.

bvidovic1 avatar Mar 27 '24 15:03 bvidovic1