"gcr.io/kaniko-project/executor:latest" failed: step exited with non-zero status: 137
Actual behavior
I am running a build on Cloud Build. The build succeeds, but the caching snapshot at the end fails with the following messages:

Step #0: INFO[0154] Taking snapshot of full filesystem...
Finished Step #0
ERROR
ERROR: build step 0 "gcr.io/kaniko-project/executor:latest" failed: step exited with non-zero status: 137
Expected behavior
I would like the whole build to succeed, including caching.
To Reproduce
Steps to reproduce the behavior:
- Build on GCP Cloud Build using a cloudbuild.yaml with Kaniko caching enabled (a minimal sketch follows below).
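For context, this is a minimal sketch of the kind of cloudbuild.yaml that triggers the problem; the destination image name is a placeholder, and the flags are the standard kaniko caching flags that also appear later in this thread:

```yaml
# Minimal sketch: kaniko build on Cloud Build with layer caching enabled.
steps:
  - name: 'gcr.io/kaniko-project/executor:latest'
    args:
      - --destination=gcr.io/$PROJECT_ID/<name>   # placeholder image name
      - --cache=true                              # store layer cache in the registry
      - --cache-ttl=48h                           # how long cached layers are reused
```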
Additional Information
I cannot provide the Dockerfile, but it is based on continuumio/miniconda3 and also installs tensorflow in a conda environment. I think it started failing after tensorflow was added to the list of dependencies.
Additionally, it builds fine with caching disabled and when a heavy 8-CPU machine type is used. However, I think it's strange that Kaniko caching requires more resources than the build itself.
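For anyone after the same workaround: picking a bigger worker is a one-line options entry in cloudbuild.yaml. E2_HIGHCPU_8 below is just one of the standard Cloud Build high-CPU machine types, used as an example of the "heavy 8 CPU machine" mentioned above:

```yaml
# Workaround sketch: run the kaniko step on a larger Cloud Build worker.
options:
  machineType: 'E2_HIGHCPU_8'   # 8 vCPUs; more memory than the default worker
```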
I've been trying to work around this issue for the past several days. Kaniko consistently tries to use more memory than our Kubernetes cluster has available. It only happens with our large images.
Any workaround available? My base image is tensorflow/tensorflow:2.4.0-gpu, which weighs 2.35 GB compressed.
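If the executor is being OOM-killed on Kubernetes (exit code 137), one stopgap is to give the kaniko pod more memory headroom. This is only a sketch with placeholder names and an arbitrary 8Gi figure; it does not reduce kaniko's memory use, it just raises the ceiling:

```yaml
# Hypothetical pod spec: raise the memory request/limit for the kaniko container
# so the snapshot/caching phase is not OOM-killed.
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build                 # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --destination=registry.example.com/app:latest   # placeholder destination
        - --cache=true
      resources:
        requests:
          memory: "8Gi"              # example value; size to your image
        limits:
          memory: "8Gi"
```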
@dakl try downgrading to v1.3.0 (as mentioned in #1680); it works for me.
Any update on this topic? I have this issue on every ML related dockerfile where we need to use pytorch and other libs.
The :latest image is quite old, pointing to :v1.6.0 due to issues with :v1.7.0
It's possible the bug is fixed at head, and while we wait for a v1.8.0 release (#1871) you can try out the latest commit-tagged release and see if that helps: gcr.io/kaniko-project/executor:09e70e44d9e9a3fecfcf70cb809a654445837631
If it's not fixed, it sounds like we need to figure out where layer contents are being buffered into memory while being cached, which it sounds like was introduced some time between v1.3 and now. If anybody investigates and finds anything useful, please add it here.
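Trying the commit-tagged image on Cloud Build only means changing the step's name; the args stay whatever you already use. A sketch, with a placeholder destination:

```yaml
# Sketch: pin the build step to the commit-tagged kaniko image mentioned above.
steps:
  - name: 'gcr.io/kaniko-project/executor:09e70e44d9e9a3fecfcf70cb809a654445837631'
    args:
      - --destination=gcr.io/$PROJECT_ID/<name>   # placeholder image name
      - --cache=true
```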
Looks like it worked, but I only tried with cache disabled. On v1.6 it was failing even with cache disabled, so that's a good sign.
Any update on this issue? I am facing the same problem when deploying an ML image with sentence-transformers and torch>=1.6.0; the image size is more than 3 GB.
It sounds like https://github.com/GoogleContainerTools/kaniko/issues/1669#issuecomment-1041541704 says this works with a newer commit-tagged image, and with caching disabled. It sounds like caching causes filesystem contents to be buffered in memory, which causes problems with large images.
Happened to me too with a large image, and the referenced commit solved it. Any update on why it's not solved yet in v1.8.1? @imjasonh
https://github.com/GoogleContainerTools/kaniko/issues/2115 is the issue tracking the next release. I don't have any more information than what's in that issue.
Does this issue still happen at the latest commit-tagged image? With and without caching enabled?
@imjasonh I am still experiencing this issue with latest and v1.8.1 for an image with pytorch installed. v1.3.0 seems to work as expected. Thank you @tk42 for the suggestion!
Any news on this? Still happening on v1.9.0
If you add --compressed-caching=false, it works for me on 1.9.0.
--compressed-caching=false worked well for most things except for COPY <src> <dst>, and it turns out there's also --cache-copy-layers. I was still getting crushed by pytorch installations. This is the cloudbuild.yaml that works really well now:
steps:
  - name: 'gcr.io/kaniko-project/executor:latest'
    args:
      - --destination=gcr.io/$PROJECT_ID/<name>
      - --cache=true
      - --cache-ttl=48h
      - --compressed-caching=false
      - --cache-copy-layers=true
I confirm I was having the same issue in Cloud Build, and --compressed-caching=false solved the problem with :latest so far.