
"gcr.io/kaniko-project/executor:latest" failed: step exited with non-zero status: 137

snthibaud opened this issue 3 years ago · 14 comments

Actual behavior
I am running a build on Cloud Build. The build succeeds, but the caching snapshot at the end fails with the following messages:

Step #0: INFO[0154] Taking snapshot of full filesystem...
Finished Step #0 ERROR ERROR: build step 0 "gcr.io/kaniko-project/executor:latest" failed: step exited with non-zero status: 137

Expected behavior
I would like the whole build to succeed, including caching.
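For context (an aside, not part of the original report): an exit status above 128 encodes 128 plus a signal number, and 137 = 128 + 9 is SIGKILL, which on a memory-constrained builder is typically the kernel OOM killer terminating the process. A quick shell check:

```shell
# Exit statuses above 128 mean the process died from a signal:
# status = 128 + signal number.
status=137
echo "signal number: $((status - 128))"  # 9
kill -l $((status - 128))                # signal name (KILL on Linux)
```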

To Reproduce
Steps to reproduce the behavior:

  1. Build on GCP Cloud Build using a cloudbuild.yaml with Kaniko caching enabled.

Additional Information
I cannot provide the Dockerfile, but it is based on continuumio/miniconda3 and also installs tensorflow in a conda environment. I think it started failing after tensorflow was added to the list of dependencies.
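A minimal cloudbuild.yaml matching this setup might look like the following (a sketch only; the image name `my-image` and the exact flags are assumptions, not the reporter's actual config):

```yaml
steps:
- name: 'gcr.io/kaniko-project/executor:latest'
  args:
  - --destination=gcr.io/$PROJECT_ID/my-image  # hypothetical image name
  - --cache=true                               # enables layer caching
```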

snthibaud avatar Jun 13 '21 13:06 snthibaud

Additionally, it builds fine with caching disabled, and when a heavy 8-CPU machine type is used. However, I find it strange that Kaniko caching requires more resources than the build itself.

snthibaud avatar Jun 13 '21 13:06 snthibaud
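As a sketch of the larger-machine workaround mentioned above, Cloud Build lets you request a bigger worker via the options block (the specific machineType value here is an assumption, not taken from the thread):

```yaml
options:
  machineType: 'E2_HIGHCPU_8'  # larger worker with more RAM than the default
```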

I've been trying to work around this issue for the past several days. Kaniko consistently tries to use more memory than our Kubernetes cluster has available. It only happens with our large images.

hugbubby avatar Jun 13 '21 13:06 hugbubby

Any workaround available? My base image is tensorflow/tensorflow:2.4.0-gpu which weighs 2.35 GB compressed.

dakl avatar Jul 01 '21 05:07 dakl

@dakl try downgrading to v1.3.0 (as mentioned in #1680). It works for me.

tk42 avatar Jul 08 '21 14:07 tk42

Any update on this topic? I have this issue on every ML-related Dockerfile where we need to use PyTorch and other libs.

Mistic92 avatar Feb 16 '22 13:02 Mistic92

The :latest image is quite old, pointing to :v1.6.0 due to issues with :v1.7.0

It's possible the bug is fixed at head, and while we wait for a v1.8.0 release (#1871) you can try out the latest commit-tagged release and see if that helps: gcr.io/kaniko-project/executor:09e70e44d9e9a3fecfcf70cb809a654445837631

If it's not fixed, it sounds like we need to figure out where layer contents are being buffered into memory while being cached, which it sounds like was introduced some time between v1.3 and now. If anybody investigates and finds anything useful, please add it here.

imjasonh avatar Feb 16 '22 14:02 imjasonh
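To try the commit-tagged image in Cloud Build, only the step's name needs to change (a sketch; the rest of the config stays as-is):

```yaml
steps:
- name: 'gcr.io/kaniko-project/executor:09e70e44d9e9a3fecfcf70cb809a654445837631'
```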

Looks like it worked, but I tried with cache disabled. On 1.6 it was stopping even with cache disabled, so this is a good sign.

Mistic92 avatar Feb 16 '22 14:02 Mistic92

Any update on this issue? I am facing the same problem when deploying an ML image with sentence-transformers and torch>=1.6.0. The image size is more than 3 GB.

wahyueko22 avatar Mar 08 '22 06:03 wahyueko22

> any update for this issue ?, i am facing same problem when deploy ML image with sentence-transformers and torch>=1.6.0. the image size is more than 3 GB.

It sounds like https://github.com/GoogleContainerTools/kaniko/issues/1669#issuecomment-1041541704 says this works with a newer commit-tagged image, and with caching disabled. Caching seems to cause filesystem contents to be buffered in memory, which causes problems with large images.

imjasonh avatar Mar 08 '22 14:03 imjasonh

> The :latest image is quite old, pointing to :v1.6.0 due to issues with :v1.7.0
>
> It's possible the bug is fixed at head, and while we wait for a v1.8.0 release (#1871) you can try out the latest commit-tagged release and see if that helps: gcr.io/kaniko-project/executor:09e70e44d9e9a3fecfcf70cb809a654445837631
>
> If it's not fixed, it sounds like we need to figure out where layer contents are being buffered into memory while being cached, which it sounds like was introduced some time between v1.3 and now. If anybody investigates and finds anything useful, please add it here.

This happened to me too with a large image, and the referenced commit solved it. Any update on why it's not solved yet in v1.8.1? @imjasonh

lappazos avatar Jul 24 '22 15:07 lappazos

https://github.com/GoogleContainerTools/kaniko/issues/2115 is the issue tracking the next release. I don't have any more information than what's in that issue.

imjasonh avatar Jul 24 '22 15:07 imjasonh

Does this issue still happen at the latest commit-tagged image? With and without caching enabled?

imjasonh avatar Jul 24 '22 15:07 imjasonh

@imjasonh I am still experiencing this issue with latest and v1.8.1 for an image with pytorch installed.

v1.3.0 seems to work as expected. Thank you @tk42 for the suggestion!

granthamtaylor avatar Aug 07 '22 00:08 granthamtaylor

Any news on this? Still happening on v1.9.0

irg1008 avatar Sep 07 '22 10:09 irg1008

If you add --compressed-caching=false, it works for me on 1.9.0.

spookyuser avatar Sep 28 '22 11:09 spookyuser

--compressed-caching=false worked well for most things except for COPY <src> <dst>, and it turns out there's also --cache-copy-layers. I was still getting crushed by PyTorch installations.

This is the cloudbuild.yaml that works really well for me now:

steps:
- name: 'gcr.io/kaniko-project/executor:latest'
  args:
  - --destination=gcr.io/$PROJECT_ID/<name>
  - --cache=true
  - --cache-ttl=48h
  - --compressed-caching=false
  - --cache-copy-layers=true

jtwigg avatar Mar 29 '23 05:03 jtwigg

I can confirm I was having the same issue in Cloud Build, and --compressed-caching=false has solved the problem with :latest so far.

javiercornejo avatar Nov 23 '23 21:11 javiercornejo