Kaniko seems to create full filesystem snapshots after each stage, leading to a failed GitLab CI pipeline.
I am trying to use Kaniko to build a multi-stage image within a GitLab CI pipeline. The pipeline crashes with the following, rather unhelpful message:
ERROR: Job failed: pod "runner-<id>" status is "Failed" right after Kaniko logs Taking snapshot of full filesystem....
The reason seems to be that the memory limit of the GitLab build pod is reached at some point and Kubernetes kills it. However, when built locally using regular Docker, the resulting image is ~4 GB, and the GitLab pod's memory limit should be well above that. This got me thinking, so I created the following Dockerfile to debug the problem:
FROM python:3.11 AS dummy_stage_0
RUN echo "The test begins!"
FROM dummy_stage_0 AS dummy_stage_1
RUN echo "Congratulations, you reached level 1"
FROM dummy_stage_1 AS dummy_stage_2
RUN echo "Congratulations, you reached level 2"
FROM dummy_stage_2 AS dummy_stage_3
RUN echo "Congratulations, you reached level 3"
# this pattern continues for quite a while
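For reproducibility, a Dockerfile like this can be generated with a small shell loop. This is just a sketch: the stage count of 1000 matches the --target used in the job below, everything else is arbitrary.
# Sketch: generate the dummy Dockerfile with 1000 chained single-RUN stages.
{
  echo 'FROM python:3.11 AS dummy_stage_0'
  echo 'RUN echo "The test begins!"'
  for i in $(seq 1 1000); do
    echo "FROM dummy_stage_$((i - 1)) AS dummy_stage_${i}"
    echo "RUN echo \"Congratulations, you reached level ${i}\""
  done
} > Dockerfile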
When I build the image locally, the resulting size is exactly that of the base image, and Docker takes roughly a minute to reach level 100. Using Kaniko, the build fails after ~11 minutes with the aforementioned error while taking the snapshot of dummy_stage_47. The following parameters were used for the test:
stages:
  - test

testing:
  stage: test
  tags:
    - k8s
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - >-
      /kaniko/executor
      --skip-unused-stages
      --use-new-run
      --single-snapshot
      --cache-run-layers=false
      --cleanup
      --reproducible
      --snapshotMode=time
      --context "${CI_PROJECT_DIR}"
      --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
      --target dummy_stage_1000
      --no-push
I guess Kaniko really does create snapshots of the full filesystem after each stage, which results in huge memory consumption. Is this the expected behavior?
Have you tried --compressed-caching=false? Compressed caching takes quite a lot of memory (see the other open issues about performance/memory).
Yes, that was one of the first things I tried. For the original build it did not make a difference. I have not tried it again with the dummy build since I assumed it would only reduce the impact but not solve the underlying problem.
I just tested the --compressed-caching=false option: it does not solve the problem, but it does reduce the impact. With the option the pipeline failed at stage 41; without it, at stage 32.
The GitLab shared runners run on a VM with only 4 GB of memory. This is the cause of your crash.
Try a larger runner VM: https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html#machine-types-available-for-private-projects-x86-64
If you are using private runners, please post the memory limit configured for the GitLab build pod. It should be in your GitLab Runner config.toml file (hopefully in your Helm values.yaml file).
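For reference, the limit that actually ends up on the build pod can also be checked directly on the cluster. This is only a sketch; the namespace and pod name below are placeholders for your setup.
# Sketch: show the configured memory limit and current usage of a running CI build pod.
# "gitlab-runner" and "runner-<id>" are placeholders for your namespace and pod.
kubectl -n gitlab-runner get pod runner-<id> \
  -o jsonpath='{.spec.containers[*].resources.limits.memory}'
kubectl -n gitlab-runner top pod runner-<id>   # live usage; requires metrics-server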
The test uses the python:3.11 image, which has a size of 340.88 MB. When built locally with Docker, the resulting image has the same size. I would expect the test to work with 4 GB of memory. But my point is that Kaniko seems to create snapshots for each stage, so it makes a difference how many stages are used. A build that crashes when the RUNs are spread across multiple stages might work if they are all merged into a single stage.
BTW, we use private GitLab runners. Here's the memory usage during the build:
At stage 34, with memory usage at ~11 GB, the build was stopped by Kubernetes due to memory consumption.
I've faced the same issue with my build on Kubernetes. I'm using a Git context, and Kaniko uses so much memory that the build job gets OOMKilled.
I tried adding --compressed-caching=false; it made no difference.
The logs are below:
Enumerating objects: 977, done.
Counting objects: 100% (912/912), done.
Compressing objects: 100% (492/492), done.
I let Kaniko build just the first three stages of my test. Here you can see that each stage is saved under /kaniko/stages:
Thus, each stage adds the full image size, which quickly hits the memory limit.
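If you want to check this in your own pipeline, the :debug image used above ships a shell, so the per-stage data under /kaniko/stages can be inspected with ordinary tools (sketch):
# Sketch: list the per-stage data kaniko keeps under /kaniko/stages and its size.
ls -lh /kaniko/stages
du -sh /kaniko/stages/* 2>/dev/null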
I guess this problem is related to #2275, #2249, and #1333
Hello everyone! I found a solution here: https://stackoverflow.com/questions/67748472/can-kaniko-take-snapshots-by-each-stage-not-each-run-or-copy-operation. Add the --single-snapshot option to Kaniko:
/kaniko/executor
--context "${CI_PROJECT_DIR}"
--dockerfile "${CI_PROJECT_DIR}/Dockerfile"
--destination "${YC_CI_REGISTRY}/${YC_CI_REGISTRY_ID}/${CI_PROJECT_PATH}:${CI_COMMIT_SHA}"
--single-snapshot
--single-snapshot was already included in all the tests I did. It did not seem to change the general behavior.
Yep, indeed, --single-snapshot doesn't seem to make any difference for me either.
Actually, it seems to be ignored entirely: https://github.com/GoogleContainerTools/kaniko/issues/3215
I can confirm that too!
Additionally, I found that while the conda environment is retrieved from the cache, the pip environment is not, even when it is unchanged:
INFO[0035] RUN mamba env create --file environment_conda.yml && conda clean -afy
INFO[0035] Found cached layer, extracting to filesystem
INFO[0080] SHELL ["/opt/conda/bin/conda", "run", "-n", "virtualfriend", "/bin/bash", "-c"]
INFO[0080] No files changed in this command, skipping snapshotting.
INFO[0080] COPY environment_pip.txt .
INFO[0080] Taking snapshot of files...
INFO[0080] RUN pip install --no-cache-dir -r environment_pip.txt
INFO[0080] Initializing snapshotter ...
INFO[0080] Taking snapshot of full filesystem...
Maintainers, why are there so many unresolved issues about this? Taking snapshot of full filesystem times out the job at 1:30 min over and over. In fact, it also hangs our cluster as a whole, leaving a lot of terminating pods that are still active after 9 days. No cleanup... Please improve the stability of this project... https://github.com/GoogleContainerTools/kaniko/issues/1333 https://github.com/GoogleContainerTools/kaniko/issues/970 https://github.com/GoogleContainerTools/kaniko/issues/1516
Could you please help us? We don't want the "Taking snapshot" step.
It is the --use-new-run flag (https://github.com/GoogleContainerTools/kaniko#flag---use-new-run), and apparently it conflicts with the flags that control snapshotting. I think they should not be used together.
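As a sketch of what that comment suggests (this is only the observation above, not an officially documented restriction), keep the two kinds of flags in separate invocations:
# Sketch: use either the new run strategy ...
/kaniko/executor --context "${CI_PROJECT_DIR}" --dockerfile "${CI_PROJECT_DIR}/Dockerfile" --use-new-run --no-push
# ... or the snapshot-tuning flags (e.g. --single-snapshot), but not both in the same call.
/kaniko/executor --context "${CI_PROJECT_DIR}" --dockerfile "${CI_PROJECT_DIR}/Dockerfile" --single-snapshot --no-push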
On my side, in the end, I switched the build to buildah.
This was also driven by the fact that the company I work for didn't want to provide internal support for Kaniko builds anymore, since the project is essentially unmaintained.
Still seeing this issue with 1.23.2. Is this being looked at?
How about this? https://github.com/GoogleContainerTools/kaniko?tab=readme-ov-file#flag---snapshot-mode
I tried the different options and they did not make much of a difference.
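For anyone who wants to compare the modes themselves: --snapshotMode (spelled as in the test job above) accepts full, which is the default, plus redo and time. A minimal sketch of one variant, reusing the paths from that job:
# Sketch: the same test job as above, changing only the snapshot mode.
/kaniko/executor \
  --context "${CI_PROJECT_DIR}" \
  --dockerfile "${CI_PROJECT_DIR}/Dockerfile" \
  --snapshotMode=redo \
  --no-push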
In my case, I'm running a CI/CD pipeline in Jenkins, and the container intermittently crashes (about 90% of the time) before Kaniko takes a snapshot, which then leads to files not being found when the snapshot is attempted. I tried various flags to fix this, but none of them had any effect. To work around it, I simply added a retry mechanism in Jenkins that repeats the step on failure. Surprisingly, this worked: the container no longer crashes, and the snapshot is taken successfully.