Kaniko seems to create full filesystem snapshots after each stage, leading to a failed GitLab CI pipeline.
I am trying to use Kaniko to build a multi-stage image within a GitLab CI pipeline. The pipeline crashes with the following, rather unhelpful message:
ERROR: Job failed: pod "runner-<id>" status is "Failed" right after Kaniko logs Taking snapshot of full filesystem....
The reason seems to be that the memory limit of the GitLab build pod is reached at some point and Kubernetes kills it. However, when built locally using regular Docker, the resulting image is ~4 GB, and the GitLab pod's memory limit should be well above that. This got me thinking, so I created the following Dockerfile to debug the problem:
FROM python:3.11 AS dummy_stage_0
RUN echo "The test begins!"
FROM dummy_stage_0 AS dummy_stage_1
RUN echo "Congratulations, you reached level 1"
FROM dummy_stage_1 AS dummy_stage_2
RUN echo "Congratulations, you reached level 2"
FROM dummy_stage_2 AS dummy_stage_3
RUN echo "Congratulations, you reached level 3"
# this pattern continues for quite a while
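For reproducibility, a Dockerfile like this can be generated with a small shell loop. This is just a sketch: the stage count of 1000 matches the --target used in the job below, everything else is arbitrary.
# Sketch: generate the dummy Dockerfile with 1000 chained single-RUN stages.
{
  echo 'FROM python:3.11 AS dummy_stage_0'
  echo 'RUN echo "The test begins!"'
  for i in $(seq 1 1000); do
    echo "FROM dummy_stage_$((i - 1)) AS dummy_stage_${i}"
    echo "RUN echo \"Congratulations, you reached level ${i}\""
  done
} > Dockerfile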
When I build the image locally, the resulting size is exactly that of the base image, and Docker takes roughly a minute to reach level 100. Using Kaniko, the build fails after ~11 minutes with the aforementioned error while taking the snapshot of dummy_stage_47. The following parameters were used for the test:
stages:
  - test

testing:
  stage: test
  tags:
    - k8s
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - >-
      /kaniko/executor
      --skip-unused-stages
      --use-new-run
      --single-snapshot
      --cache-run-layers=false
      --cleanup
      --reproducible
      --snapshotMode=time
      --context "${CI_PROJECT_DIR}"
      --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
      --target dummy_stage_1000
      --no-push
I guess Kaniko really does create snapshots of the full filesystem after each stage, which results in huge memory consumption. Is this the expected behavior?
Have you tried --compressed-caching=false? Compressed caching takes quite a lot of memory (see the other open issues about performance/memory).
Yes, that was one of the first things I tried. For the original build it did not make a difference. I have not tried it again with the dummy build since I assumed it would only reduce the impact but not solve the underlying problem.
I just tested the --compressed-caching=false option: it does not solve the problem, but it does reduce the impact. With the option the pipeline failed at stage 41; without it, at stage 32.
The GitLab shared runners run on a VM with only 4 GB of memory. This is the cause of your crash.
Try a larger runner VM: https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html#machine-types-available-for-private-projects-x86-64
If you are using private runners, please post the memory limit configured for the GitLab build pod. It should be in your GitLab Runner config.toml file (hopefully in your Helm values.yaml file).
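For reference, the limit that actually ends up on the build pod can also be checked directly on the cluster. This is only a sketch; the namespace and pod name below are placeholders for your setup.
# Sketch: show the configured memory limit and current usage of a running CI build pod.
# "gitlab-runner" and "runner-<id>" are placeholders for your namespace and pod.
kubectl -n gitlab-runner get pod runner-<id> \
  -o jsonpath='{.spec.containers[*].resources.limits.memory}'
kubectl -n gitlab-runner top pod runner-<id>   # live usage; requires metrics-server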
The test uses the python:3.11 image, which has a size of 340.88 MB. When built locally with Docker, the resulting image has the same size. I would expect the test to work with 4 GB of memory. But my point is that Kaniko seems to create snapshots for each stage, so it makes a difference how many stages are used. A build that crashes when the RUNs are spread across multiple stages might work if they are all merged into a single stage.
BTW, we use private GitLab runners. Here's the memory usage during the build:
At stage 34, with memory usage at ~11 GB, the build was stopped by Kubernetes due to memory consumption.
I've faced the same issue with my build on Kubernetes. I'm using a Git context, and Kaniko uses so much memory that the build job gets OOMKilled.
I tried adding --compressed-caching=false; it made no difference.
The logs are below:
Enumerating objects: 977, done.
Counting objects: 100% (912/912), done.
Compressing objects: 100% (492/492), done.
I let Kaniko build just the first three stages of my test. Here you can see that each stage is saved under /kaniko/stages:
Thus, each stage adds the full image size, which quickly hits the memory limit.
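If you want to check this in your own pipeline, the :debug image used above ships a shell, so the per-stage data under /kaniko/stages can be inspected with ordinary tools (sketch):
# Sketch: list the per-stage data kaniko keeps under /kaniko/stages and its size.
ls -lh /kaniko/stages
du -sh /kaniko/stages/* 2>/dev/null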
I guess this problem is related to #2275, #2249, and #1333
Hello everyone! I found a solution here: https://stackoverflow.com/questions/67748472/can-kaniko-take-snapshots-by-each-stage-not-each-run-or-copy-operation. Add the --single-snapshot option to Kaniko:
/kaniko/executor
--context "${CI_PROJECT_DIR}"
--dockerfile "${CI_PROJECT_DIR}/Dockerfile"
--destination "${YC_CI_REGISTRY}/${YC_CI_REGISTRY_ID}/${CI_PROJECT_PATH}:${CI_COMMIT_SHA}"
--single-snapshot
--single-snapshot was already included in all the tests I did. It did not seem to change the general behavior.
Yep, indeed, --single-snapshot doesn't seem to make any difference for me either.
Actually, it seems to be ignored entirely: https://github.com/GoogleContainerTools/kaniko/issues/3215
I can confirm that too!
Additionally, I found that while the conda environment is retrieved from the cache, the pip environment is not, even when it is unchanged:
INFO[0035] RUN mamba env create --file environment_conda.yml && conda clean -afy
INFO[0035] Found cached layer, extracting to filesystem
INFO[0080] SHELL ["/opt/conda/bin/conda", "run", "-n", "virtualfriend", "/bin/bash", "-c"]
INFO[0080] No files changed in this command, skipping snapshotting.
INFO[0080] COPY environment_pip.txt .
INFO[0080] Taking snapshot of files...
INFO[0080] RUN pip install --no-cache-dir -r environment_pip.txt
INFO[0080] Initializing snapshotter ...
INFO[0080] Taking snapshot of full filesystem...
Maintainers, why are there so many unresolved issues about this? Taking snapshot of full filesystem times out the job at 1:30 min over and over. In fact, it also hangs our cluster as a whole, leaving a lot of terminating pods that are still active after 9 days. No cleanup... Please improve the stability of this project... https://github.com/GoogleContainerTools/kaniko/issues/1333 https://github.com/GoogleContainerTools/kaniko/issues/970 https://github.com/GoogleContainerTools/kaniko/issues/1516
Could you please help us? We don't want the "Taking snapshot" step.
It is the --use-new-run flag (https://github.com/GoogleContainerTools/kaniko#flag---use-new-run), and apparently it conflicts with the flags that control snapshotting. I think they should not be used together.
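As a sketch of what that comment suggests (this is only the observation above, not an officially documented restriction), keep the two kinds of flags in separate invocations:
# Sketch: use either the new run strategy ...
/kaniko/executor --context "${CI_PROJECT_DIR}" --dockerfile "${CI_PROJECT_DIR}/Dockerfile" --use-new-run --no-push
# ... or the snapshot-tuning flags (e.g. --single-snapshot), but not both in the same call.
/kaniko/executor --context "${CI_PROJECT_DIR}" --dockerfile "${CI_PROJECT_DIR}/Dockerfile" --single-snapshot --no-push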
On my side, in the end, I switched the build to buildah.
This was also driven by the fact that the company I work for didn't want to provide internal support for Kaniko builds anymore, since the project is essentially unmaintained.
Still seeing this issue with 1.23.2. Is this being looked at?
How about this? https://github.com/GoogleContainerTools/kaniko?tab=readme-ov-file#flag---snapshot-mode
I tried the different options and they did not make much of a difference.
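For anyone who wants to compare the modes themselves: --snapshotMode (spelled as in the test job above) accepts full, which is the default, plus redo and time. A minimal sketch of one variant, reusing the paths from that job:
# Sketch: the same test job as above, changing only the snapshot mode.
/kaniko/executor \
  --context "${CI_PROJECT_DIR}" \
  --dockerfile "${CI_PROJECT_DIR}/Dockerfile" \
  --snapshotMode=redo \
  --no-push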
In my case, I'm running a CI/CD pipeline in Jenkins, and the container intermittently crashes (about 90% of the time) before Kaniko takes a snapshot, which then leads to files not being found when the snapshot is attempted. I tried various flags to fix this, but none of them had any effect. To work around it, I simply added a retry mechanism in Jenkins that repeats the step on failure. Surprisingly, this worked: the container no longer crashes, and the snapshot is taken successfully.