Improved debugging support

Open tonistiigi opened this issue 4 years ago • 27 comments

addresses #1053 addresses #1470

An issue with the current build environment is that we often assume everyone can write a perfect Dockerfile from scratch without any mistakes. In the real world, writing a complex Dockerfile involves a lot of trial and error: users get errors, need to understand what is causing them, and react accordingly.

In the legacy builder, one of the methods for dealing with this was to use --rm=false, or to look up the image ID of the last image layer from the build output and start a docker run session with it to understand what went wrong. BuildKit does not create intermediate images, nor does it make the containers it runs visible to docker run (both for very good reasons). Therefore this is even more complicated now and usually requires the user to set --target to do a partial build and then debug its output.
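
For illustration, the legacy flow was roughly the following (output abbreviated; the step numbers and IDs are placeholders):

$ DOCKER_BUILDKIT=0 docker build .
...
Step 5/12 : RUN npm install
 ---> Running in <container-id>
npm ERR! ...
$ docker run -it --rm <last-layer-id> sh    # debug from the last successful layer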

To improve this, we shouldn't try to bring back --rm=false, which makes all builds significantly slower and makes it impossible to manage storage for the build cache. Instead, we could provide a better solution with a new --debugger flag.

Using --debugger on a build that errors will take the user into a debugger shell, similar to the interactive docker run experience. There the user can see the error and use control commands to debug the actual cause.

If the error happened on a RUN command (an execop in LLB), the user can use a shell to rerun the command and keep tweaking it. This happens in an environment identical to the one where the execop runs; for example, this means access to secrets, ssh, cache mounts, etc. They can also inspect the environment variables and files in the system that might be causing the issue. Using control commands, the user can switch between the broken state left behind by the failed command and the initial base state for that command. So if they try many possible fixes and end up in a bad state, they can just restore the initial state and start again.
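
As a sketch of the proposed flow (nothing below is implemented; the control-command names are purely illustrative):

$ docker build --debugger .
...
ERROR: executor failed running [/bin/sh -c make]: exit code 2
# build drops into the proposed debugger shell on failure
(debug) shell     # open a shell in the failed step's environment (secrets, ssh, cache mounts intact)
(debug) reset     # restore the initial base state from before the failed RUN
(debug) retry     # rerun the build with exactly the same settings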

If the error happened on a copy (or another file operation like rm), they can run ls and similar tools to find out why the file path is incorrect.

For implementation, this depends on https://github.com/moby/buildkit/issues/749 for support to run processes on build mounts directly, without going through the solver. We would start by modifying the Executor and ExecOp so that, instead of releasing the mounts after an error, they return them together with the error. I believe the typed errors support (https://github.com/moby/buildkit/pull/1454) can be reused for this. They should be returned up to the client Solve method, which can then decide to call llb.Exec with these mounts. If the mounts are left unhandled, they are released with the gateway API release.

Once debugging has completed and the user has made changes to the source files, it is easy to trigger a restart of the build with exactly the same settings. This is also useful if you think you might be hitting a temporary error. If the retry didn't fix it, the user is brought back to the debugger.

It might make sense to introduce a concept of a "debugger image" that is used as the basis of the debugging environment. This would avoid hardcoding logic in an opinionated area.

Later this could be extended with a step-based debugger, and source-mapping support could be used to make source-code changes directly in the editor or to track dependencies in the build graph.

@hinshun

tonistiigi avatar May 02 '20 00:05 tonistiigi

Regarding the "debugger image", my colleague @slushie did some interesting work on sharing a mount namespace (partial containers) with an image that has debugging tools: https://github.com/slushie/cdbg

In that repository, there's a prototype where gdb in the debugging image attaches to a process of a running container.

This may be useful for debugging scratch images or minimal images that lack basic tools like a shell binary.

hinshun avatar May 02 '20 01:05 hinshun

/cc

fuweid avatar May 02 '20 07:05 fuweid

@coryb Now that Exec support has landed, how big a job do you estimate it would be to return the typed errors from execop/fileop that would allow running exec from the error position and from the start of the op? Wondering if we should target that for v0.8 or not. We could potentially continue working on the client-side UX after v0.8 is out. Already added #1714 to v0.8, which I think is a requirement.

tonistiigi avatar Oct 07 '20 23:10 tonistiigi

I am working on #1714 now, I am guessing a week+ before I have something viable for that.

I have not really looked into the change required for this yet. I think @hinshun has some ideas and is generally more familiar with this than I am. I will sync up with him and maybe twist his arm to help out 😄 I think we can try to break down what is remaining for this and try to come up with some estimates.

coryb avatar Oct 08 '20 00:10 coryb

Using --debugger on a build that errors will take the user into a debugger shell, similar to the interactive docker run experience. There the user can see the error and use control commands to debug the actual cause.

Interactive shells being the only option is going to leave much to be desired when building in CI pipelines. I often use Docker in CI pipelines where the build command has no terminal to drop into, or is a direct API call; having the only option be "run interactive" is not in line with current automated-build best practices. Please consider an option to allow sideband inspection of BuildKit layers, similar to how the legacy docker build works. Thanks.

ag-TJNII avatar Oct 23 '20 03:10 ag-TJNII

I've just upgraded Docker for Mac, which uses BuildKit as its default engine. I'm not feeling very comfortable with the suggested nsenter solution, since that project is deprecated (or at least marked read-only). Just wanted to give a +1 for getting this fixed; --debugger sounds like a great solution, maybe even letting it switch directly into an interactive shell when a build step fails.

lyager avatar Mar 18 '21 12:03 lyager

Just wanted to follow up: changing the backend while building works for me (DOCKER_BUILDKIT=0 docker build .), but I must admit the speed of BuildKit is nice!

lyager avatar Mar 18 '21 13:03 lyager

I agree. Having the image of the layer immediately prior to the issue makes it incredibly handy to run an interactive container at that point and poke around.

I guess for now I will run DOCKER_BUILDKIT=0 docker build . as a workaround when debugging new Dockerfiles, so that I can get the image IDs in the output again:

Step 2/12 : WORKDIR /usr/src/app
 ---> Running in 14307a565858
Removing intermediate container 14307a565858
 ---> 472b33608107
Step 3/12 : COPY ./package.json .
 ---> 40293e6966f5
Step 4/12 : COPY ./package-lock.json .
 ---> e91be6e9c9c6
Step 5/12 : RUN npm install
 ---> Running in dc762b24b192

$ docker run -it --rm e91be6e9c9c6 sh
/usr/src/app #

JoelTrain avatar Mar 23 '21 15:03 JoelTrain

Is there any solution in this space yet (that doesn't involve nsenter or regressing to DOCKER_BUILDKIT=0)? I can't quite believe that it's coming up on 2 years since https://github.com/moby/buildkit/issues/1053 was raised and nobody has been able to debug BuildKit builds since. It sounds like as common a use case as you could get?

I can't find any example of active work to resolve this issue; I might step in and help out if there's nothing in the pipeline.

gtmtech avatar Mar 23 '21 18:03 gtmtech

I don't know what you mean by the nsenter solution, but that is not recommended. What you can do is create a named target at the position in the Dockerfile you want to debug, build that target with --target, and run it with docker run.
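
Concretely, the suggested workaround looks like this ("mystage" is a made-up stage name placed just before the failing step):

$ docker build --target mystage -t debug .
$ docker run -it --rm debug sh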

tonistiigi avatar Mar 23 '21 19:03 tonistiigi

Just chiming in with a user perspective: after being put in a new environment where BuildKit appears to be the default, this is a decidedly worse experience than before. Clearly the layers are being cached. I'd guess the simplest solution with a "backward compatible user experience" might be to automatically export the last cached layer to the image store and display its hash whenever there is an error in docker build. Named targets for debugging feel like an awkward misuse of the feature, since the old way was automatic.

matt2000 avatar Apr 06 '21 19:04 matt2000

@tonistiigi Do you plan to take this issue into development in the near future? Does it have any blockers now?

strelga avatar Apr 14 '21 14:04 strelga

The --target option is not recognized by docker-compose build (version 1.28.5), so I'm sadly resorting to DOCKER_BUILDKIT=0.

itcarroll avatar Apr 23 '21 18:04 itcarroll

The --target option is not recognized by docker-compose build (version 1.28.5), so I'm sadly resorting to DOCKER_BUILDKIT=0.

IIRC, when using Compose, target is a field in the build: subsection of a service definition.

edit: https://github.com/compose-spec/compose-spec/blob/master/build.md#target
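
For reference, here's a minimal sketch of a compose service using that field (the service and stage names are made up):

services:
  app:
    build:
      context: .
      target: builder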

KevOrr avatar Apr 23 '21 18:04 KevOrr

The option proposed in #1053, where you can specify that the image should be created even on failure, would be very helpful. It would even be helpful if you could just enhance the --output option with a flag so that it also outputs on failure.

willemm avatar Apr 29 '21 14:04 willemm

This would be fantastic. It's the only thing holding me back from moving over to buildkit full time!

emmahyde avatar May 29 '21 04:05 emmahyde

Just want to say that it is VERY painful not to be able to interactively debug intermediate images... It really turns a 5-minute debugging problem into a 2-hour process...

NicolasDorier avatar Jun 10 '21 04:06 NicolasDorier

After switching to BuildKit recently because of the secret-mount option, I've just spent about half an hour trying to figure out what magical command I need to show the images in the BuildKit cache, the apparent answer being "it's not possible". I find it hard to believe that this issue still persists...

cburgard avatar Jul 01 '21 06:07 cburgard

You can add a multi-stage split anywhere in your Dockerfile and use --target to build the portion you want to turn into a debug image.

tonistiigi avatar Jul 01 '21 06:07 tonistiigi

A temporary workaround is docker-compose, which (as of writing, v1.29.2) still doesn't use BuildKit when you do docker-compose run. You can create a simple docker-compose file with context: . and use docker-compose run --rm yourservice, which will then try to build the service and print hash IDs along the way. But docker-compose build already uses BuildKit, so this workaround is most likely on its way out. As is docker-compose itself, IIRC?

hraban avatar Jul 21 '21 10:07 hraban

Just a bit of information for people who are trying to figure out how to enter a debug state: it may be helpful to spell out tonistiigi's workaround! If you're just figuring Docker out, it might not be obvious what they mean. Here's a quick guide.

Let's say you have this Dockerfile:

FROM archlinux:latest

# Initial package load
RUN pacman -Syu --noconfirm
RUN pacman -S --needed --noconfirm git base-devel sudo bash fish

RUN explode

# User
RUN useradd -m user\
 && echo "user ALL=(ALL) NOPASSWD:ALL" >/etc/sudoers.d/user
USER user
WORKDIR /home/user

I run docker buildx build --load -t arch . to build it, but it blows up at RUN explode. I want to debug it.

First, modify the starting FROM like this:

FROM archlinux:latest as working

Then add this right before the breakpoint:

FROM working
RUN explode
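
Putting the pieces together, the whole Dockerfile now reads:

FROM archlinux:latest as working

# Initial package load
RUN pacman -Syu --noconfirm
RUN pacman -S --needed --noconfirm git base-devel sudo bash fish

FROM working
RUN explode

# User
RUN useradd -m user\
 && echo "user ALL=(ALL) NOPASSWD:ALL" >/etc/sudoers.d/user
USER user
WORKDIR /home/user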

Now just run docker buildx build --load --target working -t arch . && docker run -it arch sh

Now you're in a shell right before the command that blew up. Hope that helps with debugging!

Ghoughpteighbteau avatar Oct 12 '21 21:10 Ghoughpteighbteau

Even if it is not yet possible to run containers on intermediate layers in the BuildKit cache, is there a way to extract the cache layers to view as a filesystem diff?

Aposhian avatar Dec 17 '21 07:12 Aposhian

I added this comment on #1470, as I don't think this issue fully represents the problem identified by #1470. Basically, multi-stage builds where multiple images should be exported are possibly not common, but they are a very useful technique for speeding up CI/CD builds.

This requires not debugging support, but something more akin to a Dockerfile command to explicitly push a stage, support for multiple --target parameters, or similar.

Running docker build multiple times with different --target options does not work, as it is not composable.

alexanderkjeldaas avatar Feb 03 '22 11:02 alexanderkjeldaas

This can give you a look at the point after a successfully completed stage:

DOCKER_BUILDKIT=1 docker build --target <stage> -t test .
docker run --rm -it test bash

But unlike with DOCKER_BUILDKIT=0, I don't think there's a way to see the hash of each layer created in the image, so you can't just jump in right before the error and test at the moment of failure.

Highly unfortunate, and a big deal if you ask me!

chrisawad avatar Mar 22 '22 21:03 chrisawad

$ docker --version
Docker version 20.10.14

DOCKER_BUILDKIT=0 docker build .. doesn't seem to work anymore. I no longer get the hashes.

kingbuzzman avatar Apr 18 '22 20:04 kingbuzzman

FYI:

I recently implemented an experimental interactive debugger for Dockerfile: buildg https://github.com/ktock/buildg

Also, in buildx, a discussion is ongoing about interactive debugger support and UI/UX: https://github.com/docker/buildx/issues/1104

ktock avatar May 10 '22 13:05 ktock

  • If BuildKit removes the intermediate container on build failure, how can I docker commit to debug that layer?
    • DOCKER_BUILDKIT=0 works for me in this case.
  • But is there an official best practice for debugging a failed build layer with BuildKit on? (I do like BuildKit's logging, though.)

yambottle avatar Jul 15 '22 17:07 yambottle

It's been quite some time since there's been movement here. Can we get an update on this?

terekcampbell avatar Jan 31 '23 20:01 terekcampbell

I fully support the idea of getting the hashes of each layer back. Maybe a good compromise would be to at least display the hash of the layer a failing command was run in?

ptrxyz avatar Feb 13 '23 10:02 ptrxyz

Hashes of each layer would help so much.

rfay avatar Feb 13 '23 15:02 rfay