buildkit icon indicating copy to clipboard operation
buildkit copied to clipboard

Feature request: On Dockerfiles, allow filepaths to be excluded from ADD and COPY commands

Open leandrosansilva opened this issue 2 years ago • 10 comments

This feature will be useful for multi stage builds, where some files should be copied in some stages but not others.

I imagine it looking like it follows:

FROM myimage AS not_py
COPY --exclude=sources/*.py sources  /src

FROM myimage AS not_cpp_or_txt
COPY --exclude=sources/*.cpp --exclude=/sources/*.txt sources  /src

In such example, /sources has some files required by not_py but not by not_cpp_or_txt, and vice-versa.

As far as I am aware, such construct at the moment cannot be accomplished with .dockerignore, which seems to prevent files to be copied to the context, whereas the exclude proposed adds files to the context, but not to the commands that use it.

leandrosansilva avatar Nov 25 '23 11:11 leandrosansilva

.dockerignore, which seems to prevent files to be copied to the context, whereas the exclude proposed adds files to the context, but not to the commands that use it.

What is the benefit of that? We should not spend time transferring files to the daemon where they would be excluded by the copy step later. This doesn't mean that copy side shouldn't filter at all (copies between stages, multiple copies with different filters) but it should happen both for local source transfer and for copy.

I'm not so sure about the proposed syntax. It looks like it can become quite verbose if used as a replacement for a dockerignore. Maybe some way to define a new stage based on another filter with an additional filter definition, where the filter has multiple rules. Not sure exactly but https://github.com/moby/buildkit/issues/4239 also goes for a way to define new stages. It is also possible that this case is different enough from the dockerignore case and there will not be too many rules.

Some other considerations:

  • Do we expect many copies to use same excludes? Then they should be defined together somehow.
  • Is it ok that there are excludes but not includes? --parents is using IncludePatterns internally I think.

@thaJeztah @AkihiroSuda @cpuguy83

tonistiigi avatar Nov 27 '23 22:11 tonistiigi

Thx @tonistiigi for reviewing this issue.

I can see that there were other proposals over time, in addition to the one you linked, such as:

And this is a very similar proposal (meaning the current issue is actually a duplicate)

  • https://github.com/moby/buildkit/issues/2853

I mentioned the context in the original text, but in retrospect it was probably a mistake, as I confess I don't understand in details how the context works. The main target for this feature is actually the build cache.

leandrosansilva avatar Nov 29 '23 14:11 leandrosansilva

I think a generic --filter flag would be fine so we can handle includes as well:

FROM myimage AS not_py
COPY --filter=!*.py sources /src

FROM myimage AS not_cpp_or_txt
# OR
COPY --filter=!*.cpp --filter=!*.txt sources /src

FROM myimage AS combine
# AND
COPY --filter=!*.cpp,!*.py sources /src

FROM myimage AS only_cpp
COPY --filter=*.cpp sources /src

FROM myimage AS all
# default behavior as if we don't set --filter flag
COPY --filter=** sources /src

For more control we can use a comma separated list of match object instead of a simple path glob (combine stage). If multiple filters are set, they are combined using the OR operation, while individual match rules within an object are combined using the AND operation. By incorporating ! negation, you can create intricate matching rules.

crazy-max avatar Nov 30 '23 10:11 crazy-max

What is the benefit of that? We should not spend time transferring files to the daemon where they would be excluded by the copy step later. This doesn't mean that copy side shouldn't filter at all (copies between stages, multiple copies with different filters) but it should happen both for local source transfer and for copy.

and

Is it ok that there are excludes but not includes?

:point_right: This is just a first "quick" blurb, but I know there's been various ticket around this (or similar) proposals, and we may have to go through use-cases mentioned in those.

  • some use-cases evolved around multiple targets using different sets of files in the same context; having a single .dockerignore made that more complicated (or not possible)
  • use-cases where --mount is used to mount files, but COPY statement to not include those files (or directories).
  • use-cases where the context contains a large number of files and directories from which the majority of files have to be included, except from some files (these may be generated files, or other artifacts). Describing files to include through COPY and wildcards is much more complicated than to describe what files to exclude. The --parents option alleviates some use-cases here, but not all of those.

I'm not so sure about the proposed syntax. It looks like it can become quite verbose if used as a replacement for a dockerignore.

I definitely think this can be a concern. Using a --flag for this works for basic use-cases, but can get complicated fast. I know there's been some proposals to have --ignorefile=xxx to specify a .dockerignore which could alleviate some of that, but also may be opening a new can of worms.

thaJeztah avatar Nov 30 '23 14:11 thaJeztah

  • use-cases where --mount is used to mount files, but COPY statement to not include those files (or directories).

This is my use case 🙋‍♂️

Specifically, I'm generating some "clean" versions of my package.json and package-lock.json that doesn't contain the version property, in order to get better caching.

My Dockerfile looks something like this:

FROM public.ecr.aws/lambda/nodejs:20.2023.11.21.13

RUN dnf install -y gcc gcc-c++ git make openssl-devel tar zip
WORKDIR /var/task

# This layer can be cached even between version bumps (or any other metadata changes that doesn't affect deps)
RUN --mount=type=bind,source=clean-package.json,target=package.json --mount=type=bind,source=clean-package-lock.json,target=package-lock.json npm ci

# These two layers would be condensed to one:
COPY . .
RUN rm clean-package.json clean-package-lock.json

RUN npm run-script build

For me, it would be great with a simple --exclude flag to COPY:

COPY --exclude=clean-package.json --exclude=clean-package-lock.json . .

This is not meant to replace .dockerignore, which we already use to ignore a lot of files.

LinusU avatar Dec 06 '23 16:12 LinusU

We discussed this in one of the moby maintainers meetings.

Most people prefered the exclude over filter/include to cover the most basic cases. And possibly extend this later if reusing same filters becomes a problem. But exclude should be enough to test this in dockerfile labs channel.

Regarding implementation:

  • The exclude rules need to be integrated to context transfer as well, in addition to the filter during copy. If big files are excluded and never used during build then time should not be wasted to transfer them to the daemon. This filtering already works this way for the source paths and dockerignore paths.
  • @aaronlehmann is the current contenthash cache implementation (#2082) already covering all cases for this, including ** etc. so that no excluded file impacts the cache checksum? If not then a workaround would be to do a internal copy to scratch stage first and then let next copy recompute the checksum. Hopefully this is not needed but needs to be verified with integration tests as well.

tonistiigi avatar Jan 08 '24 23:01 tonistiigi

is the current contenthash cache implementation (https://github.com/moby/buildkit/pull/2082) already covering all cases for this, including ** etc. so that no excluded file impacts the cache checksum?

I believe so. That is the intent.

aaronlehmann avatar Jan 08 '24 23:01 aaronlehmann

Regarding implementation:

  • The exclude rules need to be integrated to context transfer as well, in addition to the filter during copy. If big files are excluded and never used during build then time should not be wasted to transfer them to the daemon. This filtering already works this way for the source paths and dockerignore paths.

If I understood properly, this mean that a file would need not be added to the context if it's ignored by .dockerignore AND from stages defined. Is it correct?

In this case, a file which is excluded (via --exclude) from all but one build stage must be added to the context.

I wonder if for this specific use case simply using .dockerignore would not be a better option and how much of an edge case it'd be.

I've created a change for the second suggestion (the cache) on https://github.com/moby/buildkit/pull/4561 and would be happy to get some feedback.

Thank you in advance.

leandrosansilva avatar Jan 16 '24 20:01 leandrosansilva

AND from stages defined. Is it correct?

and from stages in use. Dockerfile frontend will already determine if a stage is reachable for a given build.

In this case, a file which is excluded (via --exclude) from all but one build stage must be added to the context.

Yes, there are cases where we can't be super precise about it. I think if there are multiple active copies, one has exclude and another one is using the excluded files then we should still transfer with one context and ignore the excludes (let the excludes happen via copy). But for simple cases where there is only one copy or all excludes are the same it should be possible to do the filter on transfer. I'm not sure how far we should go in determining if exclude from one COPY can collide with a path from another COPY. Just handling the most basic cases is probably fine initially. If common pattern appears later that is not handled by this case it can be improved later.

tonistiigi avatar Jan 16 '24 20:01 tonistiigi

I am continuing the discussion in the proposed PR https://github.com/moby/buildkit/pull/4561

leandrosansilva avatar Jan 17 '24 17:01 leandrosansilva