buildah is unable to umount or delete working container if storage is not empty

Open a-skr opened this issue 2 months ago • 6 comments

Issue Description

Hi,

I experience issues when using buildah with the overlay storage driver in rootless mode.

The issue is: I cannot unmount or delete a working container if storage is not empty.

This issue has been observed with buildah 1.39.3 (on debian 13) and 1.41.5 (on a fresh minimal debian testing virtual machine with default configuration)

vfs storage driver works as expected. The issue is with overlay (in rootless mode).

Related Debian issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1119718

Steps to reproduce the issue

How to reproduce:

container=$(buildah from scratch)
mountpoint=$(buildah unshare buildah mount $container)
cd $mountpoint
touch HELLO
cd -

Describe the results you received

From there, I cannot umount the container:

buildah unshare buildah umount $container

Error: error unmounting container "working-container": unmounting build container "c5034a07d23849341d1c97f1d9a0902234579fb60624959f53c614d22970c801": replacing mount point "/home/skrzynia/.local/share/containers/storage/overlay/c2387345ed68d84ad6c67b915b4784e40317ca97a0522ea75e87ef0e944c7336/merged": directory not empty

I cannot delete the container for the same reason:

buildah rm $container

Error: removing container "working-container": deleting build container "c5034a07d23849341d1c97f1d9a0902234579fb60624959f53c614d22970c801": replacing mount point "/home/skrzynia/.local/share/containers/storage/overlay/c2387345ed68d84ad6c67b915b4784e40317ca97a0522ea75e87ef0e944c7336/merged": directory not empty

Describe the results you expected

I should have been able to unmount or delete the container.

buildah version output

buildah version 1.41.5 (image-spec 1.1.0, runtime-spec 1.2.1)

buildah info output

{
    "host": {
        "CgroupVersion": "v2",
        "Distribution": {
            "distribution": "debian",
            "version": "unknown"
        },
        "MemFree": 1781284864,
        "MemTotal": 2066771968,
        "OCIRuntime": "crun",
        "SwapFree": 1154478080,
        "SwapTotal": 1154478080,
        "arch": "amd64",
        "cpus": 2,
        "hostname": "debian14",
        "kernel": "6.16.12+deb14+1-amd64",
        "os": "linux",
        "rootless": true,
        "uptime": "13m 34.34s",
        "variant": ""
    },
    "store": {
        "ContainerStore": {
            "number": 1
        },
        "GraphDriverName": "overlay",
        "GraphOptions": null,
        "GraphRoot": "/home/skrzynia/.local/share/containers/storage",
        "GraphStatus": {
            "Backing Filesystem": "extfs",
            "Native Overlay Diff": "true",
            "Supports d_type": "true",
            "Supports shifting": "false",
            "Supports volatile": "true",
            "Using metacopy": "false"
        },
        "ImageStore": {
            "number": 0
        },
        "RunRoot": "/run/user/1001/containers"
    }
}

Provide your storage.conf

# This file is the configuration file for all tools
# that use the containers/storage library. The storage.conf file
# overrides all other storage.conf files. Container engines using the
# container/storage library do not inherit fields from other storage.conf
# files.
#
#  Note: The storage.conf file overrides other storage.conf files based on this precedence:
#      /usr/containers/storage.conf
#      /etc/containers/storage.conf
#      $HOME/.config/containers/storage.conf
#      $XDG_CONFIG_HOME/containers/storage.conf (if XDG_CONFIG_HOME is set)
# See man 5 containers-storage.conf for more information
# The "storage" table contains all of the server options.
[storage]

# Default storage driver, must be set for proper operation.
driver = "overlay"

# Temporary storage location
runroot = "/run/containers/storage"

# Priority list for the storage drivers that will be tested one
# after the other to pick the storage driver if it is not defined.
# driver_priority = ["overlay", "btrfs"]

# Primary Read/Write location of container storage
# When changing the graphroot location on an SELinux system, you must
# ensure the labeling matches the default location's labels with the
# following commands:
# semanage fcontext -a -e /var/lib/containers/storage /NEWSTORAGEPATH
# restorecon -R -v /NEWSTORAGEPATH
graphroot = "/var/lib/containers/storage"

# Optional alternate location of image store if a location separate from the
# container store is required. If set, it must be different than graphroot.
# imagestore = ""


# Storage path for rootless users
#
# rootless_storage_path = "$HOME/.local/share/containers/storage"

# Transient store mode makes all container metadata be saved in temporary storage
# (i.e. runroot above). This is faster, but doesn't persist across reboots.
# Additional garbage collection must also be performed at boot-time, so this
# option should remain disabled in most configurations.
# transient_store = true

[storage.options]
# Storage options to be passed to underlying storage drivers

# AdditionalImageStores is used to pass paths to additional Read/Only image stores
# Must be comma separated list.
additionalimagestores = [
]

# Options controlling how storage is populated when pulling images.
[storage.options.pull_options]
# Enable the "zstd:chunked" feature, which allows partial pulls, reusing
# content that already exists on the system. This is disabled by default,
# and must be explicitly enabled to be used. For more on zstd:chunked, see
# https://github.com/containers/storage/blob/main/docs/containers-storage-zstd-chunked.md
# This is a "string bool": "false" | "true" (cannot be native TOML boolean)
# enable_partial_images = "false"

# Tells containers/storage to use hard links rather then create new files in
# the image, if an identical file already existed in storage.
# This is a "string bool": "false" | "true" (cannot be native TOML boolean)
# use_hard_links = "false"

# Path to an ostree repository that might have
# previously pulled content which can be used when attempting to avoid
# pulling content from the container registry.
# ostree_repos=""

# If set to "true", containers/storage will convert images that are
# not already in zstd:chunked format to that format before processing
# in order to take advantage of local deduplication and hard linking.
# It is an expensive operation so it is not enabled by default.
# This is a "string bool": "false" | "true" (cannot be native TOML boolean)
# convert_images = "false"

# This should ALMOST NEVER be set.
# It allows partial pulls of images without guaranteeing that "partial
# pulls" and non-partial pulls both result in consistent image contents.
# This allows pulling estargz images and early versions of zstd:chunked images;
# otherwise, these layers always use the traditional non-partial pull path.
#
# This option should be enabled EXTREMELY rarely, only if ALL images that could
# EVER be conceivably pulled on this system are GUARANTEED (e.g. using a signature policy)
# to come from a build system trusted to never attack image integrity.
#
# If this consistency enforcement were disabled, malicious images could be built
# in a way designed to evade other audit mechanisms, so presence of most other audit
# mechanisms is not a replacement for the above-mentioned need for all images to come
# from a trusted build system.
#
# As a side effect, enabling this option will also make image IDs unpredictable
# (usually not equal to the traditional value matching the config digest).
# insecure_allow_unpredictable_image_contents = "false"

# Root-auto-userns-user is a user name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid and /etc/subgid file.  These ranges will be partitioned
# to containers configured to create automatically a user namespace.  Containers
# configured to automatically create a user namespace can still overlap with containers
# having an explicit mapping set.
# This setting is ignored when running as rootless.
# root-auto-userns-user = "storage"
#
# Auto-userns-min-size is the minimum size for a user namespace created automatically.
# auto-userns-min-size=1024
#
# Auto-userns-max-size is the maximum size for a user namespace created automatically.
# auto-userns-max-size=65536

[storage.options.overlay]
# ignore_chown_errors can be set to allow a non privileged user running with
# a single UID within a user namespace to run containers. The user can pull
# and use any image even those with multiple uids.  Note multiple UIDs will be
# squashed down to the default uid in the container.  These images will have no
# separation between the users in the container. Only supported for the overlay
# and vfs drivers.
# This is a "string bool": "false" | "true" (cannot be native TOML boolean)
#ignore_chown_errors = "false"

# Inodes is used to set a maximum inodes of the container image.
# inodes = ""

# Path to an helper program to use for mounting the file system instead of mounting it
# directly.
#mount_program = "/usr/bin/fuse-overlayfs"

# mountopt specifies comma separated list of extra mount options
mountopt = "nodev"

# Set to skip a PRIVATE bind mount on the storage home directory.
# This is a "string bool": "false" | "true" (cannot be native TOML boolean)
# skip_mount_home = "false"

# Set to use composefs to mount data layers with overlay.
# This is a "string bool": "false" | "true" (cannot be native TOML boolean)
# use_composefs = "false"

# Size is used to set a maximum size of the container image.
# size = ""

# ForceMask specifies the permissions mask that is used for new files and
# directories.
#
# The values "shared" and "private" are accepted.
# Octal permission masks are also accepted.
#
#  "": No value specified.
#     All files/directories, get set with the permissions identified within the
#     image.
#  "private": it is equivalent to 0700.
#     All files/directories get set with 0700 permissions.  The owner has rwx
#     access to the files. No other users on the system can access the files.
#     This setting could be used with networked based homedirs.
#  "shared": it is equivalent to 0755.
#     The owner has rwx access to the files and everyone else can read, access
#     and execute them. This setting is useful for sharing containers storage
#     with other users.  For instance have a storage owned by root but shared
#     to rootless users as an additional store.
#     NOTE:  All files within the image are made readable and executable by any
#     user on the system. Even /etc/shadow within your image is now readable by
#     any user.
#
#   OCTAL: Users can experiment with other OCTAL Permissions.
#
#  Note: The force_mask Flag is an experimental feature, it could change in the
#  future.  When "force_mask" is set the original permission mask is stored in
#  the "user.containers.override_stat" xattr and the "mount_program" option must
#  be specified. Mount programs like "/usr/bin/fuse-overlayfs" present the
#  extended attribute permissions to processes within containers rather than the
#  "force_mask"  permissions.
#
# force_mask = ""

Upstream Latest Release

No


a-skr avatar Nov 06 '25 09:11 a-skr

buildah version output:

Version:         1.41.5
Go Version:      go1.24.9
Image Spec:      1.1.0
Runtime Spec:    1.2.1
CNI Spec:        1.0.0
libcni Version:
image Version:   5.36.2
Git Commit:
Built:           Thu Jan  1 01:00:00 1970
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64

a-skr avatar Nov 06 '25 11:11 a-skr

container=$(buildah from scratch)
mountpoint=$(buildah unshare buildah mount $container)
cd $mountpoint
touch HELLO
cd -

This is incorrect usage. You may have misunderstood what buildah unshare is for. Rootless buildah/podman work within their own user and mount namespace, so all the mounts they perform are only visible in that namespace. A rootless process does not have the privileges to make mounts appear in the host namespace.

As such, your script never writes into the container: all the writes must happen inside buildah unshare, because the mount only exists there. Writing on the host means you effectively bypass our storage abstractions and write directly into the directory location, which is simply not something we can support.

The reason it works with vfs is because vfs doesn't do mounts. It only uses normal directories and copies files so the view on the host and in the namespace are the same (except the idmappings of course).
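The advice above, keeping the mount, the writes, and the unmount inside a single buildah unshare invocation, can be sketched as follows. This is a minimal sketch rather than an official recipe: the function name populate_in_one_namespace is hypothetical, the file name HELLO is taken from the report, and it assumes buildah is on PATH with rootless overlay storage.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: mount, write, and unmount a working container inside
# ONE `buildah unshare` invocation, so all three steps share the same user
# and mount namespace.
# $1: name or ID of the working container.
populate_in_one_namespace() {
    buildah unshare sh -c '
        set -eu
        mountpoint=$(buildah mount "$1")
        touch "$mountpoint/HELLO"        # written through the mount, not the bare directory
        buildah umount "$1" >/dev/null   # unmount before the namespace goes away
    ' sh "$1"
}

# Example use (commented out so that sourcing this file has no side effects):
#   container=$(buildah from scratch)
#   populate_in_one_namespace "$container"
#   buildah rm "$container"
```

The container name is passed as a positional argument to the inner shell rather than interpolated into the script text, which avoids quoting problems.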

I guess the only question is whether we should catch that case and empty the merged dir if we get an ENOTEMPTY on the rename here, @giuseppe @nalind @mtrmac WDYT? Not being able to delete the container in that case seems bad.

Luap99 avatar Nov 06 '25 11:11 Luap99

Thank you for your explanation.

I missed the fact a mount namespace was created too.

I updated the test as follows:

container=$(buildah from scratch)                           # 1.
mountpoint=$(buildah unshare buildah mount $container)      # 2. mount
buildah unshare buildah mount                               # 3. reports container as mounted
buildah unshare touch $mountpoint/HELLO                     # 4. should not work because mount ns from 2 is destroyed?
buildah unshare buildah umount $container                   # 5. cannot umount, directory merged not empty

I still have the same umount issue.

Am I right to assume the mount namespace from line 2 is destroyed after unshare returns (so line 4 is executed in a whole new mount namespace where no mount has been performed, and the touch command still bypasses the storage abstraction)?

(note that I know I can solve the issue by injecting a script to buildah unshare, or by using the --mount option. I just want to better understand how buildah works)
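For reference, the --mount option mentioned above lets buildah unshare mount the container for the duration of a single command and pass the mount point to it through an environment variable. A minimal sketch, assuming the [VARIABLE=]containerNameOrID syntax described in buildah-unshare(1); the variable name MNT and the function name touch_in_container are hypothetical:

```shell
#!/bin/sh
set -eu

# Hypothetical helper: create an empty file inside a working container using
# `buildah unshare --mount`, which mounts the container while the given
# command runs and exports the mount point to that command as $MNT.
# $1: container name or ID, $2: file name relative to the container root.
touch_in_container() {
    buildah unshare --mount "MNT=$1" sh -c 'touch "$MNT/$1"' sh "$2"
}

# Example use (commented out so that sourcing this file has no side effects):
#   container=$(buildah from scratch)
#   touch_in_container "$container" HELLO
```

Inside the inner shell, $1 is the file name, not the container: the container is only referenced through $MNT.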

If my last statement is correct, I see a minor issue here, because buildah unshare buildah mount reports the container to be mounted at line 3, but the mountpoint is not accessible at line 4. This is somewhat misleading.

I think the manpages should be updated accordingly:

  • buildah-unshare: there is currently no mention of the mount namespace.
  • buildah-mount: maybe an update to clarify what buildah mount without arguments reports exactly when rootless overlay is used?

And about your last question: if the merged directory is not empty, it's probably because a user made the same mistake as me, so it's probably better to keep current behaviour (for convenience, you may consider adding a --force flag to enable container destruction in such cases).

a-skr avatar Nov 06 '25 14:11 a-skr

Am I right to assume the mount namespace from line 2 is destroyed after unshare returns (so line 4 is executed in a whole new mount namespace where no mount has been performed, and the touch command still bypasses the storage abstraction)?

Yes, right: the namespace only exists for the duration of the unshare command, and the next unshare command creates a different one. So yes, the general advice is to run the entire script under unshare.
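One way to observe this is to compare mount namespace identifiers, which Linux exposes as the target of /proc/self/ns/mnt. The sketch below is an illustration under assumptions, not something taken from buildah: the function names are hypothetical, and it compares a wrapped command against the calling shell rather than two successive unshare runs, because the kernel may reuse the identifier of a namespace that has already been destroyed.

```shell
#!/bin/sh
set -eu

# Print the identifier of the mount namespace in which `"$@" readlink` runs,
# a string of the form "mnt:[<inode number>]". With no arguments, this
# reports the namespace of the calling shell.
mount_ns_of() {
    "$@" readlink /proc/self/ns/mnt
}

# Succeed if the wrapper command given as arguments runs its child in the
# same mount namespace as the calling shell.
same_mount_ns() {
    [ "$(mount_ns_of)" = "$(mount_ns_of "$@")" ]
}

# Example use (commented out so that sourcing this file has no side effects):
#   same_mount_ns buildah unshare || echo "buildah unshare uses its own mount namespace"
```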

This is different from podman, which also has an unshare command, but podman keeps the namespace alive and knows its state. So another possible workaround would be to use the podman namespace: if you chain several podman unshare commands, they do end up as part of the same namespace.

If my last statement is correct, I see a minor issue here, because buildah unshare buildah mount reports the container to be mounted at line 3, but the mountpoint is not accessible at line 4. This is somewhat misleading.

Right. I am not familiar with buildah internals, but it seems it stores the mounted state in a file and does not consult the real mount table of the namespace, hence the mismatch. In general, mixing namespaces and file-based state is not good. This is one of the reasons why podman uses a persistent namespace for all its commands.
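To check the real mount table rather than the recorded state, a script can look the path up in /proc/self/mountinfo from inside the namespace of interest. A minimal sketch; the function name is_mount_point is hypothetical and this is not how buildah itself tracks mounts:

```shell
#!/bin/sh
set -eu

# Succeed if path $1 is listed as a mount point in the mount table of the
# current namespace. $2 optionally names the mountinfo file to read and
# defaults to /proc/self/mountinfo (field 5 of each line is the mount point).
# Limitation: the kernel escapes spaces in paths as \040; not handled here.
is_mount_point() {
    awk -v mp="$1" '$5 == mp { found = 1 } END { exit !found }' \
        "${2:-/proc/self/mountinfo}"
}

# Example use, to be run inside a `buildah unshare` session
# (commented out so that sourcing this file has no side effects):
#   is_mount_point "$mountpoint" || echo "recorded as mounted, but not mounted here"
```

The optional second argument exists only so the lookup can be tried against a saved copy of a mount table.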

I think the manpages should be updated accordingly:

  • buildah-unshare: there is currently no mention of the mount namespace.
  • buildah-mount: maybe an update to clarify what buildah mount without arguments reports exactly when rootless overlay is used?

Sure PRs welcome.

And about your last question: if the merged directory is not empty, it's probably because a user made the same mistake as me, so it's probably better to keep current behaviour (for convenience, you may consider adding a --force flag to enable container destruction in such cases).

This happens very far down in the stack but it should be possible to do this in theory.

Luap99 avatar Nov 06 '25 15:11 Luap99

I guess the only question is whether we should catch that case and empty the merged dir if we get an ENOTEMPTY on the rename here, @giuseppe @nalind @mtrmac WDYT?

I think if we are in such an unexpected state, stopping and doing nothing is much safer than starting to delete files with unknown contents and provenance. E.g. this issue would not be filed, and the misunderstanding not noticed, if we silently deleted the files.

mtrmac avatar Nov 06 '25 18:11 mtrmac

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Dec 07 '25 00:12 github-actions[bot]