
`podman images` slow with `--uidmap`

Open freva opened this issue 4 years ago • 20 comments

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

We run a single ~6 GB podman image with multiple containers, each rootful and with a unique --uidmap & --gidmap. When running many (~20-30) such containers, the podman images command becomes very slow. After a quick look with strace and some other debugging, I believe most of the time is spent calculating the image size. Unfortunately there is no way to skip that, even with --quiet.
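
For reference, a quick way to see where the time goes (a sketch; assumes an strace recent enough, >= 4.26, to support syscall classes like %stat):

# Summarize syscalls across all podman child processes; if the size calculation
# dominates, the counts are dominated by stat-family calls and directory reads.
$ sudo strace -c -f -e trace=%stat,getdents64 podman images -a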

Steps to reproduce the issue:

  1. Reset podman system:
$ sudo podman system reset

WARNING! This will remove:
        - all containers
        - all pods
        - all images
        - all build cache
Are you sure you want to continue? [y/N] y
A storage.conf file exists at /etc/containers/storage.conf
You should remove this file if you did not modified the configuration.
  2. Pull a large image and run podman images -a:
$ sudo podman pull example.com/repo/image
...
$ time sudo podman images -a
REPOSITORY              TAG         IMAGE ID      CREATED       SIZE
example.com/repo/image  latest    55e255254b1f  35 hours ago  6.33 GB

real	0m0.160s
user	0m0.040s
sys	0m0.048s
  3. Start 30 containers, each with a unique --uidmap:
$ for i in {1..30}; do sudo podman run -d --uidmap 0:$i:100000 example.com/repo/image; done
...
$ time sudo podman images -a
REPOSITORY              TAG         IMAGE ID      CREATED       SIZE
example.com/repo/image  latest    55e255254b1f  35 hours ago  23.1 GB

real	1m0.881s
user	0m37.914s
sys	0m43.773s
  4. Remove all containers (also very slow...):
$ time sudo podman rm -f $(sudo podman ps -aq) > /dev/null
real	5m9.833s
user	0m1.020s
sys	0m1.159s
$ time sudo podman images -a
REPOSITORY              TAG         IMAGE ID      CREATED       SIZE
example.com/repo/image  latest    55e255254b1f  35 hours ago  23.1 GB

real	0m57.164s
user	0m36.532s
sys	0m41.995s
  5. Remove the image (even though there are no containers or images at this point, podman still uses a lot of space, see the table below; but podman images -a is fast):
$ time sudo podman rmi $(sudo podman images -aq)
Untagged: example.com/repo/image:latest
Deleted: 55e255254b1f146a3857e9e57c7c9f1d8fc5c8be8e26e32f475081885d8fa23f

real	1m20.925s
user	0m50.986s
sys	0m59.144s
$ time sudo podman images -a
REPOSITORY  TAG         IMAGE ID    CREATED     SIZE

real	0m0.162s
user	0m0.033s
sys	0m0.042s
  6. Redownload the image (note that all layers except the last already exist, even though I just deleted the image in the previous step):
$ sudo podman pull example.com/repo/image:latest
Trying to pull example.com/repo/image:latest...
Getting image source signatures
Copying blob 273b8b71b7b6 skipped: already exists  
Copying blob 5f4a79f41734 skipped: already exists  
Copying blob 1d7e57823380 skipped: already exists  
Copying blob 094dd4168f45 skipped: already exists  
Copying blob 1da4ce7e5083 skipped: already exists  
Copying blob 66934d8f93e1 skipped: already exists  
Copying blob 933e13ee990c skipped: already exists  
Copying blob 7ef04668fb37 skipped: already exists  
Copying blob 6f9c51b8f8b2 skipped: already exists  
Copying blob d2f3b5997ad1 skipped: already exists  
Copying blob e0f5f3dbfa53 skipped: already exists  
Copying blob 1f5c8166c3ba skipped: already exists  
Copying blob 6862b881cb80 skipped: already exists  
Copying blob 33c7276c4f03 skipped: already exists  
Copying blob c53545616dfe skipped: already exists  
Copying blob 7d8d70253d88 skipped: already exists  
Copying blob 670c55e249d5 skipped: already exists  
Copying blob b6436833b837 done  
Copying config 55e255254b done  
Writing manifest to image destination
Storing signatures
55e255254b1f146a3857e9e57c7c9f1d8fc5c8be8e26e32f475081885d8fa23f
  7. Delete the image again:
$ time sudo podman rmi $(sudo podman images -aq)
Untagged: example.com/repo/image:latest
Deleted: 55e255254b1f146a3857e9e57c7c9f1d8fc5c8be8e26e32f475081885d8fa23f

real	0m1.825s
user	0m0.730s
sys	0m1.219s

I've also collected a few stats at the end of each of the steps above:

  • # FS nodes: sudo find /var/lib/containers/storage/ | wc -l
  • du size: sudo du -skh /var/lib/containers/storage/
  • Δ df size: Difference in used size from df on the filesystem that has /var/lib/containers/storage/
Step   # FS nodes   du size   Δ df size
1      -            -         -
2      55000        6.0G      +6.2G
3      3221487      179G      +0.9G
4      1714063      17G       -29M
5      51137        5.5G      -1.6G
6      55001        6.0G      +0.7G
7      21           660K      -6.2G
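
The Δ df size column was computed roughly as sketched below (an assumption about the method; df --output is GNU coreutils):

# Record used blocks after each step, then diff consecutive readings
$ df --output=used -k /var/lib/containers/storage/ | tail -n1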

Describe the results you expected: There needs to be a way to list images without size information, or at least to optimize the size calculation. My understanding is that the image is written only once and the other "copies" are just mounts over the original storage with ShiftFS, so it should be enough to stat the real image files once and skip the ShiftFS mounts?
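
A possible workaround sketch: omit Size from the Go template and hope the calculation is skipped (whether this actually helps is version-dependent and not guaranteed):

$ sudo podman images -a --format "{{.Repository}}:{{.Tag}} {{.ID}}"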

Another issue I discovered while collecting the data above: why are nearly all layers of the original image still there after I remove the image in step 5?

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.21.3
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.29-1.module+el8.4.0+11822+6cc1e7d7.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.29, commit: ae467a0c8001179d4d0adf4ada381108a893d7ec'
  cpus: 24
  distribution:
    distribution: '"rhel"'
    version: "8.4"
  eventLogger: file
  hostname: xxx
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 4.18.0-305.19.1.el8_4.x86_64
  linkmode: dynamic
  memFree: 3411726336
  memTotal: 50383847424
  ociRuntime:
    name: crun
    path: /usr/bin/crun
    version: |-
      crun version 1.0
      commit: 139dc6971e2f1d931af520188763e984d6cdfbf8
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 0
  swapTotal: 0
  uptime: 71h 4m 13.08s (Approximately 2.96 days)
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /data/containers/graph
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageStore:
    number: 1
  runRoot: /data/containers/run
  volumePath: /data/containers/graph/volumes
version:
  APIVersion: 3.2.3
  Built: 1627370979
  BuiltTime: Tue Jul 27 07:29:39 2021
  GitCommit: ""
  GoVersion: go1.15.7
  OsArch: linux/amd64
  Version: 3.2.3

Package info (e.g. output of rpm -q podman or apt list podman):

podman-3.2.3-0.10.module+el8.4.0+11989+6676f7ad.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

Yes (with the latest available on RHEL 8.4 (podman 3.2.3))

freva avatar Nov 11 '21 11:11 freva

@vrothberg PTAL, I know you've been doing some work in this area.

@freva Any chance you can try with a more recent Podman (3.4.x ideally)? We've definitely made some improvements in this area already.

mheon avatar Nov 11 '21 14:11 mheon

Thanks for the ping, @mheon!

I recently made some major improvements to speed up image listing, but those improvements only apply when there is more than one image. In this case we have exactly one image, and the performance degrades with the number of containers using that image.

$ time sudo podman images -a
REPOSITORY              TAG         IMAGE ID      CREATED       SIZE
example.com/repo/image  latest    55e255254b1f  35 hours ago  23.1 GB

That is very suspicious, as the image was initially listed at 6.33 GB. It seems like something's going on in storage: the more containers use the image, the more expensive it gets to calculate the total storage consumption. But I think containers shouldn't play any role here at all.

vrothberg avatar Nov 12 '21 09:11 vrothberg

Not sure if I will find time today but I will next week.

vrothberg avatar Nov 12 '21 10:11 vrothberg

@giuseppe PTAL

The reproducer works reliably on my machine as well. Looks c/storage related to me.

vrothberg avatar Nov 12 '21 12:11 vrothberg

This problem will go away once we move to idmapped mounts.

giuseppe avatar Nov 12 '21 13:11 giuseppe

This happens because every time you use a different mapping, c/storage needs to clone the image and chown it, effectively creating a new image.

So even if there is one image visible, in reality there are multiple images in the storage and we calculate the size for them.

AFAICS, the cost grows linearly with how many images are in the storage (even if they are not visible with podman images).

I'd say it is better to just wait for the problem to be solved once idmapped mounts work well with overlay than to add more heuristics to c/storage.
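
A rough way to see those hidden per-mapping copies (a sketch; the path assumes the default graph root, so adjust it to the graphRoot from your storage.conf):

# Each distinct --uidmap/--gidmap pair adds a chowned copy of every layer here
$ sudo ls /var/lib/containers/storage/overlay/ | wc -l
$ sudo du -sh /var/lib/containers/storage/overlay/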

giuseppe avatar Nov 12 '21 13:11 giuseppe

That makes sense, thank you, @giuseppe!

Do you have a rough ETA on when idmapped mounts will arrive?

vrothberg avatar Nov 12 '21 13:11 vrothberg

We have been told that they hope to have the kernel fixed by the end of the year. Idmapped mounts do not work with overlay at this point in the kernel.

rhatdan avatar Nov 12 '21 15:11 rhatdan

Shall we make showing the size optional? It's expensive in any case but Docker displays it by default.

vrothberg avatar Nov 12 '21 15:11 vrothberg

Yes I think we should make it optional.

rhatdan avatar Nov 12 '21 15:11 rhatdan

And then have a containers.conf flag, if people want to match Docker behaviour.

rhatdan avatar Nov 12 '21 15:11 rhatdan

Can the size be calculated once during image pull/import/creation/clone and stored, instead of being recalculated on each query? I am not aware of any reason an image would change without its Id also changing.

ykuksenko avatar May 08 '23 06:05 ykuksenko

That would make sense, @giuseppe WDYT?

rhatdan avatar May 08 '23 14:05 rhatdan

That seems to make sense to me; I am wondering why we currently don't do that. It might be because with images copied with metacopy the size is different, since the files are empty.

giuseppe avatar May 08 '23 14:05 giuseppe

Doesn't an image that is copied with metacopy become a container, and therefore isn't actually listable with podman image ls?

Now that I think of it, the command to get a container size requires the --size flag (podman ps --format=json --size).

  • This makes sense to me, because it is an expensive operation. There is also no good way to calculate the size of a container in advance, as it is mutable.
  • An image, on the other hand, should not be mutable, so its size should be pre-computable.
    • If an image were mutable, the Id would not have much meaning, at least to me.
  • If an image shares layers with other images, that would be more of a size-on-disk number or an optimization rather than an image size.
    • This may be covered by the podman system df command, or possibly by SharedSize from podman images --format=json -a (see the sketch after this list). My systems just show 0 for SharedSize for all my images. Not sure what else would cover this area.
    • metacopy could fall under being an optimization (size on disk) with this logic.
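
For example (a sketch; assumes jq is installed, and SharedSize may simply read 0 depending on the version):

# Compare per-image Size against SharedSize, and cross-check with system df
$ sudo podman images -a --format=json | jq '.[] | {Id: .Id[0:12], Size, SharedSize}'
$ sudo podman system df -v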

ykuksenko avatar May 08 '23 18:05 ykuksenko

We have been told that they hope to have the kernel fixed by the end of the year. Idmapped mounts do not work with overlay at this point in the kernel.

A few years on, do idmapped mounts work with the overlay driver on Linux 6.x? If not, would it be better to use fuse-overlayfs to get the performance back?

Ramblurr avatar Aug 23 '23 13:08 Ramblurr

idmap works for rootful, not for rootless.

rhatdan avatar Aug 23 '23 19:08 rhatdan

idmap works for rootful, not for rootless.

@rhatdan Is this still the case for overlay in modern kernel versions? What's the reason behind that?

gr0l0rg avatar Nov 03 '23 19:11 gr0l0rg

I actually tried mounting overlay inside podman unshare and it worked perfectly. But somehow rootless podman pulls are unbearably slow (usually 2x the time compared to rootful on the same machine), especially for bigger images.

Here is the mount info inside the userns via podman unshare

overlay /tmp/a01/merged overlay rw,relatime,lowerdir=/tmp/a01/lower,upperdir=/tmp/a01/upper,workdir=/tmp/a01/work,redirect_dir=nofollow,index=off,metacopy=off 0 0

My kernel is v6.5.5 and podman is 4.7.1
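
For anyone wanting to probe kernel-side idmapped-mount support directly, here is a sketch (it needs util-linux >= 2.39 for X-mount.idmap, and overlayfs only accepts idmapped layers on roughly kernel >= 5.19):

$ mkdir -p /tmp/a01/idmapped
# Create an idmapped bind mount of the merged dir, shifting both UIDs and GIDs by 100000
$ sudo mount --bind -o X-mount.idmap=b:0:100000:65536 /tmp/a01/merged /tmp/a01/idmapped
$ ls -ln /tmp/a01/idmapped    # ownership should appear shifted by 100000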

gr0l0rg avatar Nov 03 '23 19:11 gr0l0rg

Yes, we are working on making podman pull able to run within the user namespace. But for now, the kernel does not allow idmapping in a rootless user namespace.

rhatdan avatar Nov 05 '23 12:11 rhatdan