Throughput reported by 'podman push' is too high
Issue Description
When pushing an image to quay.io, podman reports unrealistic throughput numbers:
$ podman push quay.io/foo/bar:zoo
Getting image source signatures
Copying blob 40d0b5be75be done |
Copying blob e2604a9fd705 [==================================>---] 86.0MiB / 92.9MiB | 350.0 MiB/s
Copying blob 2e29575f0513 [==========================>-----------] 94.0MiB / 130.6MiB | 384.9 MiB/s
I've seen it report up to 1.1 GiB/s, but I failed to capture it. My upload bandwidth is 20 Mbps as reported by https://speedtest.xfinity.com/results, so these numbers are not realistic; something is off with the arithmetic there.
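(For scale, taking the 20 Mbps uplink at face value: 20 Mbit/s ÷ 8 = 2.5 MB/s ≈ 2.4 MiB/s, so a reported 350 MiB/s is roughly 150 times the physically available upload bandwidth, and 1.1 GiB/s roughly 470 times.)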
Steps to reproduce the issue
- Measure upload bandwidth
- Push a somewhat large (> 100 MB) image to an image registry and watch the throughput reported by podman.
- Compare the throughput as reported by podman to the available upload bandwidth
Describe the results you received
podman reported image upload speed of 350.0 MiB/s
Describe the results you expected
podman reports image upload speed within my cable modem's upload bandwidth
podman info output
$ podman info
host:
arch: amd64
buildahVersion: 1.32.0
cgroupControllers:
- cpu
- io
- memory
- pids
cgroupManager: systemd
cgroupVersion: v2
conmon:
package: conmon-2.1.7-3.fc39.x86_64
path: /usr/bin/conmon
version: 'conmon version 2.1.7, commit: '
cpuUtilization:
idlePercent: 95.32
systemPercent: 0.66
userPercent: 4.02
cpus: 16
databaseBackend: boltdb
distribution:
distribution: fedora
variant: workstation
version: "39"
eventLogger: journald
freeLocks: 2048
hostname: fedora
idMappings:
gidmap:
- container_id: 0
host_id: 1000
size: 1
- container_id: 1
host_id: 100000
size: 65536
uidmap:
- container_id: 0
host_id: 1000
size: 1
- container_id: 1
host_id: 100000
size: 65536
kernel: 6.5.10-300.fc39.x86_64
linkmode: dynamic
logDriver: journald
memFree: 10782474240
memTotal: 67100966912
networkBackend: netavark
networkBackendInfo:
backend: netavark
dns:
package: aardvark-dns-1.8.0-1.fc39.x86_64
path: /usr/libexec/podman/aardvark-dns
version: aardvark-dns 1.8.0
package: netavark-1.8.0-2.fc39.x86_64
path: /usr/libexec/podman/netavark
version: netavark 1.8.0
ociRuntime:
name: crun
package: crun-1.11.1-1.fc39.x86_64
path: /usr/bin/crun
version: |-
crun version 1.11.1
commit: 1084f9527c143699b593b44c23555fb3cc4ff2f3
rundir: /run/user/1000/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
os: linux
pasta:
executable: /usr/bin/pasta
package: passt-0^20231004.gf851084-1.fc39.x86_64
version: |
pasta 0^20231004.gf851084-1.fc39.x86_64
Copyright Red Hat
GNU General Public License, version 2 or later
<https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
remoteSocket:
exists: false
path: /run/user/1000/podman/podman.sock
security:
apparmorEnabled: false
capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
rootless: true
seccompEnabled: true
seccompProfilePath: /usr/share/containers/seccomp.json
selinuxEnabled: true
serviceIsRemote: false
slirp4netns:
executable: /usr/bin/slirp4netns
package: slirp4netns-1.2.2-1.fc39.x86_64
version: |-
slirp4netns version 1.2.2
commit: 0ee2d87523e906518d34a6b423271e4826f71faf
libslirp: 4.7.0
SLIRP_CONFIG_VERSION_MAX: 4
libseccomp: 2.5.3
swapFree: 8587571200
swapTotal: 8589930496
uptime: 4h 52m 31.00s (Approximately 0.17 days)
plugins:
authorization: null
log:
- k8s-file
- none
- passthrough
- journald
network:
- bridge
- macvlan
- ipvlan
volume:
- local
registries:
search:
- registry.fedoraproject.org
- registry.access.redhat.com
- docker.io
- quay.io
store:
configFile: /home/syedriko/.config/containers/storage.conf
containerStore:
number: 0
paused: 0
running: 0
stopped: 0
graphDriverName: overlay
graphOptions: {}
graphRoot: /home/syedriko/.local/share/containers/storage
graphRootAllocated: 1022488477696
graphRootUsed: 346458644480
graphStatus:
Backing Filesystem: btrfs
Native Overlay Diff: "true"
Supports d_type: "true"
Supports shifting: "false"
Supports volatile: "true"
Using metacopy: "false"
imageCopyTmpDir: /var/tmp
imageStore:
number: 67
runRoot: /run/user/1000/containers
transientStore: false
volumePath: /home/syedriko/.local/share/containers/storage/volumes
version:
APIVersion: 4.7.0
Built: 1695838680
BuiltTime: Wed Sep 27 14:18:00 2023
GitCommit: ""
GoVersion: go1.21.1
Os: linux
OsArch: linux/amd64
Version: 4.7.0
Podman in a container
No
Privileged Or Rootless
Rootless
Upstream Latest Release
No
Additional environment details
Additional information
does it show it constantly off or is it a temporary peak value?
The values change; I've seen them fluctuate between 300 MiB/s and 1.1 GiB/s, but the entire range is unrealistic.
I see the same issue with Podman 4.7:
Copying blob 21b98c2cb85e [========>-----------------------------] 25.0MiB / 100.0MiB | 1.3 GiB/s
I think we introduced the change in https://github.com/containers/image/pull/2015. Since pull stats are fine, I assume that we calculate something different when doing an image push.
Buildah is affected as well, so I think this one is a more general issue.
I see the same. It's reporting over 100 MiB/s for a large image, while my internet connection only has ~40 MB/s of upload bandwidth.
Unfortunately I haven't had any further time to spend on this. @mtrmac do you have a rough guess where the issue may come from?
I didn’t investigate in detail now, just some quick points:
- The bandwidth is measured in terms of the source stream form; so on push, it is measuring the speed of reading uncompressed data from c/storage. The code does not track enough data to decide that the “throughput that matters” is the network one, not the disk one, nor to specifically measure at one or the other end of the copy pipeline. On slow links with fast CPUs (frequently compression is the slowest part), this could easily explain a factor of 2 or so, depending on the achieved compression ratio.
- In principle, over time, the total throughput when reading from the source should (after buffers fill up) be limited by the throughput of compression and I/O writes, but it might be the case that this is measuring “instantaneous” throughput on reads from the source, without taking into account the time the copy is waiting for the destination writes and not triggering any reads from the source at all. I don’t immediately know what the mpb progress bar code is doing, nor whether the computation is tunable (to be smoothed, or to explicitly take / not take unrelated waits into account).
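To make the first point concrete, here is a rough, self-contained Go sketch of that pipeline shape. It is not the actual containers/image code; the helper names (measuringReader, pushLayer, zeroReader) and the toy in-process registry are invented for illustration. The measuring reader only ever sees the uncompressed layer bytes that feed the compressor, while the slow network writes happen deep inside http.Transport, out of its sight:

package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// measuringReader stands in for the progress proxy in this sketch: it wraps
// the *uncompressed* layer stream and only observes the time spent inside
// its own Read() calls.
type measuringReader struct {
	r      io.Reader
	inRead time.Duration
	bytes  int64
}

func (m *measuringReader) Read(p []byte) (int, error) {
	start := time.Now()
	n, err := m.r.Read(p)
	m.inRead += time.Since(start)
	m.bytes += int64(n)
	return n, err
}

// pushLayer compresses the layer on the fly and uploads it; what gets
// measured is the uncompressed input, not what goes over the wire.
func pushLayer(layer io.Reader, url string) (*measuringReader, error) {
	m := &measuringReader{r: layer}
	pr, pw := io.Pipe()
	go func() {
		gz := gzip.NewWriter(pw)
		_, err := io.Copy(gz, m) // the compressor pulls from the measured reader
		gz.Close()
		pw.CloseWithError(err)
	}()
	// The network writes happen inside http.Transport while it drains pr;
	// the measuring reader never sees that time directly.
	resp, err := http.Post(url, "application/octet-stream", pr)
	if err != nil {
		return m, err
	}
	resp.Body.Close()
	return m, nil
}

// zeroReader stands in for a highly compressible uncompressed layer.
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) { return len(p), nil }

func main() {
	// A throwaway in-process "registry" that drains the upload slowly.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		buf := make([]byte, 32<<10)
		for {
			n, err := r.Body.Read(buf)
			if n > 0 {
				time.Sleep(2 * time.Millisecond) // simulate a slow uplink
			}
			if err != nil {
				break
			}
		}
	}))
	defer srv.Close()

	m, err := pushLayer(io.LimitReader(zeroReader{}, 8<<20), srv.URL)
	fmt.Println("uncompressed bytes read:", m.bytes, "time spent in Read():", m.inRead, "err:", err)
}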
@mtrmac both of your points are correct. In short, this is happening because an ewma-based decorator is being used, whose goal is to measure “instantaneous” throughput. You just hit an edge case: I'm pretty sure it works ok with pull (download) and not ok with push (upload). Most of the time a download operation involves a slow reader and an upload a slow writer. It means:
- the ewma with ProxyReader combo works ok on download (slow reader)
- the ewma with ProxyReader combo does not work ok on upload (slow writer)
You have several options to fix this. The quickest and simplest is to use average-based decorators instead. But if you'd like to keep the ewma-based ones, here are the options:
- Opt in to ProxyWriter on push operations, or simulate it with io.Pipe in order to keep just ProxyReader around.
- Tune the ewma weight value. The greater the value, the more the measurement behaves like an average one.
You don't have to pick just one approach. I would handle the push case with ProxyWriter and use 90 as the ewma weight value instead of 30 (but you'd better check the end result and tune accordingly). The underlying ewma lib treats 30 as a special case, which defaults to the variant with no need for WARMUP_SAMPLES.
Last but not least, manual bar increments require the use of Ewma-prefixed variants in order for ewma-based decorators to work correctly (ProxyReader and ProxyWriter handle this automatically).
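For reference, a minimal sketch of these suggestions against the mpb/v8 API (hypothetical example code, not the containers/image implementation; the decorator and proxy calls are the ones documented in mpb/v8, while the slowWriter/zeroReader helpers are invented to simulate a slow uplink): an average-based speed decorator, or EWMA with a heavier weight, combined with ProxyWriter so the slow (write) side of a push is what gets timed.

package main

import (
	"io"
	"time"

	"github.com/vbauerster/mpb/v8"
	"github.com/vbauerster/mpb/v8/decor"
)

// slowWriter simulates a slow uplink by sleeping per chunk; it stands in for
// the compression + network side of a push.
type slowWriter struct{}

func (slowWriter) Write(p []byte) (int, error) {
	time.Sleep(5 * time.Millisecond)
	return len(p), nil
}

// zeroReader is an endless stream of zero bytes standing in for layer data.
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) { return len(p), nil }

func main() {
	const total = 16 << 20 // 16 MiB of fake layer data

	p := mpb.New()
	bar := p.AddBar(total,
		mpb.PrependDecorators(decor.Name("Copying blob deadbeef ")),
		mpb.AppendDecorators(
			// Option A: an average-based speed decorator, which behaves
			// sanely no matter which side of the pipeline is slow:
			//   decor.AverageSpeed(decor.SizeB1024(0), "% .1f"),
			// Option B: keep EWMA but raise the weight from 30 toward 90 so
			// it behaves closer to an average:
			decor.EwmaSpeed(decor.SizeB1024(0), "% .1f", 90),
		),
	)

	// On push the writer is the slow side, so time the writes via ProxyWriter.
	// It feeds the EWMA decorator automatically; with manual increments you
	// would need the Ewma-prefixed methods such as bar.EwmaIncrBy instead.
	dst := bar.ProxyWriter(slowWriter{})

	_, _ = io.Copy(dst, io.LimitReader(zeroReader{}, total))
	p.Wait()
}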
@vbauerster Thank you very much for the advice!
To reformulate in simple terms: reading the ProgressReader implementation, the throughput being recorded is only the time spent in Read(). So, in the “slow consumer” situation, we spend a lot of time between each Read call and very little time inside the Read call.
In that case, the throughput computed is not just “within a factor of 2 or so”, but it can be many times the bandwidth of the link. I think that’s the full explanation, with no mysteries left.
Our pipeline design is such that the pipeline is actually executed “inside the destination”, and ultimately the slow write calls are only observed by that destination. In particular, pushes use the Go HTTP stack, where the input is provided as an io.Reader; we don’t have access to the Write-like operations that happen on the network socket deep inside http.Transport. That seems pretty hard to integrate with the ProxyWriter code measuring time spent.
One possible ~general solution for us might be to replace ProxyReader with a custom implementation which does not measure just the time inside Read, but all of the time since the previous Read finished. That would, I think, end up reporting the effective throughput of the pipeline — but still in terms of the uncompressed input, so within a factor of ~2 of the network throughput.
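A from-scratch sketch of that idea (hypothetical code, not something that exists in containers/image; the pipelineTimedReader, slowSink, and zeroReader names are invented): a reader wrapper that charges each Read with all of the wall-clock time since the previous Read returned and reports it with an Ewma-prefixed increment, so the EWMA decorator ends up tracking the effective throughput of the whole pipeline.

package main

import (
	"io"
	"time"

	"github.com/vbauerster/mpb/v8"
	"github.com/vbauerster/mpb/v8/decor"
)

// pipelineTimedReader charges each Read with all of the wall-clock time since
// the previous Read returned, not just the time spent inside Read itself, and
// reports it to the bar with an Ewma-prefixed increment.
type pipelineTimedReader struct {
	r        io.Reader
	bar      *mpb.Bar
	lastDone time.Time // when the previous Read returned
}

func (p *pipelineTimedReader) Read(buf []byte) (int, error) {
	if p.lastDone.IsZero() {
		p.lastDone = time.Now()
	}
	n, err := p.r.Read(buf)
	now := time.Now()
	// The interval includes the time the consumer (compressor + upload)
	// spent before asking for more data, so the EWMA tracks the effective
	// throughput of the whole pipeline rather than the local disk read.
	p.bar.EwmaIncrBy(n, now.Sub(p.lastDone))
	p.lastDone = now
	return n, err
}

// slowSink stands in for compression plus a slow network upload.
type slowSink struct{}

func (slowSink) Write(p []byte) (int, error) {
	time.Sleep(5 * time.Millisecond)
	return len(p), nil
}

// zeroReader stands in for uncompressed layer data read from storage.
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) { return len(p), nil }

func main() {
	const total = 16 << 20 // 16 MiB

	pool := mpb.New()
	bar := pool.AddBar(total,
		mpb.AppendDecorators(decor.EwmaSpeed(decor.SizeB1024(0), "% .1f", 30)),
	)

	src := &pipelineTimedReader{r: io.LimitReader(zeroReader{}, total), bar: bar}
	_, _ = io.Copy(slowSink{}, src)
	pool.Wait()
}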
An additional complication for push is that we know the size of the input, but we don’t know the size of the to-be-compressed output. So we would, ideally, want to report the progress percentage in terms of the uncompressed data, but throughput in terms of the compressed data — both on one progress bar line.
I’m sure that can be built (completely from scratch, or maybe by somehow providing a custom BarFiller), but it’s not something I plan to immediately work on.
Last but not least, manual bar increments require the use of Ewma-prefixed variants in order for ewma-based decorators to work correctly (ProxyReader and ProxyWriter handle this automatically).
Thanks! https://github.com/containers/image/pull/2308 should fix that.
To reformulate in simple terms: reading the ProgressReader implementation, the throughput being recorded is only the time spent in Read(). So, in the “slow consumer” situation, we spend a lot of time between each Read call and very little time inside the Read call.
Correct, and the opposite is true for the ProxyWriter implementation, where only the time spent in Write([]byte) is recorded.
One possible ~general solution for us might be to replace ProxyReader with a custom implementation which does not measure just the time inside Read, but all of the time since the previous Read finished. That would, I think, end up reporting the effective throughput of the pipeline — but still in terms of the uncompressed input, so within a factor of ~2 of the network throughput.
If you don't have easy access to a writer, then such a ~general solution is very plausible. It's worth mentioning one more time that average decorators just work as expected most of the time.
I think if we spend 10 minutes uploading a layer and then the upload stalls, we would want to converge to showing a throughput of 0 reasonably quickly; not report half the throughput after another 10 minutes and a quarter after another 20 minutes.
But then again, in most situations pushes complete well within a minute, or maybe within a few; so that might not be worth worrying about too much.
From the perspective of a mildly-informed onlooker who noticed the currently busted progress indications: this is indeed an edge case, but it's also an important one. When the upload pace is consistent the algorithm hardly matters - ewma, average, etc. will all look the same for constant inputs.
The biggest value of a progress bar is for longer operations (many minutes) to reflect an overall change in the shape of the operation - "the first half went at 10MB/s and the second half is now at 200KB/s" - so while I appreciate that there is only finite time to spend on niceties like this, I do not think you're worrying about it too much at all, since these edge cases are where the updating display is most useful :)
In that case, the throughput computed is not just “within a factor of 2 or so”, but it can be many times the bandwidth of the link. I think that’s the full explanation, with no mysteries left.
Agreed. I see a fairly constant 800 MiB/s - 1.2 GiB/s value for a 272 MiB layer that has been pushing for 3 minutes before it even breaks 150 MiB transferred.
Glad I found this ticket, I was starting to think I was going a little stir crazy :smile:
I'm seeing order-of-magnitude issues: podman reports 1.2 GiB/s when in reality the network stack is running at 12 Mbps (~1.1 MiB/s).
podman-4.9.3-1.fc39.x86_64
Looks to be off by a factor of ~1024; built from b8a684b64d77fc97b3b539dd5891319756c15c39.
The same issue (wrong and fluctuating throughput) is also present for pulls. Fixing one will likely also fix the other, but I wanted to make a note of it.
I'm seeing the same issue with podman version 5.2.2 on Arch Linux when pushing an image. It roughly looks like the reported speed is a factor of 1024 too large (a rough, non-exact estimate).
Just hit this with podman 5.3.1
Copying blob fad12fca6ab3 [=====>-----------------------------] 1.5GiB / 9.0GiB | 739.6 MiB/s
I wish my connection was this fast.
I'm seeing the same issue; ~1024x too fast (it's reporting 1.4GiB/s when it should be around that in MiB/s). The number goes up and down a little bit as it progresses so it seems to be recalculating sensibly, apart from being off by this factor.
This is podman 5.0.3 on Ubuntu Oracular.
These two layers took multiple minutes to upload - if the speed estimate were accurate, they would have been done in seconds.