WIP: run e2e test on tmpfs
Follow-up to commit eaf60c7fe7; let's see how badly things are going to break.
Does this PR introduce a user-facing change?
None
Ephemeral COPR build failed. @containers/packit-build please check.
Error logs are misleading. Actual error seems to be in `BeforeEach`:
# podman [options] load -q -i /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar
Error: payload does not match any of the supported image formats:
* oci: open /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar/index.json: not a directory
* oci-archive: loading index: open /var/tmp/container_images_oci1129938589/index.json: no such file or directory
* docker-archive: writing blob: adding layer with blob "sha256:5067dfc06cd159373bab350ebcb8986dd729e64324592b6f28bbcdb6fc5efda0": processing tar file(write /usr/lib64/dri/nouveau_drv_video.so: no space left on device): exit status 1
* dir: open /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar/manifest.json: not a directory
Yeah, I think the first step is to remove the big images we use; the image dir is already over 1.1 GB, so getting that down should come first. Maybe we can get away with 4 GB of RAM then. Also, why the hell did this even run if the setup failed?! That needs fixing too.
This maybe?
diff --git a/test/e2e/common_test.go b/test/e2e/common_test.go
index bc1f73b82..f617c3a58 100644
--- a/test/e2e/common_test.go
+++ b/test/e2e/common_test.go
@@ -1027,6 +1027,7 @@ func (p *PodmanTestIntegration) RestoreArtifactToCache(image string) error {
p.Root = p.ImageCacheDir
restore := p.PodmanNoEvents([]string{"load", "-q", "-i", tarball})
restore.WaitWithDefaultTimeout()
+ Expect(restore).To(ExitCleanly())
}
return nil
}
yes
Q: why did debian pass? A: debian VM images do not have tmpfs mounted at /tmp
A: debian VM images do not have tmpfs mounted at /tmp
Yeah... I noticed that while playing with my fedora tmpfs changes. AFAICT that is the Debian default, not something we do in VM creation, so I left it as-is.
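For reference, a quick way to confirm what a given VM image does with /tmp is to check the filesystem magic. A minimal Go sketch (not part of this PR; the helper name is mine), assuming Linux and golang.org/x/sys/unix:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// isTmpfs reports whether the filesystem backing path is a tmpfs.
// Sketch only: the e2e suite has no such helper.
func isTmpfs(path string) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return false, err
	}
	return st.Type == unix.TMPFS_MAGIC, nil
}

func main() {
	onTmpfs, err := isTmpfs("/tmp")
	if err != nil {
		panic(err)
	}
	// true on the current Fedora VM images, false on the Debian ones
	fmt.Println("/tmp on tmpfs:", onTmpfs)
}
```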
Filed https://github.com/containers/automation_images/issues/350 to track this, but let's first see how much of a difference this really makes here.
I think you're onto something. Ballpark shows tests running in ~30m, down from ~40m.
type | distro | user | DB | local | remote | container
---|---|---|---|---|---|---
int | rawhide | root | | 27:37 | 29:44 |
int | rawhide | rootless | | 28:55 | |
int | fedora-39 | root | | 27:42 | 28:59 | !31:15
int | fedora-39 | rootless | | 27:02 | |
int | fedora-38 | root | boltdb | 38:25 | 35:19 | !29:37
int | fedora-38 | rootless | boltdb | !28:17 | |
int | debian-13 | root | | 32:49 | 31:52 |
int | debian-13 | rootless | | 29:21 | |
sys | rawhide | root | | 01:04 | 01:13 |
sys | rawhide | rootless | | 01:07 | |
Yeah, that looks like a solid start. Not sure how accurate the time report is:
https://api.cirrus-ci.com/v1/artifact/task/4837578803773440/runner_stats/int%20podman%20fedora-39%20root%20host%20sqlite-runner_stats.log
https://api.cirrus-ci.com/v1/artifact/task/5963478710616064/runner_stats/int%20podman%20fedora-38%20root%20host%20boltdb-runner_stats.log
We clearly got the system time down a lot, which signals IO is a problem. Also, we still did not max out the CPU per the stats reporting in the Cirrus UI, which surprises me a bit because locally this goes full out and I see close to 100% CPU usage on all cores. But maybe the Cirrus graph is not counting everything?
@edsantiago What do you think about the TMPDIR change? No idea if it will pass, but it gives us a random dir each time, so it should not allow anyone to hard-code /tmp/something paths. It is always on /tmp though, so unless we manually mount a new tmpfs at a random root dir, I don't think there is a better way.
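In essence the change boils down to something like this simplified sketch (directory prefix and variable names here are illustrative, not the actual common_test.go code):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Create a random per-run directory under /tmp and point TMPDIR at it,
	// so no test can get away with hard-coding a /tmp/something path.
	dir, err := os.MkdirTemp("/tmp", "podman_test")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir) // suite teardown would remove it again
	os.Setenv("TMPDIR", dir)
	fmt.Println("tests will use", os.Getenv("TMPDIR"))
}
```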
I like it.
Oops: containerized
Don't re-push yet though; I'm really curious to see how this'll work out
Which is perfect as it shows my check actually works, and yeah I let it run and will continue tomorrow.
I can't reproduce. Doesn't seem to be an AVC. Could it be that the new remount is adding noexec? (SWAG. I haven't looked into it deeply enough and am unlikely to do so today.)
Which of the failures are you talking about? There are just too many to make sense of this. /tmp does not have noexec set; if it did, no container would be able to run in the e2e tests.
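For completeness, the mount flags can be checked directly on a VM. A minimal sketch, assuming Linux and golang.org/x/sys/unix (not something the test suite does):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Report whether /tmp is mounted noexec; a noexec /tmp would break
	// every container run in the e2e tests.
	var st unix.Statfs_t
	if err := unix.Statfs("/tmp", &st); err != nil {
		panic(err)
	}
	fmt.Println("noexec:", st.Flags&unix.ST_NOEXEC != 0)
}
```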
type | distro | user | DB | local | remote | container
---|---|---|---|---|---|---
int | rawhide | root | | 28:12 | 26:42 |
int | rawhide | rootless | | 27:36 | |
int | fedora-39 | root | | 28:07 | 27:59 | 27:15
int | fedora-39 | rootless | | 27:04 | |
int | fedora-38 | root | boltdb | 29:11 | 30:03 | 26:52
int | fedora-38 | rootless | boltdb | 29:46 | |
int | debian-13 | root | | 29:36 | 28:29 |
int | debian-13 | rootless | | !29:14 | |
Seems to be a bit under 30m now; looking at other PRs, they seem to be in the 35-40+ minute range, so I think it is safe to say this is a noticeable change.
I will try to drop the toolbox change as I think it also helps a bit and I only want to see the tmpfs diff.
[+1075s] not ok 285 [125] podman export, alter tarball, re-import
# tags: distro-integration
# (in test file test/system/[125-import.bats, line 89](https://github.com/containers/podman/blob/9fa6d0d5cc375790a143f33817303c32b0846be4/test/system/125-import.bats#L89))
# `tar -C $PODMAN_TMPDIR -rf $PODMAN_TMPDIR/$b_cnt.tar tmp/testfile2' failed with status 2
#
<+ > # # podman rm -t 0 --all --force --ignore
#
<+046ms> # # podman ps --all --external --format {{.ID}} {{.Names}}
#
<+051ms> # # podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
<+045ms> # quay.io/libpod/testimage:20240123 1f6acd4c4a1d
#
<+608ms> # # podman build -t before_change_img /tmp/CI_GsPZ/podman_bats.HLYIBo
<+485ms> # STEP 1/3: FROM quay.io/libpod/testimage:20240123
# STEP 2/3: ADD testfile1 /tmp
# --> a9813c7c65da
# STEP 3/3: WORKDIR /tmp
# COMMIT before_change_img
# --> ef8a0675b6e9
# Successfully tagged localhost/before_change_img:latest
# ef8a0675b6e9ad6e9a85d112ec86f3753290f88b763f592381b0fd406950567d
#
<+009ms> # # podman create --name before_change_cnt before_change_img
<+085ms> # cb630d35167fa2d230f93076774d4fda2f2747f63f3fb07d3a52a25b65fe2287
#
<+009ms> # # podman export -o /tmp/CI_GsPZ/podman_bats.HLYIBo/before_change_cnt.tar before_change_cnt
#
<+159ms> # # podman rm -t 0 -f before_change_cnt
<+075ms> # before_change_cnt
# # tar --delete -f (tmpdir)/before_change_cnt.tar tmp/testfile1
# # tar -C (tmpdir) -rf (tmpdir)/before_change_cnt.tar tmp/testfile2
# tar: Skipping to next header
# tar: Skipping to next header
# tar: Exiting with failure status due to previous errors
I suspect we hit another tar bug now in debian, maybe the same as https://github.com/containers/podman/issues/19407??
Other issue:
rm: cannot remove '/tmp/CI_0FhD/buildah3968082470/mnt': Permission denied
Seems to be the removal of the new TMPDIR I create. Not seen on every run, but it seems to be rootless-only, so I guess it could be due to a missing `podman unshare` and leaked files with different owners...
But the real issue is that we are leaking temporary buildah files, which is really not good.
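If the leftovers really are owned by a subordinate UID, a plain rm from the host user will keep failing; the usual workaround is to remove them from inside the user namespace. A rough sketch of what such a cleanup could look like (the path is illustrative, not existing test code):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Leftover rootless files can be owned by a subordinate UID, so remove
	// them from inside the user namespace via `podman unshare`.
	dir := "/tmp/CI_XXXX/leftover-buildah-dir" // illustrative path
	cmd := exec.Command("podman", "unshare", "rm", "-rf", dir)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "cleanup failed:", err)
	}
}
```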
debian VMs got a broken tar despite all efforts: https://github.com/containers/automation_images/pull/349#issuecomment-2095825625
@Luap99 if you intend to also bump the memory size of the VMs in this PR, one word of caution: GCP is "funny" about the CPUs-to-memory ratio. It's weirdly complex, like the number of CPUs must be <= 1/2 the RAM in GB. In any case, you'll get a mostly-readable error message if it's not happy with the values.
Also, in case it helps, you can now specify the GCP VMs by `type:` name (similar to how AWS does it). This may be a better way to go (now or in the future) because not all "types" are created equal: some use newer/better/faster CPU models, or come with built-in optimizations. IIRC, if you specify CPU/RAM instead of `type:`, it gives you "first available", which may not be the latest/greatest Intel/AMD CPU.
Yes, sure, that is the next step, but one step at a time; for now let's get the tmpfs change done.
Yep, np. Was just concerned if maybe there were some failures due to OOM (I haven't looked).
Cockpit tests failed for commit 385c4931653d1a66536f6b7c05597dbf584de333. @martinpitt, @jelly, @mvollmer please check.
Ephemeral COPR build failed. @containers/packit-build please check.
This is good to review now, I think tests are going to pass this time around @cevich @edsantiago @containers/podman-maintainers PTAL
Cockpit tests failed for commit 92338642d14adca907e6002dee82255a3f9032ef. @martinpitt, @jelly, @mvollmer please check.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: edsantiago, Luap99
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [Luap99,edsantiago]
Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.
Timing results:
type | distro | user | DB | local | remote | container
---|---|---|---|---|---|---
int | rawhide | root | | 31:02 | 26:26 |
int | rawhide | rootless | | 29:28 | |
int | fedora-40 | root | | 28:04 | 26:47 | 26:28
int | fedora-40 | rootless | | 27:42 | |
int | fedora-39 | root | boltdb | 28:11 | 30:25 | 28:29
int | fedora-39 | rootless | boltdb | 29:47 | |
int | debian-13 | root | | 29:23 | 28:56 |
int | debian-13 | rootless | | 25:56 | |
Seems to shave 2-5 minutes on podman local, maybe 5-8 remote.
/lgtm