
WIP: run e2e test on tmpfs

Open Luap99 opened this issue 9 months ago • 20 comments

Follow-up to commit eaf60c7fe7; let's see how badly things are going to break.

Does this PR introduce a user-facing change?

None

Luap99 avatar Apr 29 '24 13:04 Luap99

Ephemeral COPR build failed. @containers/packit-build please check.

Error logs are misleading. Actual error seems to be in the BeforeEach setup:

```
# podman [options] load -q -i /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar
Error: payload does not match any of the supported image formats:
 * oci: open /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar/index.json: not a directory
 * oci-archive: loading index: open /var/tmp/container_images_oci1129938589/index.json: no such file or directory
 * docker-archive: writing blob: adding layer with blob "sha256:5067dfc06cd159373bab350ebcb8986dd729e64324592b6f28bbcdb6fc5efda0": processing tar file(write /usr/lib64/dri/nouveau_drv_video.so: no space left on device): exit status 1
 * dir: open /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar/manifest.json: not a directory
```
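
The "no space left on device" means the filesystem backing the test storage filled up. A quick way to see how much room a mount really has (a minimal sketch, assuming golang.org/x/sys/unix; the path is taken from the log above):

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Path from the failing "podman load" above.
	path := "/var/tmp"

	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Bavail counts blocks available to unprivileged users.
	freeGB := float64(st.Bavail) * float64(st.Bsize) / 1e9
	fmt.Printf("%s: %.2f GB free\n", path, freeGB)
}
```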

edsantiago avatar Apr 29 '24 14:04 edsantiago

> Error logs are misleading. Actual error seems to be in the BeforeEach setup: [...]

Yeah, I think the first step is to remove the big images we use; the image dir is already over 1.1 GB, so we should get that down first. Maybe we can get away with 4 GB RAM then. Also, why the hell did this even run if the setup failed?! That needs fixing too.
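
For reference, a quick tally of what the cache dir holds (a sketch; the path is hypothetical, substitute the suite's actual image cache dir):

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

func main() {
	// Hypothetical location; the e2e suite keeps its own ImageCacheDir.
	root := "/tmp/imagecachedir"

	var total int64
	_ = filepath.WalkDir(root, func(_ string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return nil // skip unreadable entries and directories
		}
		if info, ierr := d.Info(); ierr == nil {
			total += info.Size()
		}
		return nil
	})
	fmt.Printf("image cache: %.2f GB\n", float64(total)/1e9)
}
```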

Luap99 avatar Apr 29 '24 15:04 Luap99

This maybe?

```diff
diff --git a/test/e2e/common_test.go b/test/e2e/common_test.go
index bc1f73b82..f617c3a58 100644
--- a/test/e2e/common_test.go
+++ b/test/e2e/common_test.go
@@ -1027,6 +1027,7 @@ func (p *PodmanTestIntegration) RestoreArtifactToCache(image string) error {
 		p.Root = p.ImageCacheDir
 		restore := p.PodmanNoEvents([]string{"load", "-q", "-i", tarball})
 		restore.WaitWithDefaultTimeout()
+		Expect(restore).To(ExitCleanly())
 	}
 	return nil
 }
```

edsantiago avatar Apr 29 '24 16:04 edsantiago

> This maybe? [the diff above, adding `Expect(restore).To(ExitCleanly())` to RestoreArtifactToCache]

yes

Luap99 avatar Apr 29 '24 16:04 Luap99

Q: Why did Debian pass? A: Debian VM images do not have tmpfs mounted at /tmp.

Luap99 avatar Apr 29 '24 16:04 Luap99

> A: Debian VM images do not have tmpfs mounted at /tmp.

Yeah... I noticed that while playing with my Fedora tmpfs changes. AFAICT that is the Debian default, not something we do in VM creation, so I left it as-is.

edsantiago avatar Apr 29 '24 16:04 edsantiago

> AFAICT that is the Debian default, not something we do in VM creation, so I left it as-is.

Filed https://github.com/containers/automation_images/issues/350 to track this, but first let's see how much of a difference this really makes here.

Luap99 avatar Apr 29 '24 16:04 Luap99

I think you're onto something. Ballpark shows tests running in ~30m, down from ~40m:

| type | distro    | user     | DB     | local  | remote | container |
|------|-----------|----------|--------|--------|--------|-----------|
| int  | rawhide   | root     |        | 27:37  | 29:44  |           |
| int  | rawhide   | rootless |        | 28:55  |        |           |
| int  | fedora-39 | root     |        | 27:42  | 28:59  | !31:15    |
| int  | fedora-39 | rootless |        | 27:02  |        |           |
| int  | fedora-38 | root     | boltdb | 38:25  | 35:19  | !29:37    |
| int  | fedora-38 | rootless | boltdb | !28:17 |        |           |
| int  | debian-13 | root     |        | 32:49  | 31:52  |           |
| int  | debian-13 | rootless |        | 29:21  |        |           |
| sys  | rawhide   | root     |        | 01:04  | 01:13  |           |
| sys  | rawhide   | rootless |        | 01:07  |        |           |
edsantiago avatar Apr 29 '24 17:04 edsantiago

Yeah, that looks like a solid start; not sure how accurate the time report is: https://api.cirrus-ci.com/v1/artifact/task/4837578803773440/runner_stats/int%20podman%20fedora-39%20root%20host%20sqlite-runner_stats.log https://api.cirrus-ci.com/v1/artifact/task/5963478710616064/runner_stats/int%20podman%20fedora-38%20root%20host%20boltdb-runner_stats.log

We clearly got the system time down a lot, which signals IO is a bottleneck. Also, we still did not max out the CPU per the stats reported in the Cirrus UI, which surprises me a bit, because locally this goes full-out and I see close to 100% CPU usage on all cores. But maybe the Cirrus graph is not counting everything?

Luap99 avatar Apr 29 '24 18:04 Luap99

@edsantiago What do you think about the TMPDIR change? No idea if it will pass, but it gives us a random dir each time, so nobody should be able to hard-code /tmp/something paths. It is still on /tmp though, so unless we manually mount a new tmpfs at some random root dir I don't think there is a better way.
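
A minimal sketch of the idea (helper names are mine, not the PR's): make a fresh random directory under /tmp per run and point TMPDIR at it, so any hard-coded /tmp/something path breaks immediately:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// MkdirTemp("", ...) creates under the current temp root (/tmp, the
	// tmpfs on these VMs) with a random suffix, e.g. /tmp/podman_e2e123456.
	tmpDir, err := os.MkdirTemp("", "podman_e2e")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(tmpDir)

	// Anything that honors TMPDIR (os.TempDir, mktemp, podman's own
	// temp files) now lands in the per-run directory.
	os.Setenv("TMPDIR", tmpDir)
	fmt.Println("TMPDIR =", os.TempDir())
}
```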

Luap99 avatar May 02 '24 16:05 Luap99

I like it.

edsantiago avatar May 02 '24 16:05 edsantiago

Oops: containerized

Don't re-push yet though; I'm really curious to see how this'll work out

edsantiago avatar May 02 '24 17:05 edsantiago

> Oops: containerized
>
> Don't re-push yet though; I'm really curious to see how this'll work out

Which is perfect, as it shows my check actually works. And yeah, I'll let it run and continue tomorrow.

Luap99 avatar May 02 '24 17:05 Luap99

I can't reproduce. Doesn't seem to be AVC. Could it be that the new remount is adding noexec? (SWAG. I haven't looked into it deeply enough and am unlikely to do so today)

edsantiago avatar May 02 '24 18:05 edsantiago

> I can't reproduce. Doesn't seem to be AVC. Could it be that the new remount is adding noexec? (SWAG. I haven't looked into it deeply enough and am unlikely to do so today)

Which of the failures are you talking about? There are just too many to make sense of this. /tmp does not have noexec set; if it did, no container would be able to run in the e2e tests.
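
For the record, that is easy to verify (a Linux-only sketch, assuming golang.org/x/sys/unix): if a remount had added noexec, ST_NOEXEC would show up in the statfs mount flags:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	var st unix.Statfs_t
	if err := unix.Statfs("/tmp", &st); err != nil {
		panic(err)
	}
	noexec := st.Flags&unix.ST_NOEXEC != 0
	isTmpfs := st.Type == unix.TMPFS_MAGIC
	fmt.Printf("/tmp: tmpfs=%v noexec=%v\n", isTmpfs, noexec)
}
```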

Luap99 avatar May 03 '24 09:05 Luap99

| type | distro    | user     | DB     | local  | remote | container |
|------|-----------|----------|--------|--------|--------|-----------|
| int  | rawhide   | root     |        | 28:12  | 26:42  |           |
| int  | rawhide   | rootless |        | 27:36  |        |           |
| int  | fedora-39 | root     |        | 28:07  | 27:59  | 27:15     |
| int  | fedora-39 | rootless |        | 27:04  |        |           |
| int  | fedora-38 | root     | boltdb | 29:11  | 30:03  | 26:52     |
| int  | fedora-38 | rootless | boltdb | 29:46  |        |           |
| int  | debian-13 | root     |        | 29:36  | 28:29  |           |
| int  | debian-13 | rootless |        | !29:14 |        |           |

Seems to be a bit under 30m now; looking at other PRs, they seem to be in the 35-40+ minute range, so I think it is safe to say this is a noticeable change.

I will try dropping the toolbox change, as I think it also helps a bit and I only want to see the tmpfs diff.

Luap99 avatar May 03 '24 11:05 Luap99

[+1075s] not ok 285 [125] podman export, alter tarball, re-import
         # tags: distro-integration
         # (in test file test/system/[125-import.bats, line 89](https://github.com/containers/podman/blob/9fa6d0d5cc375790a143f33817303c32b0846be4/test/system/125-import.bats#L89))
         #   `tar -C $PODMAN_TMPDIR -rf $PODMAN_TMPDIR/$b_cnt.tar tmp/testfile2' failed with status 2
         #
<+     > # # podman rm -t 0 --all --force --ignore
         #
<+046ms> # # podman ps --all --external --format {{.ID}} {{.Names}}
         #
<+051ms> # # podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
<+045ms> # quay.io/libpod/testimage:20240123 1f6acd4c4a1d
         #
<+608ms> # # podman build -t before_change_img /tmp/CI_GsPZ/podman_bats.HLYIBo
<+485ms> # STEP 1/3: FROM quay.io/libpod/testimage:20240123
         # STEP 2/3: ADD testfile1 /tmp
         # --> a9813c7c65da
         # STEP 3/3: WORKDIR /tmp
         # COMMIT before_change_img
         # --> ef8a0675b6e9
         # Successfully tagged localhost/before_change_img:latest
         # ef8a0675b6e9ad6e9a85d112ec86f3753290f88b763f592381b0fd406950567d
         #
<+009ms> # # podman create --name before_change_cnt before_change_img
<+085ms> # cb630d35167fa2d230f93076774d4fda2f2747f63f3fb07d3a52a25b65fe2287
         #
<+009ms> # # podman export -o /tmp/CI_GsPZ/podman_bats.HLYIBo/before_change_cnt.tar before_change_cnt
         #
<+159ms> # # podman rm -t 0 -f before_change_cnt
<+075ms> # before_change_cnt
         # # tar --delete -f (tmpdir)/before_change_cnt.tar tmp/testfile1
         # # tar -C (tmpdir) -rf (tmpdir)/before_change_cnt.tar tmp/testfile2
         # tar: Skipping to next header
         # tar: Skipping to next header
         # tar: Exiting with failure status due to previous errors

I suspect we're now hitting another tar bug in Debian, maybe the same as https://github.com/containers/podman/issues/19407?


Other issue:

rm: cannot remove '/tmp/CI_0FhD/buildah3968082470/mnt': Permission denied

Seems to be the removal of the new TMPDIR I create; it's not seen on every run, but it seems to be rootless-only, so I guess it could be due to a missing podman unshare and leaked files with different owners... But the real issue is that we are leaking temporary buildah files, which is really not good.
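
If it is the rootless-ownership problem, a plain rm by the host user fails with EPERM on files created as a mapped UID; the cleanup would have to re-enter the user namespace via podman unshare. A sketch (the directory name is just the one from the log above):

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// Leaked buildah mount dir from the log above; under rootless podman
	// its contents may be owned by subuid-mapped IDs, so the host user
	// cannot remove them directly.
	dir := "/tmp/CI_0FhD/buildah3968082470"

	// "podman unshare" runs the command inside the user namespace, where
	// those mapped IDs resolve and rm is allowed.
	cmd := exec.Command("podman", "unshare", "rm", "-rf", dir)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	_ = cmd.Run()
}
```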

Luap99 avatar May 03 '24 16:05 Luap99

Debian VMs got a broken tar despite all efforts: https://github.com/containers/automation_images/pull/349#issuecomment-2095825625

edsantiago avatar May 06 '24 11:05 edsantiago

@Luap99 if you intend to also bump the memory size of the VMs in this PR, one word of caution:

GCP is "funny" about the CPUs-to-Memory ratio. It's weirdly complex, like the # CPUs must be <= 1/2 Ram Gb. In any case, you'll get a mostly-readable error message if it's not happy with the values.

Also, in case it helps, you can now specify the GCP VMs by type: name (similar to how AWS does it). This may be a better way to go (now or in the future) because not all "types" are created equal: some use newer/better/faster CPU models, or come with built-in optimizations. IIRC, if you specify CPU/RAM instead of type:, it gives you "first available", which may not be the latest/greatest Intel/AMD CPU.

cevich avatar May 07 '24 20:05 cevich

> @Luap99 if you intend to also bump the memory size of the VMs in this PR, one word of caution: GCP is "funny" about the CPUs-to-memory ratio. [...] you can now specify the GCP VMs by type: name (similar to how AWS does it). [...]

Yes, sure, that is the next step, but one step at a time; for now let's get the tmpfs change done.

Luap99 avatar May 08 '24 09:05 Luap99

> Yes, sure, that is the next step, but one step at a time; for now let's get the tmpfs change done.

Yep, np. Was just concerned there might be some failures due to OOM (I haven't looked).

cevich avatar May 08 '24 18:05 cevich

Cockpit tests failed for commit 385c4931653d1a66536f6b7c05597dbf584de333. @martinpitt, @jelly, @mvollmer please check.

Ephemeral COPR build failed. @containers/packit-build please check.

This is good to review now; I think tests are going to pass this time around. @cevich @edsantiago @containers/podman-maintainers PTAL

Luap99 avatar May 13 '24 15:05 Luap99

Cockpit tests failed for commit 92338642d14adca907e6002dee82255a3f9032ef. @martinpitt, @jelly, @mvollmer please check.

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [Luap99,edsantiago]

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar May 13 '24 16:05 openshift-ci[bot]

Timing results:

| type | distro    | user     | DB     | local | remote | container |
|------|-----------|----------|--------|-------|--------|-----------|
| int  | rawhide   | root     |        | 31:02 | 26:26  |           |
| int  | rawhide   | rootless |        | 29:28 |        |           |
| int  | fedora-40 | root     |        | 28:04 | 26:47  | 26:28     |
| int  | fedora-40 | rootless |        | 27:42 |        |           |
| int  | fedora-39 | root     | boltdb | 28:11 | 30:25  | 28:29     |
| int  | fedora-39 | rootless | boltdb | 29:47 |        |           |
| int  | debian-13 | root     |        | 29:23 | 28:56  |           |
| int  | debian-13 | rootless |        | 25:56 |        |           |

Seems to shave 2-5 minutes on podman local, maybe 5-8 remote.

edsantiago avatar May 13 '24 17:05 edsantiago

/lgtm

rhatdan avatar May 13 '24 18:05 rhatdan