
WIP: run e2e test on tmpfs

Open Luap99 opened this issue 9 months ago • 20 comments

Follow-up to commit eaf60c7fe7; let's see how badly things are going to break.

Does this PR introduce a user-facing change?

None

Luap99 avatar Apr 29 '24 13:04 Luap99

Ephemeral COPR build failed. @containers/packit-build please check.

Error logs are misleading. Actual error seems to be in the BeforeEach setup:

```
# podman [options] load -q -i /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar
Error: payload does not match any of the supported image formats:
 * oci: open /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar/index.json: not a directory
 * oci-archive: loading index: open /var/tmp/container_images_oci1129938589/index.json: no such file or directory
 * docker-archive: writing blob: adding layer with blob "sha256:5067dfc06cd159373bab350ebcb8986dd729e64324592b6f28bbcdb6fc5efda0": processing tar file(write /usr/lib64/dri/nouveau_drv_video.so: no space left on device): exit status 1
 * dir: open /var/tmp/registry.fedoraproject.org-fedora-toolbox-36.tar/manifest.json: not a directory
```
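
The "no space left on device" means the filesystem backing the test storage filled up. A quick way to see how much room a mount really has (a minimal sketch, assuming golang.org/x/sys/unix; the path is taken from the log above):

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Path from the failing "podman load" above.
	path := "/var/tmp"

	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Bavail counts blocks available to unprivileged users.
	freeGB := float64(st.Bavail) * float64(st.Bsize) / 1e9
	fmt.Printf("%s: %.2f GB free\n", path, freeGB)
}
```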

edsantiago avatar Apr 29 '24 14:04 edsantiago

> Error logs are misleading. Actual error seems to be in the BeforeEach setup: [...]

Yeah, I think the first step is to remove the big images we use; the image dir is already over 1.1 GB, so we should get that down first. Maybe we can get away with 4 GB RAM then. Also, why the hell did this even run if the setup failed?! That needs fixing too.
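
For reference, a quick tally of what the cache dir holds (a sketch; the path is hypothetical, substitute the suite's actual image cache dir):

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

func main() {
	// Hypothetical location; the e2e suite keeps its own ImageCacheDir.
	root := "/tmp/imagecachedir"

	var total int64
	_ = filepath.WalkDir(root, func(_ string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return nil // skip unreadable entries and directories
		}
		if info, ierr := d.Info(); ierr == nil {
			total += info.Size()
		}
		return nil
	})
	fmt.Printf("image cache: %.2f GB\n", float64(total)/1e9)
}
```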

Luap99 avatar Apr 29 '24 15:04 Luap99

This maybe?

```diff
diff --git a/test/e2e/common_test.go b/test/e2e/common_test.go
index bc1f73b82..f617c3a58 100644
--- a/test/e2e/common_test.go
+++ b/test/e2e/common_test.go
@@ -1027,6 +1027,7 @@ func (p *PodmanTestIntegration) RestoreArtifactToCache(image string) error {
 		p.Root = p.ImageCacheDir
 		restore := p.PodmanNoEvents([]string{"load", "-q", "-i", tarball})
 		restore.WaitWithDefaultTimeout()
+		Expect(restore).To(ExitCleanly())
 	}
 	return nil
 }
```

edsantiago avatar Apr 29 '24 16:04 edsantiago

> This maybe? [the diff above, adding `Expect(restore).To(ExitCleanly())` to RestoreArtifactToCache]

yes

Luap99 avatar Apr 29 '24 16:04 Luap99

Q: Why did Debian pass? A: Debian VM images do not have tmpfs mounted at /tmp.

Luap99 avatar Apr 29 '24 16:04 Luap99

> A: Debian VM images do not have tmpfs mounted at /tmp.

Yeah... I noticed that while playing with my Fedora tmpfs changes. AFAICT that is the Debian default, not something we do in VM creation, so I left it as-is.

edsantiago avatar Apr 29 '24 16:04 edsantiago

> AFAICT that is the Debian default, not something we do in VM creation, so I left it as-is.

Filed https://github.com/containers/automation_images/issues/350 to track this, but first let's see how much of a difference this really makes here.

Luap99 avatar Apr 29 '24 16:04 Luap99

I think you're onto something. Ballpark shows tests running in ~30m, down from ~40m:

| type | distro    | user     | DB     | local  | remote | container |
|------|-----------|----------|--------|--------|--------|-----------|
| int  | rawhide   | root     |        | 27:37  | 29:44  |           |
| int  | rawhide   | rootless |        | 28:55  |        |           |
| int  | fedora-39 | root     |        | 27:42  | 28:59  | !31:15    |
| int  | fedora-39 | rootless |        | 27:02  |        |           |
| int  | fedora-38 | root     | boltdb | 38:25  | 35:19  | !29:37    |
| int  | fedora-38 | rootless | boltdb | !28:17 |        |           |
| int  | debian-13 | root     |        | 32:49  | 31:52  |           |
| int  | debian-13 | rootless |        | 29:21  |        |           |
| sys  | rawhide   | root     |        | 01:04  | 01:13  |           |
| sys  | rawhide   | rootless |        | 01:07  |        |           |
edsantiago avatar Apr 29 '24 17:04 edsantiago

Yeah, that looks like a solid start; not sure how accurate the time report is: https://api.cirrus-ci.com/v1/artifact/task/4837578803773440/runner_stats/int%20podman%20fedora-39%20root%20host%20sqlite-runner_stats.log https://api.cirrus-ci.com/v1/artifact/task/5963478710616064/runner_stats/int%20podman%20fedora-38%20root%20host%20boltdb-runner_stats.log

We clearly got the system time down a lot, which signals IO is a bottleneck. Also, we still did not max out the CPU per the stats reported in the Cirrus UI, which surprises me a bit, because locally this goes full-out and I see close to 100% CPU usage on all cores. But maybe the Cirrus graph is not counting everything?

Luap99 avatar Apr 29 '24 18:04 Luap99

@edsantiago What do you think about the TMPDIR change? No idea if it will pass, but it gives us a random dir each time, so nobody should be able to hard-code /tmp/something paths. It is still on /tmp though, so unless we manually mount a new tmpfs at some random root dir I don't think there is a better way.
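
A minimal sketch of the idea (helper names are mine, not the PR's): make a fresh random directory under /tmp per run and point TMPDIR at it, so any hard-coded /tmp/something path breaks immediately:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// MkdirTemp("", ...) creates under the current temp root (/tmp, the
	// tmpfs on these VMs) with a random suffix, e.g. /tmp/podman_e2e123456.
	tmpDir, err := os.MkdirTemp("", "podman_e2e")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(tmpDir)

	// Anything that honors TMPDIR (os.TempDir, mktemp, podman's own
	// temp files) now lands in the per-run directory.
	os.Setenv("TMPDIR", tmpDir)
	fmt.Println("TMPDIR =", os.TempDir())
}
```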

Luap99 avatar May 02 '24 16:05 Luap99

I like it.

edsantiago avatar May 02 '24 16:05 edsantiago

Oops: containerized

Don't re-push yet though; I'm really curious to see how this'll work out

edsantiago avatar May 02 '24 17:05 edsantiago

> Oops: containerized
>
> Don't re-push yet though; I'm really curious to see how this'll work out

Which is perfect, as it shows my check actually works. And yeah, I'll let it run and continue tomorrow.

Luap99 avatar May 02 '24 17:05 Luap99

I can't reproduce. Doesn't seem to be AVC. Could it be that the new remount is adding noexec? (SWAG. I haven't looked into it deeply enough and am unlikely to do so today)

edsantiago avatar May 02 '24 18:05 edsantiago

> I can't reproduce. Doesn't seem to be AVC. Could it be that the new remount is adding noexec? (SWAG. I haven't looked into it deeply enough and am unlikely to do so today)

Which of the failures are you talking about? There are just too many to make sense of this. /tmp does not have noexec set; if it did, no container would be able to run in the e2e tests.
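
For the record, that is easy to verify (a Linux-only sketch, assuming golang.org/x/sys/unix): if a remount had added noexec, ST_NOEXEC would show up in the statfs mount flags:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	var st unix.Statfs_t
	if err := unix.Statfs("/tmp", &st); err != nil {
		panic(err)
	}
	noexec := st.Flags&unix.ST_NOEXEC != 0
	isTmpfs := st.Type == unix.TMPFS_MAGIC
	fmt.Printf("/tmp: tmpfs=%v noexec=%v\n", isTmpfs, noexec)
}
```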

Luap99 avatar May 03 '24 09:05 Luap99

| type | distro    | user     | DB     | local  | remote | container |
|------|-----------|----------|--------|--------|--------|-----------|
| int  | rawhide   | root     |        | 28:12  | 26:42  |           |
| int  | rawhide   | rootless |        | 27:36  |        |           |
| int  | fedora-39 | root     |        | 28:07  | 27:59  | 27:15     |
| int  | fedora-39 | rootless |        | 27:04  |        |           |
| int  | fedora-38 | root     | boltdb | 29:11  | 30:03  | 26:52     |
| int  | fedora-38 | rootless | boltdb | 29:46  |        |           |
| int  | debian-13 | root     |        | 29:36  | 28:29  |           |
| int  | debian-13 | rootless |        | !29:14 |        |           |

Seems to be a bit under 30m now; looking at other PRs, they seem to be in the 35-40+ minute range, so I think it is safe to say this is a noticeable change.

I will try dropping the toolbox change, as I think it also helps a bit and I only want to see the tmpfs diff.

Luap99 avatar May 03 '24 11:05 Luap99

[+1075s] not ok 285 [125] podman export, alter tarball, re-import
         # tags: distro-integration
         # (in test file test/system/[125-import.bats, line 89](https://github.com/containers/podman/blob/9fa6d0d5cc375790a143f33817303c32b0846be4/test/system/125-import.bats#L89))
         #   `tar -C $PODMAN_TMPDIR -rf $PODMAN_TMPDIR/$b_cnt.tar tmp/testfile2' failed with status 2
         #
<+     > # # podman rm -t 0 --all --force --ignore
         #
<+046ms> # # podman ps --all --external --format {{.ID}} {{.Names}}
         #
<+051ms> # # podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
<+045ms> # quay.io/libpod/testimage:20240123 1f6acd4c4a1d
         #
<+608ms> # # podman build -t before_change_img /tmp/CI_GsPZ/podman_bats.HLYIBo
<+485ms> # STEP 1/3: FROM quay.io/libpod/testimage:20240123
         # STEP 2/3: ADD testfile1 /tmp
         # --> a9813c7c65da
         # STEP 3/3: WORKDIR /tmp
         # COMMIT before_change_img
         # --> ef8a0675b6e9
         # Successfully tagged localhost/before_change_img:latest
         # ef8a0675b6e9ad6e9a85d112ec86f3753290f88b763f592381b0fd406950567d
         #
<+009ms> # # podman create --name before_change_cnt before_change_img
<+085ms> # cb630d35167fa2d230f93076774d4fda2f2747f63f3fb07d3a52a25b65fe2287
         #
<+009ms> # # podman export -o /tmp/CI_GsPZ/podman_bats.HLYIBo/before_change_cnt.tar before_change_cnt
         #
<+159ms> # # podman rm -t 0 -f before_change_cnt
<+075ms> # before_change_cnt
         # # tar --delete -f (tmpdir)/before_change_cnt.tar tmp/testfile1
         # # tar -C (tmpdir) -rf (tmpdir)/before_change_cnt.tar tmp/testfile2
         # tar: Skipping to next header
         # tar: Skipping to next header
         # tar: Exiting with failure status due to previous errors

I suspect we're now hitting another tar bug in Debian, maybe the same as https://github.com/containers/podman/issues/19407?


Other issue:

rm: cannot remove '/tmp/CI_0FhD/buildah3968082470/mnt': Permission denied

Seems to be the removal of the new TMPDIR I create; it's not seen on every run, but it seems to be rootless-only, so I guess it could be due to a missing podman unshare and leaked files with different owners... But the real issue is that we are leaking temporary buildah files, which is really not good.
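
If it is the rootless-ownership problem, a plain rm by the host user fails with EPERM on files created as a mapped UID; the cleanup would have to re-enter the user namespace via podman unshare. A sketch (the directory name is just the one from the log above):

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// Leaked buildah mount dir from the log above; under rootless podman
	// its contents may be owned by subuid-mapped IDs, so the host user
	// cannot remove them directly.
	dir := "/tmp/CI_0FhD/buildah3968082470"

	// "podman unshare" runs the command inside the user namespace, where
	// those mapped IDs resolve and rm is allowed.
	cmd := exec.Command("podman", "unshare", "rm", "-rf", dir)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	_ = cmd.Run()
}
```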

Luap99 avatar May 03 '24 16:05 Luap99

Debian VMs got a broken tar despite all efforts: https://github.com/containers/automation_images/pull/349#issuecomment-2095825625

edsantiago avatar May 06 '24 11:05 edsantiago

@Luap99 if you intend to also bump the memory size of the VMs in this PR, one word of caution:

GCP is "funny" about the CPUs-to-Memory ratio. It's weirdly complex, like the # CPUs must be <= 1/2 Ram Gb. In any case, you'll get a mostly-readable error message if it's not happy with the values.

Also, in case it helps, you can now specify the GCP VMs by type: name (similar to how AWS does it). This may be a better way to go (now or in the future) because not all "types" are created equal: some use newer/better/faster CPU models, or come with built-in optimizations. IIRC, if you specify CPU/RAM instead of type:, it gives you "first available", which may not be the latest/greatest Intel/AMD CPU.

cevich avatar May 07 '24 20:05 cevich

> @Luap99 if you intend to also bump the memory size of the VMs in this PR, one word of caution: GCP is "funny" about the CPUs-to-memory ratio. [...] you can now specify the GCP VMs by type: name (similar to how AWS does it). [...]

Yes, sure, that is the next step, but one step at a time; for now let's get the tmpfs change done.

Luap99 avatar May 08 '24 09:05 Luap99

> Yes, sure, that is the next step, but one step at a time; for now let's get the tmpfs change done.

Yep, np. Was just concerned there might be some failures due to OOM (I haven't looked).

cevich avatar May 08 '24 18:05 cevich

Cockpit tests failed for commit 385c4931653d1a66536f6b7c05597dbf584de333. @martinpitt, @jelly, @mvollmer please check.

Ephemeral COPR build failed. @containers/packit-build please check.

This is good to review now; I think tests are going to pass this time around. @cevich @edsantiago @containers/podman-maintainers PTAL

Luap99 avatar May 13 '24 15:05 Luap99

Cockpit tests failed for commit 92338642d14adca907e6002dee82255a3f9032ef. @martinpitt, @jelly, @mvollmer please check.

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [Luap99,edsantiago]

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar May 13 '24 16:05 openshift-ci[bot]

Timing results:

| type | distro    | user     | DB     | local | remote | container |
|------|-----------|----------|--------|-------|--------|-----------|
| int  | rawhide   | root     |        | 31:02 | 26:26  |           |
| int  | rawhide   | rootless |        | 29:28 |        |           |
| int  | fedora-40 | root     |        | 28:04 | 26:47  | 26:28     |
| int  | fedora-40 | rootless |        | 27:42 |        |           |
| int  | fedora-39 | root     | boltdb | 28:11 | 30:25  | 28:29     |
| int  | fedora-39 | rootless | boltdb | 29:47 |        |           |
| int  | debian-13 | root     |        | 29:23 | 28:56  |           |
| int  | debian-13 | rootless |        | 25:56 |        |           |

Seems to shave 2-5 minutes on podman local, maybe 5-8 remote.

edsantiago avatar May 13 '24 17:05 edsantiago

/lgtm

rhatdan avatar May 13 '24 18:05 rhatdan