buildah icon indicating copy to clipboard operation
buildah copied to clipboard

bats tests - parallelize

Open edsantiago opened this issue 1 year ago • 6 comments

All bats tests run with custom root/runroot, so it should be possible to parallelize them.

(As of this initial commit, tests fail on my laptop, and I expect them to fail here. I just want to get a sense for how things go.)

Signed-off-by: Ed Santiago [email protected]

None

edsantiago avatar May 29 '24 17:05 edsantiago

Ephemeral COPR build failed. @containers/packit-build please check.

A friendly reminder that this PR had no activity for 30 days.

github-actions[bot] avatar Sep 20 '24 00:09 github-actions[bot]

(A brief look after a link from https://github.com/containers/skopeo/issues/2437 )

zstd found even though push was without --force-compression: My first guess is that the FROM alpine; _EOF - built image (via build caching?) matches an image created in some other concurrently running test, and that causes BlobInfoCache to have a record about the zstd-compressed layer version.

For this purpose, it would be better to have the image’s layers be clearly independent from any other test — maybe a FROM scratch adding a file that contains the test name + timestamp.

mtrmac avatar Oct 10 '24 16:10 mtrmac

… oh, and: debug-level logs could help.

mtrmac avatar Oct 10 '24 16:10 mtrmac

For this purpose, it would be better to have the image’s layers be clearly independent from any other test — maybe a FROM scratch adding a file that contains the test name + timestamp.

Thank you. I thought I had checked for conflicts, but must've missed something. I'll look into this again when time allows.

edsantiago avatar Oct 10 '24 17:10 edsantiago

push test still flaking, and I'm giving up for the day. It is stumping me:

not ok 50 bud: build push with --force-compression

# # buildah build ...
...
# f0e6fc41f58b93d990cb24331c648bc84b066822d06782aec493d97bc2b7b263
# # buildah push [...]--compression-format gzip img-t50-nesrbrf5 docker://localhost:35311/img-t50-nesrbrf5
...
# # buildah push [...] --compression-format zstd --force-compression=false img-t50-nesrbrf5 docker://localhost:35311/img-t50-nesrbrf5
...
# # skopeo inspect img-t50-nesrbrf5
# {"schemaVersion":2,"mediaType":"application/vnd.oci.image.manifest.v1+json","config":{"mediaType":"application/vnd.oci.image.config.v1+json","digest":"sha256:f0e6fc41f58b93d990cb24331c648bc84b066822d06782aec493d97bc2b7b263","size":524},"layers":[{"mediaType":"application/vnd.oci.image.layer.v1.tar+zstd","digest":"sha256:05d36d22e1ad1534254c6965a3b43cf39f4dca9d5c95551eccf40108f076da2b","size":146}],"annotations":{"org.opencontainers.image.base.digest":"","org.opencontainers.image.base.name":""}}
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #|     FAIL: zstd found even though push was without --force-compression
# #| expected: !~ 'zstd'

Where is that 05d36 layer coming from?

edsantiago avatar Nov 04 '24 22:11 edsantiago

I can’t see any obvious reason. I’d suggest doing the pushes with --log-level=debug, that should include traces like Trying to reuse blob

mtrmac avatar Nov 05 '24 16:11 mtrmac

I understand this pull request is in draft state, but I got this warning on OpenScanHub:

Error: SHELLCHECK_WARNING ([CWE-398](https://cwe.mitre.org/data/definitions/398.html)): [[#def1]](https://openscanhub.fedoraproject.org/task/21458/log/added.html#def1)
/usr/share/buildah/test/system/helpers.bash:211:24: warning[[SC2115](https://github.com/koalaman/shellcheck/wiki/SC2115)]: Use "${var:?}" to ensure this never expands to / .
#  209|               else
#  210|                   # Failed. Clean up, so we don't leave incomplete remnants
#  211|->                 rm -fr $_BUILDAH_IMAGE_CACHEDIR/$fname
#  212|               fi
#  213|

It should be fixed before merging this pull request.

siteshwar avatar Nov 07 '24 14:11 siteshwar

Got it with debug. In-page search for 10b119 demonstrates the flake.

edsantiago avatar Nov 07 '24 17:11 edsantiago

another one

edsantiago avatar Nov 07 '24 23:11 edsantiago

Got it with debug. In-page search for 10b119 demonstrates the flake.

Note to self: First:

[+2477s] # time="2024-11-07T16:25:26Z" level=debug msg="Ignoring BlobInfoCache record of digest \"sha256:e4216d41a8ad46a3b75f9cdc2f40cab3cd56837931d4e73be4521a9537d7cde1\", uncompressed format does not match

OK. Then:

[+2477s] # time="2024-11-07T16:26:03Z" level=debug msg="Ignoring BlobInfoCache record of digest \"sha256:e4216d41a8ad46a3b75f9cdc2f40cab3cd56837931d4e73be4521a9537d7cde1\" with unknown compression"

How did we lose the compression format knowledge??

That’s not the immediate cause, but it suggests something unexpected is happening. Both have Using SQLite blob info cache at /var/lib/containers/cache/blob-info-cache-v1.sqlite.

mtrmac avatar Nov 12 '24 17:11 mtrmac

The test image is created via

FROM scratch
COPY /therecanbeonly1 /uniquefile     (thecanbeonly1 contains date + random content)

It is the same image used for each test iteration (push --compression-format gzip, zstd --force-compression=false, zstd, and zstd --force-compression).

One approach I've considered but not tried: build a new image on each iteration. My gut tells me that this might get tests to pass, but is not necessarily the right thing to do. It depends on whether this issue is a real one that we might be sweeping under the rug.

edsantiago avatar Nov 13 '24 13:11 edsantiago

@Luap99 Do you have any hints what happend last with this PR, I would like to take over this PR.

flouthoc avatar Jan 20 '25 19:01 flouthoc

Look at Ed's comments and fix them up, also given he said this is flaking repush several times to ensure this is running stable before we can consider merging.

Luap99 avatar Jan 21 '25 10:01 Luap99

Looks like last few commits fix the flake caused due to parallelize, I am gonna push a few times more.

flouthoc avatar Jan 25 '25 02:01 flouthoc

Okay following PR is ready for review. Flakes which Ed found should are fixed in above commits.

I found one more flake but that is unrelated to changes in this PR and exists in upstream ( https://cirrus-ci.com/task/6143954845958144 ) sample PR in upstream https://github.com/containers/buildah/pull/5943

@containers/buildah-maintainers PTAL

flouthoc avatar Jan 25 '25 22:01 flouthoc

Containerized IntegrationSuccessful in 53m

Looks like this does not bump the VM for the Containerized test so that run is still slow. The in_podman_task in cirrus.yml would need to get the same gce_instance fix like the other integration tasks

LGTM otherwise

@Luap99 Yes I will open a seperate PR for containerized_integration, just wanted to get ed's PR in for this one.

flouthoc avatar Jan 27 '25 18:01 flouthoc

@rhatdan @nalind @TomSweeneyRedHat PTAL

flouthoc avatar Jan 27 '25 18:01 flouthoc

/approve /lgtm

rhatdan avatar Jan 27 '25 20:01 rhatdan

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Jan 27 '25 20:01 openshift-ci[bot]

Created a followup PR: https://github.com/containers/buildah/pull/5947

flouthoc avatar Jan 27 '25 20:01 flouthoc