podman Ed's pet PR with no flake retries

This should make system tests work with the requested DB

Signed-off-by: Ed Santiago [email protected]

None

Mar 17 '23 13:03 edsantiago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [edsantiago]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Mar 17 '23 13:03 openshift-ci[bot]

Debian Build barfing with:

time="2023-03-17T14:06:47Z"
    level=error
    msg="reading system config \"/etc/containers/containers.conf\":
       decode configuration /etc/containers/containers.conf: toml: line 4: Key 'engine' has already been defined."

...but all other Build tasks (Fedora) pass. What is different about Debian TOML parser? And if Debian is different in this regard, what other Debian differences are there, and maybe one of those explains #17607?

Mar 17 '23 14:03 edsantiago

Oh, never mind; this is probably because Debian gets 'engine ... = runc'. TOML is annoying.

Mar 17 '23 14:03 edsantiago

It works! It works! Lots of failures:

Error: adding container id to database: database is locked

...but these happen in int tests, too, except that those get retried, these don't. @mheon PTAL.

Mar 17 '23 17:03 edsantiago

Oh, and @vrothberg, this minimizes the need for an envariable

Mar 17 '23 17:03 edsantiago

The bash here scares me, but LGTM

Mar 17 '23 17:03 mheon

# Error: adding container id to database: database is locked

That's new

Mar 17 '23 17:03 mheon

@mheon no, what I mean is, what do we do with this? Obviously we can't enable it, because sqlite isn't ready. Are these failures helpful to you? (And no, "database is locked" is not new, look at int tests logs for any PR this week. It's all over the place, just, int tests get retried so these are not usually hard failures.)

Mar 17 '23 17:03 edsantiago

@edsantiago My understanding was that it is ready, and all tests should be passing. The fact that we're flaking frequently is news to me, but I haven't been spending much time in PRs this week.

Mar 17 '23 17:03 mheon

Per SQLite docs it looks like internal concurrency issue across the same DB connection, which I thought their locking would handle... Looking further.

Mar 17 '23 17:03 mheon

Just looking at the very first int log I could find today, here ya go, search in-page for "database is locked". (Weird, though, those tests pass. The "UNIQUE constraint failed" errors are hard though. I assume you know all about those.

Mar 17 '23 17:03 edsantiago

We have some pretty clear concurrency issues on display here. The UNIQUE constraint bits look like a DB creation race that I thought would be impossible, but should be fairly easy to code around. The DB locking one is far more problematic. @vrothberg From my reading here, we may need to start using separate DB connections within the same Podman process for each access, or otherwise introduce some sort of locking to prevent more than 1 access to the DB at a time, because the locking on the connection doesn't seem to be sufficient to prevent this sort of thing.

Mar 17 '23 17:03 mheon

I'm sorry. These errors are everywhere, constant, so I genuinely thought they were part of the WIP sqlite migration. Didn't even bother to file flakes for them. For the locking, would transactions help?

Mar 17 '23 18:03 edsantiago

The issue appears to be that SQLite is allowing a write transaction to fire at the same time as another thread is reading the same table, resulting in errors. I don't think we want to wrap all reads in transactions for perf reasons, unfortunately.

The confusing bit is that this seems to contradict the sqlite docs, which say that by default all accesses through a single connection are sequential only. Simplest fix might just be added a mutex to control access and ensure that remains true.

Mar 17 '23 19:03 mheon

@vrothberg and I spent a while discussing conf options this week, including the .d possibility. The perl script isn't needed for that... but it will be needed at some point for testing containers.conf overrides. I'll follow up on Monday.

Mar 19 '23 02:03 edsantiago

Using .d should "just work" in CI even when a test sets a custom containers.conf via CONTAINERS_CONF.

Mar 20 '23 07:03 vrothberg

Using .d should "just work" in CI even when a test sets a custom containers.conf via CONTAINERS_CONF.

Actually not. Whenever CONTAINERS_CONF is set, only this file will be loaded and all system/user configs will be ignored.

Mar 20 '23 10:03 vrothberg

We've been talking about CONTAINERS_CONF_D or something where all of the normal CONTAINERS_CONF processing would be done, but the value of then environment variable would be processed at the end, perhaps it is time to add that.

Mar 20 '23 10:03 rhatdan

We've been talking about CONTAINERS_CONF_D or something where all of the normal CONTAINERS_CONF processing would be done, but the value of then environment variable would be processed at the end, perhaps it is time to add that.

I am working on that at the moment :^)

Mar 20 '23 10:03 vrothberg

[+0666s] not ok 217 events log-file rotation

This test will likely fail because of CONTAINERS_CONF foo.

Mar 21 '23 12:03 vrothberg

[+0801s] not ok 285 podman userns=auto in config file

This one as well.

@edsantiago, can we skip these tests for now with sqlite until we extended the containers.conf loading mechanism?

Mar 21 '23 12:03 vrothberg

I've fixed the userns test (it was buggy). It should pass on my next push here. logrotate, sure, I can skip on my next push.

Mar 21 '23 13:03 edsantiago

I'm a little concerned about the remote timing results. Not enough to actually do anything about it, just concerned enough to log for future evaluation.

type	distro	user	DB	local	remote	container
int	debian-12	root		27:42	38:59
int	fedora-36	root		32:28	34:56	33:13
int	fedora-37	root		27:14	34:00	27:23
int	fedora-37	root	sqlite	29:46	33:49
int	debian-12	rootless		27:14
int	fedora-36	rootless		30:18
int	fedora-37	rootless		28:18
int	fedora-37	rootless	sqlite	28:52
sys	debian-12	root		26:18	17:43
sys	fedora-36	root		30:19	20:32
sys	fedora-37	root		29:25	19:24
sys	fedora-37	root	sqlite	28:37	19:35
sys	fedora-37-aarch64	root		32:40	22:35
sys	debian-12	rootless		27:54
sys	fedora-36	rootless		30:27
sys	fedora-37	rootless		27:36	18:01
sys	fedora-37	rootless	sqlite	27:22

Mar 22 '23 13:03 edsantiago

@cevich heads-up there's something weird going on with Cirrus or Github. Like, this job says it took 1h1m to run, but truth is 32m. And this other job is claiming to still be running, at 39m, but that's super improbable. I bet it's finished but something is delaying somewhere. Ah, and in the minutes it took me to type this, the job finished.

Anyhow, seems to be happening on a lot of jobs today. Easy to catch with my cirrus-pr-timing script, but unfortunately not easy to spot in the Github UI.

Mar 24 '23 14:03 edsantiago

Restarted the two failed jobs (infra flake with something-something-java).

Mar 27 '23 11:03 vrothberg

With _sync=FULL, and with Eventually() (i.e. reverting my Expect() change): f36 root and debian rootless flaked. Sorry.

Apr 07 '23 20:04 edsantiago

diff --git a/libpod/sqlite_state.go b/libpod/sqlite_state.go
index 014edd0f462c..a44b5a966088 100644
--- a/libpod/sqlite_state.go
+++ b/libpod/sqlite_state.go
@@ -33,7 +33,7 @@ const (
        // Deal with timezone automatically.
        sqliteOptionLocation = "_loc=auto"
        // Force WAL mode to fsync after each transaction (https://www.sqlite.org/pragma.html#pragma_synchronous).
-       sqliteOptionSynchronous = "&_sync=FULL"
+       sqliteOptionSynchronous = "&_sync=EXTRA"
        // Allow foreign keys (https://www.sqlite.org/pragma.html#pragma_foreign_keys).
        sqliteOptionForeignKeys = "&_foreign_keys=1"
        // Make sure that transactions happen exclusively.

That would be a last resort. If that doesn't work either, I will get out of the rabbit hole and follow another theory.

Apr 11 '23 08:04 vrothberg

Sorry.

Apr 11 '23 12:04 edsantiago

@vrothberg I'm keeping this alive for many reasons: ~~WAL testing~~ (no longer needed), sqlite testing (still desired), and because disabling flakeAttempts has been so useful at identifying hidden flakes (i.e., getting us closer to solving #18188). Title changed to reflect its new purpose.

Apr 19 '23 10:04 edsantiago

Thank you, @edsantiago!

Given bug week is around the corner, please remind everybody which issues you like to see fixed. Cc: @mheon

Apr 20 '23 07:04 vrothberg

podman podman copied to clipboard

Ed's pet PR with no flake retries

podman
podman copied to clipboard