podman icon indicating copy to clipboard operation
podman copied to clipboard

Ed's pet PR with no flake retries

Open edsantiago opened this issue 2 years ago • 37 comments

This should make system tests work with the requested DB

Signed-off-by: Ed Santiago [email protected]

None

edsantiago avatar Mar 17 '23 13:03 edsantiago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Mar 17 '23 13:03 openshift-ci[bot]

Debian Build barfing with:

time="2023-03-17T14:06:47Z"
    level=error
    msg="reading system config \"/etc/containers/containers.conf\":
       decode configuration /etc/containers/containers.conf: toml: line 4: Key 'engine' has already been defined."

...but all other Build tasks (Fedora) pass. What is different about Debian TOML parser? And if Debian is different in this regard, what other Debian differences are there, and maybe one of those explains #17607?

edsantiago avatar Mar 17 '23 14:03 edsantiago

Oh, never mind; this is probably because Debian gets 'engine ... = runc'. TOML is annoying.

edsantiago avatar Mar 17 '23 14:03 edsantiago

It works! It works! Lots of failures:

Error: adding container id to database: database is locked

...but these happen in int tests, too, except that those get retried, these don't. @mheon PTAL.

edsantiago avatar Mar 17 '23 17:03 edsantiago

Oh, and @vrothberg, this minimizes the need for an envariable

edsantiago avatar Mar 17 '23 17:03 edsantiago

The bash here scares me, but LGTM

mheon avatar Mar 17 '23 17:03 mheon

# Error: adding container id to database: database is locked

That's new

mheon avatar Mar 17 '23 17:03 mheon

@mheon no, what I mean is, what do we do with this? Obviously we can't enable it, because sqlite isn't ready. Are these failures helpful to you? (And no, "database is locked" is not new, look at int tests logs for any PR this week. It's all over the place, just, int tests get retried so these are not usually hard failures.)

edsantiago avatar Mar 17 '23 17:03 edsantiago

@edsantiago My understanding was that it is ready, and all tests should be passing. The fact that we're flaking frequently is news to me, but I haven't been spending much time in PRs this week.

mheon avatar Mar 17 '23 17:03 mheon

Per SQLite docs it looks like internal concurrency issue across the same DB connection, which I thought their locking would handle... Looking further.

mheon avatar Mar 17 '23 17:03 mheon

Just looking at the very first int log I could find today, here ya go, search in-page for "database is locked". (Weird, though, those tests pass. The "UNIQUE constraint failed" errors are hard though. I assume you know all about those.

edsantiago avatar Mar 17 '23 17:03 edsantiago

We have some pretty clear concurrency issues on display here. The UNIQUE constraint bits look like a DB creation race that I thought would be impossible, but should be fairly easy to code around. The DB locking one is far more problematic. @vrothberg From my reading here, we may need to start using separate DB connections within the same Podman process for each access, or otherwise introduce some sort of locking to prevent more than 1 access to the DB at a time, because the locking on the connection doesn't seem to be sufficient to prevent this sort of thing.

mheon avatar Mar 17 '23 17:03 mheon

I'm sorry. These errors are everywhere, constant, so I genuinely thought they were part of the WIP sqlite migration. Didn't even bother to file flakes for them. For the locking, would transactions help?

edsantiago avatar Mar 17 '23 18:03 edsantiago

The issue appears to be that SQLite is allowing a write transaction to fire at the same time as another thread is reading the same table, resulting in errors. I don't think we want to wrap all reads in transactions for perf reasons, unfortunately.

The confusing bit is that this seems to contradict the sqlite docs, which say that by default all accesses through a single connection are sequential only. Simplest fix might just be added a mutex to control access and ensure that remains true.

mheon avatar Mar 17 '23 19:03 mheon

@vrothberg and I spent a while discussing conf options this week, including the .d possibility. The perl script isn't needed for that... but it will be needed at some point for testing containers.conf overrides. I'll follow up on Monday.

edsantiago avatar Mar 19 '23 02:03 edsantiago

Using .d should "just work" in CI even when a test sets a custom containers.conf via CONTAINERS_CONF.

vrothberg avatar Mar 20 '23 07:03 vrothberg

Using .d should "just work" in CI even when a test sets a custom containers.conf via CONTAINERS_CONF.

Actually not. Whenever CONTAINERS_CONF is set, only this file will be loaded and all system/user configs will be ignored.

vrothberg avatar Mar 20 '23 10:03 vrothberg

We've been talking about CONTAINERS_CONF_D or something where all of the normal CONTAINERS_CONF processing would be done, but the value of then environment variable would be processed at the end, perhaps it is time to add that.

rhatdan avatar Mar 20 '23 10:03 rhatdan

We've been talking about CONTAINERS_CONF_D or something where all of the normal CONTAINERS_CONF processing would be done, but the value of then environment variable would be processed at the end, perhaps it is time to add that.

I am working on that at the moment :^)

vrothberg avatar Mar 20 '23 10:03 vrothberg

[+0666s] not ok 217 events log-file rotation

This test will likely fail because of CONTAINERS_CONF foo.

vrothberg avatar Mar 21 '23 12:03 vrothberg

[+0801s] not ok 285 podman userns=auto in config file

This one as well.

@edsantiago, can we skip these tests for now with sqlite until we extended the containers.conf loading mechanism?

vrothberg avatar Mar 21 '23 12:03 vrothberg

I've fixed the userns test (it was buggy). It should pass on my next push here. logrotate, sure, I can skip on my next push.

edsantiago avatar Mar 21 '23 13:03 edsantiago

I'm a little concerned about the remote timing results. Not enough to actually do anything about it, just concerned enough to log for future evaluation.

type distro user DB local remote container
int debian-12 root 27:42 38:59
int fedora-36 root 32:28 34:56 33:13
int fedora-37 root 27:14 34:00 27:23
int fedora-37 root sqlite 29:46 33:49
int debian-12 rootless 27:14
int fedora-36 rootless 30:18
int fedora-37 rootless 28:18
int fedora-37 rootless sqlite 28:52
sys debian-12 root 26:18 17:43
sys fedora-36 root 30:19 20:32
sys fedora-37 root 29:25 19:24
sys fedora-37 root sqlite 28:37 19:35
sys fedora-37-aarch64 root 32:40 22:35
sys debian-12 rootless 27:54
sys fedora-36 rootless 30:27
sys fedora-37 rootless 27:36 18:01
sys fedora-37 rootless sqlite 27:22

edsantiago avatar Mar 22 '23 13:03 edsantiago

@cevich heads-up there's something weird going on with Cirrus or Github. Like, this job says it took 1h1m to run, but truth is 32m. And this other job is claiming to still be running, at 39m, but that's super improbable. I bet it's finished but something is delaying somewhere. Ah, and in the minutes it took me to type this, the job finished.

Anyhow, seems to be happening on a lot of jobs today. Easy to catch with my cirrus-pr-timing script, but unfortunately not easy to spot in the Github UI.

edsantiago avatar Mar 24 '23 14:03 edsantiago

Restarted the two failed jobs (infra flake with something-something-java).

vrothberg avatar Mar 27 '23 11:03 vrothberg

With _sync=FULL, and with Eventually() (i.e. reverting my Expect() change): f36 root and debian rootless flaked. Sorry.

edsantiago avatar Apr 07 '23 20:04 edsantiago

diff --git a/libpod/sqlite_state.go b/libpod/sqlite_state.go
index 014edd0f462c..a44b5a966088 100644
--- a/libpod/sqlite_state.go
+++ b/libpod/sqlite_state.go
@@ -33,7 +33,7 @@ const (
        // Deal with timezone automatically.
        sqliteOptionLocation = "_loc=auto"
        // Force WAL mode to fsync after each transaction (https://www.sqlite.org/pragma.html#pragma_synchronous).
-       sqliteOptionSynchronous = "&_sync=FULL"
+       sqliteOptionSynchronous = "&_sync=EXTRA"
        // Allow foreign keys (https://www.sqlite.org/pragma.html#pragma_foreign_keys).
        sqliteOptionForeignKeys = "&_foreign_keys=1"
        // Make sure that transactions happen exclusively.

That would be a last resort. If that doesn't work either, I will get out of the rabbit hole and follow another theory.

vrothberg avatar Apr 11 '23 08:04 vrothberg

Sorry.

edsantiago avatar Apr 11 '23 12:04 edsantiago

@vrothberg I'm keeping this alive for many reasons: ~~WAL testing~~ (no longer needed), sqlite testing (still desired), and because disabling flakeAttempts has been so useful at identifying hidden flakes (i.e., getting us closer to solving #18188). Title changed to reflect its new purpose.

edsantiago avatar Apr 19 '23 10:04 edsantiago

Thank you, @edsantiago!

Given bug week is around the corner, please remind everybody which issues you like to see fixed. Cc: @mheon

vrothberg avatar Apr 20 '23 07:04 vrothberg