Ed's pet PR with no flake retries
This should make system tests work with the requested DB
Signed-off-by: Ed Santiago [email protected]
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: edsantiago
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [edsantiago]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Debian Build barfing with:

```
time="2023-03-17T14:06:47Z" level=error msg="reading system config \"/etc/containers/containers.conf\": decode configuration /etc/containers/containers.conf: toml: line 4: Key 'engine' has already been defined."
```
...but all other Build tasks (Fedora) pass. What is different about Debian's TOML parser? And if Debian is different in this regard, what other Debian differences are there? Maybe one of those explains #17607.
Oh, never mind; this is probably because Debian gets 'engine ... = runc'. TOML is annoying.
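For illustration, here is a hypothetical reproduction of that failure mode with the BurntSushi/toml parser that podman uses. The actual Debian containers.conf contents aren't in the log, so the conflicting keys below are only a guess at the shape of the problem:

```go
package main

import (
	"fmt"

	"github.com/BurntSushi/toml"
)

// The 'engine' table is defined twice: once via a dotted key, once
// via a [engine] table header. Strict TOML parsers reject this.
const conf = `
engine.runtime = "runc"

[engine]
cgroup_manager = "systemd"
`

func main() {
	var v map[string]interface{}
	if _, err := toml.Decode(conf, &v); err != nil {
		// Prints a "Key 'engine' has already been defined" error.
		fmt.Println(err)
	}
}
```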
It works! It works! Lots of failures:
Error: adding container id to database: database is locked
...but these happen in int tests, too, except that those get retried, these don't. @mheon PTAL.
Oh, and @vrothberg, this minimizes the need for an environment variable
The bash here scares me, but LGTM
```
# Error: adding container id to database: database is locked
```
That's new
@mheon no, what I mean is: what do we do with this? Obviously we can't enable it, because sqlite isn't ready. Are these failures helpful to you? (And no, "database is locked" is not new; look at int-test logs for any PR this week. It's all over the place. It's just that int tests get retried, so these are not usually hard failures.)
@edsantiago My understanding was that it is ready, and all tests should be passing. The fact that we're flaking frequently is news to me, but I haven't been spending much time in PRs this week.
Per the SQLite docs it looks like an internal concurrency issue across the same DB connection, which I thought their locking would handle... Looking further.
Just looking at the very first int log I could find today, here ya go, search in-page for "database is locked". (Weird, though: those tests pass. The "UNIQUE constraint failed" errors are hard failures, though. I assume you know all about those.)
We have some pretty clear concurrency issues on display here. The UNIQUE constraint bits look like a DB creation race that I thought would be impossible, but should be fairly easy to code around. The DB locking one is far more problematic. @vrothberg From my reading here, we may need to start using separate DB connections within the same Podman process for each access, or otherwise introduce some sort of locking to prevent more than 1 access to the DB at a time, because the locking on the connection doesn't seem to be sufficient to prevent this sort of thing.
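For the UNIQUE-constraint side, one standard way to code around a creation race of this kind is to make the initializing insert idempotent. A sketch only; the table and column names here are invented, not libpod's actual schema:

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", ":memory:")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS DBConfig (ID INTEGER PRIMARY KEY, SchemaVersion INTEGER)`); err != nil {
		panic(err)
	}

	// If two callers race to create the same row, INSERT OR IGNORE
	// turns the loser's insert into a no-op instead of a
	// "UNIQUE constraint failed" hard error.
	for i := 0; i < 2; i++ {
		if _, err := db.Exec(`INSERT OR IGNORE INTO DBConfig (ID, SchemaVersion) VALUES (1, ?)`, 1); err != nil {
			panic(err)
		}
	}
	fmt.Println("second insert was a silent no-op")
}
```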
I'm sorry. These errors are everywhere, constant, so I genuinely thought they were part of the WIP sqlite migration. Didn't even bother to file flakes for them. For the locking, would transactions help?
The issue appears to be that SQLite is allowing a write transaction to fire at the same time as another thread is reading the same table, resulting in errors. I don't think we want to wrap all reads in transactions for perf reasons, unfortunately.
The confusing bit is that this seems to contradict the sqlite docs, which say that by default all accesses through a single connection are sequential only. The simplest fix might just be adding a mutex to control access and ensure that remains true.
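A minimal sketch of that mutex idea (an illustration, not the actual libpod change): route every use of the shared handle through a lock, so queries, row scans, and write transactions can never interleave on the same connection.

```go
package main

import (
	"database/sql"
	"fmt"
	"sync"

	_ "github.com/mattn/go-sqlite3"
)

// lockedDB serializes all access to one shared database handle.
// Callers do their queries, scans, and transactions inside fn while
// the mutex is held, so no two goroutines touch the connection
// concurrently.
type lockedDB struct {
	mu sync.Mutex
	db *sql.DB
}

func (l *lockedDB) withDB(fn func(*sql.DB) error) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	return fn(l.db)
}

func main() {
	raw, err := sql.Open("sqlite3", ":memory:")
	if err != nil {
		panic(err)
	}
	state := &lockedDB{db: raw}

	// Readers and writers alike go through withDB.
	err = state.withDB(func(db *sql.DB) error {
		_, err := db.Exec(`CREATE TABLE Containers (ID TEXT PRIMARY KEY)`)
		return err
	})
	fmt.Println(err)
}
```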
@vrothberg and I spent a while discussing conf options this week, including the .d possibility. The perl script isn't needed for that... but it will be needed at some point for testing containers.conf overrides. I'll follow up on Monday.
Using .d should "just work" in CI even when a test sets a custom containers.conf via CONTAINERS_CONF.
> Using .d should "just work" in CI even when a test sets a custom containers.conf via CONTAINERS_CONF.

Actually not. Whenever CONTAINERS_CONF is set, only this file will be loaded and all system/user configs will be ignored.
We've been talking about CONTAINERS_CONF_D or something, where all of the normal CONTAINERS_CONF processing would be done but the value of the environment variable would be processed at the end; perhaps it is time to add that.
> We've been talking about CONTAINERS_CONF_D or something, where all of the normal CONTAINERS_CONF processing would be done but the value of the environment variable would be processed at the end; perhaps it is time to add that.
I am working on that at the moment :^)
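A sketch of what that ordering could look like. The env-var name, paths, and helper below are assumptions for illustration, not the final containers/common API: the base configs and any containers.conf.d drop-ins are merged first, and the override from the environment variable is layered on last so it wins on conflicts.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// configPaths returns config files in merge order: base configs,
// then .d drop-ins in lexical order, then an override file from the
// environment applied last. All names here are illustrative.
func configPaths() []string {
	paths := []string{
		"/usr/share/containers/containers.conf",
		"/etc/containers/containers.conf",
	}
	const dropInDir = "/etc/containers/containers.conf.d"
	if entries, err := os.ReadDir(dropInDir); err == nil {
		for _, e := range entries { // ReadDir sorts by filename
			if !e.IsDir() && strings.HasSuffix(e.Name(), ".conf") {
				paths = append(paths, filepath.Join(dropInDir, e.Name()))
			}
		}
	}
	// Unlike CONTAINERS_CONF, which replaces the whole load order,
	// the override is applied on top of it.
	if override, ok := os.LookupEnv("CONTAINERS_CONF_OVERRIDE"); ok {
		paths = append(paths, override)
	}
	return paths
}

func main() {
	fmt.Println(configPaths())
}
```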
```
[+0666s] not ok 217 events log-file rotation
```
This test will likely fail because of CONTAINERS_CONF foo.
```
[+0801s] not ok 285 podman userns=auto in config file
```
This one as well.
@edsantiago, can we skip these tests for now with sqlite, until we've extended the containers.conf loading mechanism?
I've fixed the userns test (it was buggy). It should pass on my next push here. logrotate, sure, I can skip on my next push.
I'm a little concerned about the remote timing results (all times below are mm:ss). Not enough to actually do anything about it, just concerned enough to log for future evaluation.
| type | distro | user | DB | local | remote | container |
|---|---|---|---|---|---|---|
| int | debian-12 | root | | 27:42 | 38:59 | |
| int | fedora-36 | root | | 32:28 | 34:56 | 33:13 |
| int | fedora-37 | root | | 27:14 | 34:00 | 27:23 |
| int | fedora-37 | root | sqlite | 29:46 | 33:49 | |
| int | debian-12 | rootless | | 27:14 | | |
| int | fedora-36 | rootless | | 30:18 | | |
| int | fedora-37 | rootless | | 28:18 | | |
| int | fedora-37 | rootless | sqlite | 28:52 | | |
| sys | debian-12 | root | | 26:18 | 17:43 | |
| sys | fedora-36 | root | | 30:19 | 20:32 | |
| sys | fedora-37 | root | | 29:25 | 19:24 | |
| sys | fedora-37 | root | sqlite | 28:37 | 19:35 | |
| sys | fedora-37-aarch64 | root | | 32:40 | 22:35 | |
| sys | debian-12 | rootless | | 27:54 | | |
| sys | fedora-36 | rootless | | 30:27 | | |
| sys | fedora-37 | rootless | | 27:36 | 18:01 | |
| sys | fedora-37 | rootless | sqlite | 27:22 | | |
@cevich heads-up: there's something weird going on with Cirrus or GitHub. Like, this job says it took 1h1m to run, but the truth is it took 32m. And this other job claims to still be running, at 39m, but that's super improbable. I bet it's finished but something is delaying somewhere. Ah, and in the minutes it took me to type this, the job finished.
Anyhow, this seems to be happening on a lot of jobs today. Easy to catch with my cirrus-pr-timing script, but unfortunately not easy to spot in the GitHub UI.
Restarted the two failed jobs (infra flake with something-something-java).
With _sync=FULL, and with Eventually() (i.e. reverting my Expect() change): f36 root and debian rootless flaked. Sorry.
```diff
diff --git a/libpod/sqlite_state.go b/libpod/sqlite_state.go
index 014edd0f462c..a44b5a966088 100644
--- a/libpod/sqlite_state.go
+++ b/libpod/sqlite_state.go
@@ -33,7 +33,7 @@ const (
 	// Deal with timezone automatically.
 	sqliteOptionLocation = "_loc=auto"
 	// Force WAL mode to fsync after each transaction (https://www.sqlite.org/pragma.html#pragma_synchronous).
-	sqliteOptionSynchronous = "&_sync=FULL"
+	sqliteOptionSynchronous = "&_sync=EXTRA"
 	// Allow foreign keys (https://www.sqlite.org/pragma.html#pragma_foreign_keys).
 	sqliteOptionForeignKeys = "&_foreign_keys=1"
 	// Make sure that transactions happen exclusively.
```
That would be a last resort. If that doesn't work either, I will get out of the rabbit hole and follow another theory.
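For context on what those sqliteOption* constants become: they are concatenated into the go-sqlite3 connection string. A rough sketch of the resulting DSN (the file path here is made up; the real one is assembled in libpod/sqlite_state.go):

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// _loc, _sync, and _foreign_keys are go-sqlite3 DSN options that
	// map to the corresponding SQLite pragmas on each connection.
	dsn := "file:db.sql?_loc=auto&_sync=EXTRA&_foreign_keys=1"
	db, err := sql.Open("sqlite3", dsn)
	if err != nil {
		panic(err)
	}
	defer db.Close()
	fmt.Println(db.Ping())
}
```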
@vrothberg I'm keeping this alive for many reasons: ~~WAL testing~~ (no longer needed), sqlite testing (still desired), and because disabling flakeAttempts has been so useful at identifying hidden flakes (i.e., getting us closer to solving #18188). Title changed to reflect its new purpose.
Thank you, @edsantiago!
Given bug week is around the corner, please remind everybody which issues you'd like to see fixed. Cc: @mheon