CI: system tests: something is eating volumes
Seeing flakes in |600| podman shell completion test when run in parallel:
```
$ podman __completeNoDesc something
...
# #| expected: '.*v-t361-kugfvnq4
```
(but no, it is not there!)
The trick is finding the culprit. There are very few `run_podman volume rm` commands in system tests, so it's easy to check them individually, and I don't see any way for any of them to be responsible. The journal shows nothing useful.
- debian-13 : sys podman debian-13 root host sqlite
  - PR #23275
    - 08-21 19:49 in [sys] |600| podman shell completion test
- rawhide : sys podman rawhide rootless host sqlite
  - PR #23275
    - 09-10 08:08 in [sys] |600| podman shell completion test
| Test type | Binary | Distro | User | Env | DB |
|---|---|---|---|---|---|
| sys(2) | podman(2) | rawhide(1) | root(1) | host(2) | sqlite(2) |
| | | debian-13(1) | rootless(1) | | |
Seen on my laptop:
```
✗ |252| quadlet - volume [6509]
...
Sep 17 04:41:59 f.edsantiago.com systemd[2113]: Finished basic_t205-xzbhuwxt-volume.service.
$ systemctl show --value --property=ActiveState basic_t205-xzbhuwxt-volume.service
active
[04:41:59.601800049] $ bin/podman volume ls
[04:41:59.647905558] Error: no such volume
```
That second one looks like a TOCTOU in the volume ls code, similar to the ones Paul fixed recently around containers and pods. Probably unrelated to the first.
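For anyone unfamiliar with the pattern: a list operation that first reads the set of volume names from the database and then looks up each one individually can race with a concurrent `volume rm`, so the second lookup fails with "no such volume" even though the listing itself did nothing wrong. A minimal Go sketch of the race and the usual fix (illustrative names, not podman's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSuchVolume stands in for podman's "no such volume" error.
var errNoSuchVolume = errors.New("no such volume")

// listVolumeNames and inspectVolume are illustrative stand-ins for two
// separate database lookups; a volume can be removed between the calls.
func listVolumeNames() ([]string, error) {
	return []string{"basic_t205-xzbhuwxt-volume"}, nil
}

func inspectVolume(name string) (string, error) {
	// Pretend the volume vanished between the list and the inspect.
	return "", errNoSuchVolume
}

// listVolumes demonstrates the race and the usual fix: treat "no such
// volume" during the second lookup as "removed while we were listing"
// and skip that entry instead of failing the whole listing.
func listVolumes() ([]string, error) {
	names, err := listVolumeNames()
	if err != nil {
		return nil, err
	}
	var out []string
	for _, name := range names {
		info, err := inspectVolume(name)
		if err != nil {
			if errors.Is(err, errNoSuchVolume) {
				continue // volume removed concurrently; skip it
			}
			return nil, err
		}
		out = append(out, info)
	}
	return out, nil
}

func main() {
	vols, err := listVolumes()
	fmt.Println(vols, err)
}
```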
A friendly reminder that this issue had no activity for 30 days.
Yesterday, f40 aarch64 root
Well, we have the cleanup tracer now, so we can see all podman commands that were executed, and I am still baffled. I am ignoring everything before the logged `podman volume create v-t388-2tpdzsr0` and then filtering for `volume` or `prune`, but there is still no command logged that would explain it. There is a `volume rm v-t407-e5qoxja3` logged, but that is clearly another volume, so at this point I can only suspect that some other command (one that really should not delete the volume) is deleting it.
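The filtering described above amounts to something like the following hypothetical helper: skip everything in the tracer log before the `volume create` of the victim volume, then print only lines that mention volumes or pruning. The log filename and volume name are just the ones from this failure, not anything fixed:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("podman-cleanup-tracer.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	seenCreate := false
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if !seenCreate {
			// Ignore everything before the creation of the volume in question.
			if strings.Contains(line, "volume create v-t388-2tpdzsr0") {
				seenCreate = true
			}
			continue
		}
		// Afterwards, show only commands that could plausibly touch volumes.
		if strings.Contains(line, "volume") || strings.Contains(line, "prune") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```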
Thanks for the confirmation. There are very few volume rm commands in system tests, and even fewer prune, and I have satisfied myself (99%) that none of those can possibly be causing this. What else could eat a volume?
I have no idea. There is `podman rm -v/--volumes`, but per the description that should only delete anonymous volumes (and the ones created by `volume create` are never anonymous). Also, it will only delete volumes attached to the removed container...
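To illustrate that point: the `--volumes` removal path, as described, only considers volumes that are attached to the container being removed and are anonymous, so a named volume from `podman volume create` should never be a candidate. A rough sketch of that filtering (illustrative types and field names, not podman's actual code):

```go
package main

import "fmt"

// Volume is an illustrative stand-in for a container's volume mount record.
type Volume struct {
	Name      string
	Anonymous bool // true only for volumes podman created implicitly
}

// volumesToRemove mimics the described behavior of `podman rm -v/--volumes`:
// only anonymous volumes attached to the removed container are candidates.
func volumesToRemove(attached []Volume) []string {
	var names []string
	for _, v := range attached {
		if v.Anonymous {
			names = append(names, v.Name)
		}
	}
	return names
}

func main() {
	attached := []Volume{
		{Name: "v-t361-kugfvnq4", Anonymous: false}, // from `podman volume create`
		{Name: "a1b2c3d4e5f6", Anonymous: true},     // implicitly created, made-up name
	}
	fmt.Println(volumesToRemove(attached)) // prints only the anonymous one
}
```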
There is no easy explanation that lets me blame the tests, so I have to assume we might have a bug in podman itself where unrelated things get deleted.
Ok, this is not what I expected:
https://api.cirrus-ci.com/v1/artifact/task/6059057566449664/html/sys-podman-rawhide-rootless-host-sqlite.log.html
https://api.cirrus-ci.com/v1/artifact/task/6059057566449664/cleanup_tracer/podman-cleanup-tracer.log
No match for my magic `Special issue 23913 volume deleted by pid` string in the tracer log. What is going on here...
The tracer matches all unlink() and unlinkat() system calls AFAICT, and there is not a single process that tried to remove that file, assuming the tracer works reliably (I cannot be 100% sure of that). So that means we lost the volume from the db only? And then leaked the path? I have created more questions than answers...
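One way to test the "lost from the db only, path leaked" theory the next time this reproduces would be to diff what `podman volume ls` reports against the directories under the storage volumes root. A rough sketch, assuming the default rootless layout of `$HOME/.local/share/containers/storage/volumes` (the path would need adjusting for root or a non-default graphroot):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

func main() {
	// Ask podman which volumes it knows about.
	out, err := exec.Command("podman", "volume", "ls", "--format", "json").Output()
	if err != nil {
		panic(err)
	}
	var vols []struct {
		Name string `json:"Name"`
	}
	if err := json.Unmarshal(out, &vols); err != nil {
		panic(err)
	}
	known := map[string]bool{}
	for _, v := range vols {
		known[v.Name] = true
	}

	// Compare against the on-disk volume directories (rootless default path;
	// adjust for root or a non-default graphroot).
	home, _ := os.UserHomeDir()
	root := filepath.Join(home, ".local/share/containers/storage/volumes")
	entries, err := os.ReadDir(root)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		if e.IsDir() && !known[e.Name()] {
			fmt.Println("on disk but not in db (leaked?):", e.Name())
		}
	}
}
```

A directory showing up here without a matching db entry would point at the "db entry lost, path leaked" scenario rather than something actually unlinking the files.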
Hm. If we assume that the volume is still there, and (please oh please) that there are no db integrity issues, could it be a bug in the query logic?
Maybe? So far this has only been seen with sqlite, but we have too few hits on this flake to rule out boltdb IMO. Listing should be simple, though, and not somehow skip a volume.
I haven't seen this once since https://github.com/containers/podman/pull/24479, so I assume that was it.