snapd many: fix iface static attrs not properly updating

Similar to https://bugs.launchpad.net/snapd/+bug/1825883 (content) and https://bugs.launchpad.net/snapd/+bug/1942266 (system-files), custom-device, shared-memory, and posix-mq (edit: and more! mount-control, snapd-control, dbus, pretty much any hardware interface, and probably others all may have static attrs) interfaces don't properly update on refresh if the static attributes change. This MR addresses that, and attempts to create a more general solution than the previous approach.

Most of the delta here is just tweaking existing interfaces to inherit from commonInterface, as that was just as easy as defining a new member function for each of the interfaces that doesn't inherit from commonInterface (and maybe will be helpful in the future)

Jun 07 '23 21:06 alexclewontin

Codecov Report

Merging #12878 (24656fb) into master (81ce92a) will increase coverage by 0.09%. Report is 154 commits behind head on master. The diff coverage is 70.00%.

:exclamation: Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the GitHub App Integration for your organization. Read more.

@@            Coverage Diff             @@
##           master   #12878      +/-   ##
==========================================
+ Coverage   78.69%   78.79%   +0.09%     
==========================================
  Files        1000     1011      +11     
  Lines      124254   125701    +1447     
==========================================
+ Hits        97779    99041    +1262     
- Misses      20321    20446     +125     
- Partials     6154     6214      +60

Flag	Coverage Δ
unittests	`78.79% <70.00%> (+0.09%)`	:arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed	Coverage Δ
overlord/ifacestate/helpers.go	`74.53% <69.81%> (-0.55%)`	:arrow_down:
overlord/ifacestate/handlers.go	`62.92% <71.42%> (-0.14%)`	:arrow_down:

... and 73 files with indirect coverage changes

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

Jun 07 '23 22:06 codecov-commenter

There is a test failure in spread that looks real and I can reproduce it locally:

$ spread -debug -v qemu-nested:ubuntu-16.04-64:tests/main/interface-update-on-refresh
...
+ echo 'and ensure the snap service is still active'
and ensure the snap service is still active
+ tests.systemd wait-for-service -n 30 --state active snap.go-example-webserver.webserver.service
tests.systemd: service snap.go-example-webserver.webserver.service did not become active

qemu-nested:ubuntu-16.04-64 .../tests/main/upgrade-from-2.15# journalctl -u snap.go-example-webserver.webserver.service|tail
Sep 06 13:48:25 autopkgtest go-example-webserver.webserver[20514]: listen tcp :8081: socket: permission denied
Sep 06 13:48:25 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Main process exited, code=exited, status=1/FAILURE
Sep 06 13:48:25 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Unit entered failed state.
Sep 06 13:48:25 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Failed with result 'exit-code'.

qemu-nested:ubuntu-16.04-64 .../tests/main/upgrade-from-2.15# snap connections
Interface     Plug                               Slot           Notes
network       go-example-webserver:network       :network       -
network-bind  go-example-webserver:network-bind  :network-bind  -

qemu-nested:ubuntu-16.04-64 .../tests/main/upgrade-from-2.15# grep '# Description: Can access the network as a server' /var/lib/snapd/apparmor/profiles/snap.go-example-webserver.webserver
# Description: Can access the network as a server.
# Description: Can access the network as a server.
...
I'm still investigating

Sep 06 '23 15:09 mvo5

I'm able to reproduce this on my workstation. Recording my observations here:

it appears to be race-y, as I've both observed this failure and observed it succeed on my local machine
on at least my latest run, I get a different set of logs than @mvo5:

qemu:ubuntu-16.04-64 /var/lib/snapd/apparmor/profiles# journalctl -u snap.go-example-webserver.webserver.service|tail
Sep 06 11:37:58 autopkgtest systemd[1]: Started Service for snap application go-example-webserver.webserver.
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Main process exited, code=killed, status=11/SEGV
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Unit entered failed state.
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Failed with result 'signal'.
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Service hold-off time over, scheduling restart.
Sep 06 11:37:58 autopkgtest systemd[1]: Stopped Service for snap application go-example-webserver.webserver.
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Start request repeated too quickly.
Sep 06 11:37:58 autopkgtest systemd[1]: Failed to start Service for snap application go-example-webserver.webserver.
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Unit entered failed state.
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Failed with result 'start-limit-hit'.

it also appears to be failing in different places for me and for @mvo5. Mine fails after using snapd 2.15 to install go-example-webserver, before migrating to the latest snapd deb.

+ snap install go-example-webserver
907.85 KB / 1.77 MB  50.21 % 643.06 KB/s 1sservices

go-example-webserver (stable) 16.04-9 from 'canonical' installed
++ uname -r
+ [[ 4.4.0-210-generic != 4.4.* ]]
+ tests.systemd wait-for-service -n 30 --state active snap.go-example-webserver.webserver.service
tests.systemd: service snap.go-example-webserver.webserver.service did not become active

I'm also seeing lots of

[Wed Sep  6 11:00:47 2023] audit: type=1400 audit(1694012447.778:34): apparmor="DENIED" operation="file_mmap" profile="snap.go-example-webserver.webserver" name="/usr/lib/snapd/snap-exec" pid=3956 comm="snap-exec" requested_mask="m" denied_mask="m" fsuid=0 ouid=0

Not sure if that is related.

Sep 06 '23 15:09 alexclewontin

Dove back into this (cc @Meulengracht for when you get to it, no rush) and the salient thing here is that when run locally with a qemu backend, I get the same failure (as recorded above) with this branch and with master, which makes me think that there is some additional failure with my local run that is masking anything the test might actually be showing.

If I comment out the if [[ "$(uname -r)" != 4.4.* ]]; then and fi lines then it passes in both cases. Don't know enough about what this is actually doing/working around to know what this might mean.

Of course, with the google backend, the failure is different.

Jan 25 '24 23:01 alexclewontin

@Meulengracht is this ready for landing?

Apr 05 '24 13:04 zyga

@Meulengracht is this ready for landing?

We need to verify that test pass. This needs to be rebased again probably and then verify test-failures

Apr 05 '24 13:04 Meulengracht

Hi @Meulengracht,

What is the plan to merge this PR? Are we targeting the 2.64 since 2.63 is already released?

May 28 '24 11:05 bugraaydogar

I rebased on latest master.

May 31 '24 14:05 ernestl

We still need to investigate why google:ubuntu-16.04-64:tests/main/upgrade-from-2.15 fails in this PR - marking it blocked untill it's taken a look at

Jun 11 '24 11:06 Meulengracht

I did an investigation into the failing test, it seems that after the snapd upgrade, tasks are scheduled to reconnect the interfaces of go-example-webserver snap, which is great and all fine, but it seems before these snap connections are made, that the webservice service is trying to continously restart, and then exit with failure due to missing permissions (trying to bind to tcp:8081).

The moment the snap connections have been reconnected by the automatic task, then the service is able to start, but unfortunately the restart limit by systemd is hit before this, and thus the service must be manually started.

I do not understand why this PR changes the flow here, and I have not looked into what exactly happens before this PR

Jun 20 '24 10:06 Meulengracht

Agreed to move this to 2.65

Jul 01 '24 13:07 ernestl

@Meulengracht based on what you said above, my best guess is that there's a race condition between the apps starting and the interface connections being made. Because this PR changes things to do a more thorough policy check when connections are reloaded, the connections being reloaded now takes longer, so isn't complete by the time the app starts, where as previously it was.

This would also explain something I observed early on:

it appears to be race-y, as I've both observed this failure and observed it succeed on my local machine

Does this sound broadly sane?

Jul 01 '24 16:07 alexclewontin

Per discussion with @pedronis and @Meulengracht to kill tests/main/upgrade-from-2.15 test

Jul 04 '24 13:07 bboozzoo

The PR has been updated and rebased on top of master.

Jul 04 '24 13:07 bboozzoo

snapd snapd copied to clipboard

many: fix iface static attrs not properly updating

Codecov Report

snapd
snapd copied to clipboard