snapd
snapd copied to clipboard
many: fix iface static attrs not properly updating
Similar to https://bugs.launchpad.net/snapd/+bug/1825883 (content) and https://bugs.launchpad.net/snapd/+bug/1942266 (system-files), custom-device, shared-memory, and posix-mq (edit: and more! mount-control, snapd-control, dbus, pretty much any hardware interface, and probably others all may have static attrs) interfaces don't properly update on refresh if the static attributes change. This MR addresses that, and attempts to create a more general solution than the previous approach.
Most of the delta here is just tweaking existing interfaces to inherit from commonInterface, as that was just as easy as defining a new member function for each of the interfaces that doesn't inherit from commonInterface (and maybe will be helpful in the future)
Codecov Report
Merging #12878 (24656fb) into master (81ce92a) will increase coverage by
0.09%
. Report is 154 commits behind head on master. The diff coverage is70.00%
.
:exclamation: Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the GitHub App Integration for your organization. Read more.
@@ Coverage Diff @@
## master #12878 +/- ##
==========================================
+ Coverage 78.69% 78.79% +0.09%
==========================================
Files 1000 1011 +11
Lines 124254 125701 +1447
==========================================
+ Hits 97779 99041 +1262
- Misses 20321 20446 +125
- Partials 6154 6214 +60
Flag | Coverage Δ | |
---|---|---|
unittests | 78.79% <70.00%> (+0.09%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
Files Changed | Coverage Δ | |
---|---|---|
overlord/ifacestate/helpers.go | 74.53% <69.81%> (-0.55%) |
:arrow_down: |
overlord/ifacestate/handlers.go | 62.92% <71.42%> (-0.14%) |
:arrow_down: |
... and 73 files with indirect coverage changes
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
There is a test failure in spread that looks real and I can reproduce it locally:
$ spread -debug -v qemu-nested:ubuntu-16.04-64:tests/main/interface-update-on-refresh
...
+ echo 'and ensure the snap service is still active'
and ensure the snap service is still active
+ tests.systemd wait-for-service -n 30 --state active snap.go-example-webserver.webserver.service
tests.systemd: service snap.go-example-webserver.webserver.service did not become active
qemu-nested:ubuntu-16.04-64 .../tests/main/upgrade-from-2.15# journalctl -u snap.go-example-webserver.webserver.service|tail
Sep 06 13:48:25 autopkgtest go-example-webserver.webserver[20514]: listen tcp :8081: socket: permission denied
Sep 06 13:48:25 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Main process exited, code=exited, status=1/FAILURE
Sep 06 13:48:25 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Unit entered failed state.
Sep 06 13:48:25 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Failed with result 'exit-code'.
qemu-nested:ubuntu-16.04-64 .../tests/main/upgrade-from-2.15# snap connections
Interface Plug Slot Notes
network go-example-webserver:network :network -
network-bind go-example-webserver:network-bind :network-bind -
qemu-nested:ubuntu-16.04-64 .../tests/main/upgrade-from-2.15# grep '# Description: Can access the network as a server' /var/lib/snapd/apparmor/profiles/snap.go-example-webserver.webserver
# Description: Can access the network as a server.
# Description: Can access the network as a server.
...
I'm still investigating
I'm able to reproduce this on my workstation. Recording my observations here:
- it appears to be race-y, as I've both observed this failure and observed it succeed on my local machine
- on at least my latest run, I get a different set of logs than @mvo5:
qemu:ubuntu-16.04-64 /var/lib/snapd/apparmor/profiles# journalctl -u snap.go-example-webserver.webserver.service|tail
Sep 06 11:37:58 autopkgtest systemd[1]: Started Service for snap application go-example-webserver.webserver.
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Main process exited, code=killed, status=11/SEGV
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Unit entered failed state.
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Failed with result 'signal'.
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Service hold-off time over, scheduling restart.
Sep 06 11:37:58 autopkgtest systemd[1]: Stopped Service for snap application go-example-webserver.webserver.
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Start request repeated too quickly.
Sep 06 11:37:58 autopkgtest systemd[1]: Failed to start Service for snap application go-example-webserver.webserver.
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Unit entered failed state.
Sep 06 11:37:58 autopkgtest systemd[1]: snap.go-example-webserver.webserver.service: Failed with result 'start-limit-hit'.
it also appears to be failing in different places for me and for @mvo5. Mine fails after using snapd 2.15 to install go-example-webserver, before migrating to the latest snapd deb.
+ snap install go-example-webserver
907.85 KB / 1.77 MB 50.21 % 643.06 KB/s 1sservices
go-example-webserver (stable) 16.04-9 from 'canonical' installed
++ uname -r
+ [[ 4.4.0-210-generic != 4.4.* ]]
+ tests.systemd wait-for-service -n 30 --state active snap.go-example-webserver.webserver.service
tests.systemd: service snap.go-example-webserver.webserver.service did not become active
I'm also seeing lots of
[Wed Sep 6 11:00:47 2023] audit: type=1400 audit(1694012447.778:34): apparmor="DENIED" operation="file_mmap" profile="snap.go-example-webserver.webserver" name="/usr/lib/snapd/snap-exec" pid=3956 comm="snap-exec" requested_mask="m" denied_mask="m" fsuid=0 ouid=0
Not sure if that is related.
Dove back into this (cc @Meulengracht for when you get to it, no rush) and the salient thing here is that when run locally with a qemu backend, I get the same failure (as recorded above) with this branch and with master, which makes me think that there is some additional failure with my local run that is masking anything the test might actually be showing.
If I comment out the if [[ "$(uname -r)" != 4.4.* ]]; then
and fi
lines then it passes in both cases. Don't know enough about what this is actually doing/working around to know what this might mean.
Of course, with the google backend, the failure is different.
@Meulengracht is this ready for landing?
@Meulengracht is this ready for landing?
We need to verify that test pass. This needs to be rebased again probably and then verify test-failures
Hi @Meulengracht,
What is the plan to merge this PR? Are we targeting the 2.64 since 2.63 is already released?
I rebased on latest master.
We still need to investigate why google:ubuntu-16.04-64:tests/main/upgrade-from-2.15
fails in this PR - marking it blocked untill it's taken a look at
I did an investigation into the failing test, it seems that after the snapd upgrade, tasks are scheduled to reconnect the interfaces of go-example-webserver snap, which is great and all fine, but it seems before these snap connections are made, that the webservice service is trying to continously restart, and then exit with failure due to missing permissions (trying to bind to tcp:8081).
The moment the snap connections have been reconnected by the automatic task, then the service is able to start, but unfortunately the restart limit by systemd is hit before this, and thus the service must be manually started.
I do not understand why this PR changes the flow here, and I have not looked into what exactly happens before this PR
Agreed to move this to 2.65
@Meulengracht based on what you said above, my best guess is that there's a race condition between the apps starting and the interface connections being made. Because this PR changes things to do a more thorough policy check when connections are reloaded, the connections being reloaded now takes longer, so isn't complete by the time the app starts, where as previously it was.
This would also explain something I observed early on:
- it appears to be race-y, as I've both observed this failure and observed it succeed on my local machine
Does this sound broadly sane?
Per discussion with @pedronis and @Meulengracht to kill tests/main/upgrade-from-2.15
test
The PR has been updated and rebased on top of master.