snapd i/p/requestprompts: add timeout for prompts

When there has been no retrieval of prompt details for a given user after a particular duration, expire all outstanding prompts for that user, as this suggests that there is no prompt client running for that user.

If a user retrieves prompt details or interacts with prompts in some way, such as by retrieving all prompts, a prompt by ID, or attempting to reply to a prompt, bump the timeout to a much longer timeout, since this indicates that a prompt client is running for that user.

This is tracked internally by https://warthogs.atlassian.net/browse/SNAPDENG-28665

Aug 20 '24 05:08 olivercalder

This doesn't strictly need to land for 2.65, but up to you @ernestl whether you'd like to include it, if it can be approved quickly.

CC @sminez :)

Aug 20 '24 05:08 olivercalder

race checker isn't happy:

==================
WARNING: DATA RACE
Read at 0x000000b954a0 by goroutine 52:
  github.com/snapcore/snapd/interfaces/prompting/requestprompts.(*PromptDB).AddOrMerge.func1()
      /home/runner/work/snapd/snapd/src/github.com/snapcore/snapd/interfaces/prompting/requestprompts/requestprompts.go:312 +0x130

Previous write at 0x000000b954a0 by goroutine 36:
  github.com/snapcore/snapd/interfaces/prompting/requestprompts_test.(*requestpromptsSuite).TestPromptExpiration.MockInitialTimeout.Backup[go.shape.int64].func4()
      /home/runner/work/snapd/snapd/src/github.com/snapcore/snapd/testutil/base.go:70 +0x35
  runtime.deferreturn()
      /snap/go/10698/src/runtime/panic.go:605 +0x5d
  runtime.call16()
      /snap/go/10698/src/runtime/asm_amd64.s:775 +0x42
  reflect.Value.Call()
      /snap/go/10698/src/reflect/value.go:365 +0xb5
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/runner/work/snapd/snapd/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:775 +0x965
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/runner/work/snapd/snapd/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:669 +0xe9

Goroutine 52 (running) created at:
  time.goFunc()
      /snap/go/10698/src/time/sleep.go:215 +0x44

Goroutine 36 (finished) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /home/runner/work/snapd/snapd/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:666 +0x5ba
  gopkg.in/check%2ev1.(*suiteRunner).forkTest()
      /home/runner/work/snapd/snapd/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:757 +0x155
  gopkg.in/check%2ev1.(*suiteRunner).runTest()
      /home/runner/work/snapd/snapd/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:812 +0x419
  gopkg.in/check%2ev1.(*suiteRunner).run()
      /home/runner/work/snapd/snapd/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/check.go:618 +0x3c6
  gopkg.in/check%2ev1.Run()
      /home/runner/work/snapd/snapd/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/run.go:92 +0x44
  gopkg.in/check%2ev1.RunAll()
      /home/runner/work/snapd/snapd/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/run.go:84 +0x124
  gopkg.in/check%2ev1.TestingT()
      /home/runner/work/snapd/snapd/src/github.com/snapcore/snapd/vendor/gopkg.in/check.v1/run.go:72 +0x5d3
  github.com/snapcore/snapd/interfaces/prompting/requestprompts_test.Test()
      /home/runner/work/snapd/snapd/src/github.com/snapcore/snapd/interfaces/prompting/requestprompts/requestprompts_test.go:44 +0x26
  testing.tRunner()
      /snap/go/10698/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /snap/go/10698/src/testing/testing.go:1743 +0x44
==================
Found 1 data race(s)

Aug 20 '24 07:08 bboozzoo

Per discussion with Innes, we'd like to pursue (1), where snapd recognizes a client being active by it retrieving prompt details and/or replying to prompts. A dedicated ACK endpoint would also work, but then there would need to be extra communication and synchronization, within the client, within snapd, and between the client and snapd. At the moment, this isn't necessary, and would add complexity.

Sep 10 '24 22:09 olivercalder

Rebasing to resolve conflicts with interfaces/prompting/requestprompts/requestprompts.go

Sep 12 '24 16:09 olivercalder

Changed to 2.67 after discussion, this requires more time.

Sep 20 '24 13:09 ernestl

I've added dedicated clientActivity fields which get passed through from the daemon to the ifacestate manager to the requestprompts backend. For now, the daemon has hard-coded which endpoints qualify as client activity (namely, getPrompt, getPrompts, and postPrompt), but in the future we could expand this in one of a few ways:

Clients could pass an argument to the HTTP request indicating whether to count the request as client activity
Snapd could (potentially) check the origin of the connection, and if it's from an app which is registered as a handler service, it will treat the request as client activity

Another question is whether to count client activity which strictly touches the rules backend (e.g. getting a rule, as opposed to adding a rule, the latter of which results in a pass over outstanding prompts to see if any are now matched) as client activity. For now, we do not. If we did, we'd probably want to add a dedicated method on the prompts backend which simply bumps the timeout without doing anything else, since those API requests wouldn't otherwise have any effect on the prompts.

For now, though, I think this is sufficient for our use case, as the official prompting client never accesses, adds, or modifies rules directly, and requests to the prompt-related endpoints are treated as client activity. Please let me know if this sounds good to you @sminez

Sep 21 '24 18:09 olivercalder

Codecov Report

Attention: Patch coverage is 96.18321% with 5 lines in your changes missing coverage. Please review.

Project coverage is 79.05%. Comparing base (96ea7b0) to head (1a9a804). Report is 132 commits behind head on master.

Files with missing lines	Patch %	Lines
...erfaces/prompting/requestprompts/requestprompts.go	95.53%	4 Missing and 1 partial :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #14390      +/-   ##
==========================================
+ Coverage   78.95%   79.05%   +0.10%     
==========================================
  Files        1084     1087       +3     
  Lines      146638   147949    +1311     
==========================================
+ Hits       115773   116957    +1184     
- Misses      23667    23757      +90     
- Partials     7198     7235      +37

Flag	Coverage Δ
unittests	`79.05% <96.18%> (+0.10%)`	:arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Sep 21 '24 18:09 codecov[bot]

@olivercalder that all sounds good from the client side :+1: I agree that hitting the rules endpoints shouldn't bump the timeout in any way so lets leave things as you describe in your most recent comment. If we do want to make things smarter in future I'd prefer it were done by snapd determining that the request came from a "real" client rather than the client claiming that it is "real" as that then opens a whole can of worms I suspect. Presumably this can be done statically by setting up the interface for prompting in the right way?

Sep 23 '24 11:09 sminez

Rebased to resolve merge conflicts and pull in some test fixes from master.

Oct 21 '24 17:10 olivercalder

Adding the timer interface and fake timer to timeutil more than doubled the size of this PR. Perhaps that commit needs to be spun off into its own PR and landed first?

Oct 23 '24 05:10 olivercalder

This PR now depends on https://github.com/canonical/snapd/pull/14672

Oct 24 '24 22:10 olivercalder

Now that the mockable timers PR #14672 has landed, this PR has been rebased and is ready for review.

Nov 14 '24 14:11 olivercalder

@olivercalder there are some outstanding comments, can you have a look?

Nov 15 '24 09:11 bboozzoo

I accidentally force pushed away some changes built on the mockable timers a few weeks ago.... thankfully I can retrieve them, but ouch. Final straw pushing me to always use git worktree from now on, no more mess of out-of-sync cloned repos.

Nov 26 '24 23:11 olivercalder

Spotted something in one of my notes. When we persist prompts across snapd restarts, it may be beneficial to have a longer initial timeout for prompts when snapd is first started, in case startup takes some time and the client doesn't have a chance to respond before the short initial timeout. Perhaps manually setting timeouts for existing user prompt DBs to activityTimeout during the prompt load() function (which does not yet exist) will suffice.

Dec 02 '24 20:12 olivercalder

Test failures:

google:ubuntu-20.04-64:tests/unit/go:clang --- settle is not converging, but only on clang, not sure why
openstack:fedora-41-64:tests/upgrade/basic --- problem with mount type squashfs
google-distro-2:opensuse-tumbleweed-64:tests/main/microk8s-smoke:edge
google-distro-2:opensuse-tumbleweed-64:tests/main/prepare-image-validation-sets

These are all unrelated (prompting doesn't even run on any of these systems) so I think this PR is safe to merge.

Dec 03 '24 23:12 olivercalder