Simplify "seen" logic in SecureDrop Client by adding an "update_seen" boolean field in the `sources` table
Description
SecureDrop Client has performance issues with large numbers of sources, in the 500+ range. Based on profiling results (and ignoring network I/O), updating the source list after a server sync takes about 75% of elapsed time, and within that, checking the seen status of sources takes over 50%. This manifests in the UX as a noticeable freeze in updates to the source list immediately following a sync.
Currently, seen status for a source is checked by looping through a collection of a source's submissions and replies, checking if each has been seen by any journalist, and if any item has not been seen, returning False, otherwise True. This happens for every source in the list.
Replies always count as seen (because the sending journalist saw them), so only messages and file submissions matter. If a boolean was set to False when a source first submits something, that value could be used, instead of looping through the collection. This would effectively eliminate a large chunk of the GUI freeze, improving user experience for instances with large numbers of sources
Server-side changes would be required to:
- add a new
update_seenfield in the database - update it to
falseon source submissions - update it to
truewhen journalist seen_by entries are added for files and messages. - add the field to the relevant endpoints in the journalist API
Client-side changes would be required to:
- drop the
seenproperty in the db model - add the new field in the db and sdk
- deal with possible edge cases where a source submits while their conversation is being viewed.
How will this impact SecureDrop users?
- Improvement in UX for users on high-volume instances
How would this affect SecureDrop's threat model?
The value of the field could be inferred by an adversary with database access just by looking at the seen_by statuses available in the database already - so no change.
It may not be necessary to change the server-side schema and update logic to achieve this with reasonable performance. This SQL query (I had AI help, but tweaked it a bit) seems to run quite efficiently over hundreds of sources and appears to have the intended effect, with is_read set to 1 if none of the associated submissions are unseen. With .timer on after loaddata.py insertion of 350 sources, my results were:
Run Time: real 0.023 user 0.023855 sys 0.000034
SELECT
src.journalist_designation,
CASE
WHEN NOT EXISTS (
SELECT 1
FROM submissions AS sub
WHERE sub.source_id = src.id
/* for this submission there is neither a seen-file nor a seen-message row */
AND NOT EXISTS (SELECT 1 FROM seen_files sf WHERE sf.file_id = sub.id)
AND NOT EXISTS (SELECT 1 FROM seen_messages sm WHERE sm.message_id = sub.id)
)
THEN 1 -- every item is accounted for → is_read = TRUE
ELSE 0 -- at least one un-seen thing remains → is_read = FALSE
END AS is_read
FROM sources AS src;
Nice - i wonder if the ORM is what's making this slow elsewhere.
@zenmonkeykstop will undertake an initial investigation this sprint; if server-side changes are required, we will discuss.
I got the same concept that @eloquence proposed (batch loading with JOINs/inner SELECTs) working in https://github.com/freedomofpress/securedrop/pull/7604/commits/75dc5e25487909c7d3f83d7fb4f88ba053b08299 to pretty good performance. Getting the seen state to fit in that model was actually the hardest part, which is what reminded me of this (also today's earlier conversation).
Now that we're treating Sources as the unit of fetching from the API, it seems like it'll be much easier/faster to calculate these types of aggregate properties. In my APIv2 branch, adding a new Source.seen property should be as simple as:
diff --git a/securedrop/models.py b/securedrop/models.py
index 288d7a1aa..1351c205a 100644
--- a/securedrop/models.py
+++ b/securedrop/models.py
@@ -136,7 +136,7 @@ class Source(db.Model):
# TODO: verify this doesn't lazy-load anything
starred = bool(self.star and self.star.starred)
-
+ collection = [item.to_json_v2() for item in self.collection]
return {
"uuid": self.uuid,
"journalist_designation": self.journalist_designation,
@@ -144,7 +144,8 @@ class Source(db.Model):
"last_updated": last_updated,
"public_key": self.public_key,
"fingerprint": self.fingerprint,
- "collection": [item.to_json_v2() for item in self.collection],
+ "collection": collection,
+ "seen": all(bool(item["seen_by"]) for item in collection),
}
and that doesn't register at all in my naive/quick benchmarking.
I would suggest we get this working in an optimal state in APIv2/new client and then in a month or so, depending on our progress, revisit this on whether backporting it to the old client is worthwhile.
As the path forward becomes clearer, I would suggest we open a tightly scoped change for the server-side change that we want, and close this issue. @legoktm Maybe that initial scoping (not the implementation) could be considered explicitly for the next sprint?
Closing in favor of #7621, which I'm going to try as part of learning day.