securedrop icon indicating copy to clipboard operation
securedrop copied to clipboard

Simplify "seen" logic in SecureDrop Client by adding an "update_seen" boolean field in the `sources` table

Open zenmonkeykstop opened this issue 8 months ago • 3 comments

Description

SecureDrop Client has performance issues with large numbers of sources, in the 500+ range. Based on profiling results (and ignoring network I/O), updating the source list after a server sync takes about 75% of elapsed time, and within that, checking the seen status of sources takes over 50%. This manifests in the UX as a noticeable freeze in updates to the source list immediately following a sync.

Currently, seen status for a source is checked by looping through a collection of a source's submissions and replies, checking if each has been seen by any journalist, and if any item has not been seen, returning False, otherwise True. This happens for every source in the list.

Replies always count as seen (because the sending journalist saw them), so only messages and file submissions matter. If a boolean was set to False when a source first submits something, that value could be used, instead of looping through the collection. This would effectively eliminate a large chunk of the GUI freeze, improving user experience for instances with large numbers of sources

Server-side changes would be required to:

  • add a new update_seen field in the database
  • update it to false on source submissions
  • update it to true when journalist seen_by entries are added for files and messages.
  • add the field to the relevant endpoints in the journalist API

Client-side changes would be required to:

  • drop the seen property in the db model
  • add the new field in the db and sdk
  • deal with possible edge cases where a source submits while their conversation is being viewed.

How will this impact SecureDrop users?

  • Improvement in UX for users on high-volume instances

How would this affect SecureDrop's threat model?

The value of the field could be inferred by an adversary with database access just by looking at the seen_by statuses available in the database already - so no change.

zenmonkeykstop avatar May 14 '25 20:05 zenmonkeykstop

It may not be necessary to change the server-side schema and update logic to achieve this with reasonable performance. This SQL query (I had AI help, but tweaked it a bit) seems to run quite efficiently over hundreds of sources and appears to have the intended effect, with is_read set to 1 if none of the associated submissions are unseen. With .timer on after loaddata.py insertion of 350 sources, my results were:

Run Time: real 0.023 user 0.023855 sys 0.000034

SELECT
    src.journalist_designation,
    CASE
        WHEN  NOT EXISTS (
                 SELECT 1
                 FROM submissions  AS sub
                 WHERE sub.source_id = src.id
                   /* for this submission there is neither a seen-file nor a seen-message row */
                   AND NOT EXISTS (SELECT 1 FROM seen_files    sf WHERE sf.file_id    = sub.id)
                   AND NOT EXISTS (SELECT 1 FROM seen_messages sm WHERE sm.message_id = sub.id)
              )
        THEN 1    -- every item is accounted for → is_read = TRUE
        ELSE 0    -- at least one un-seen thing remains → is_read = FALSE
    END  AS is_read

FROM sources AS src;

eloquence avatar May 15 '25 00:05 eloquence

Nice - i wonder if the ORM is what's making this slow elsewhere.

zenmonkeykstop avatar May 15 '25 13:05 zenmonkeykstop

@zenmonkeykstop will undertake an initial investigation this sprint; if server-side changes are required, we will discuss.

eloquence avatar Jun 03 '25 18:06 eloquence

I got the same concept that @eloquence proposed (batch loading with JOINs/inner SELECTs) working in https://github.com/freedomofpress/securedrop/pull/7604/commits/75dc5e25487909c7d3f83d7fb4f88ba053b08299 to pretty good performance. Getting the seen state to fit in that model was actually the hardest part, which is what reminded me of this (also today's earlier conversation).

Now that we're treating Sources as the unit of fetching from the API, it seems like it'll be much easier/faster to calculate these types of aggregate properties. In my APIv2 branch, adding a new Source.seen property should be as simple as:

diff --git a/securedrop/models.py b/securedrop/models.py
index 288d7a1aa..1351c205a 100644
--- a/securedrop/models.py
+++ b/securedrop/models.py
@@ -136,7 +136,7 @@ class Source(db.Model):
 
         # TODO: verify this doesn't lazy-load anything
         starred = bool(self.star and self.star.starred)
-
+        collection = [item.to_json_v2() for item in self.collection]
         return {
             "uuid": self.uuid,
             "journalist_designation": self.journalist_designation,
@@ -144,7 +144,8 @@ class Source(db.Model):
             "last_updated": last_updated,
             "public_key": self.public_key,
             "fingerprint": self.fingerprint,
-            "collection": [item.to_json_v2() for item in self.collection],
+            "collection": collection,
+            "seen": all(bool(item["seen_by"]) for item in collection),
         }

and that doesn't register at all in my naive/quick benchmarking.

I would suggest we get this working in an optimal state in APIv2/new client and then in a month or so, depending on our progress, revisit this on whether backporting it to the old client is worthwhile.

legoktm avatar Jul 08 '25 20:07 legoktm

As the path forward becomes clearer, I would suggest we open a tightly scoped change for the server-side change that we want, and close this issue. @legoktm Maybe that initial scoping (not the implementation) could be considered explicitly for the next sprint?

eloquence avatar Jul 30 '25 19:07 eloquence

Closing in favor of #7621, which I'm going to try as part of learning day.

legoktm avatar Jul 31 '25 21:07 legoktm