Journalist API with hash-based versioning
Towards freedomofpress/securedrop-client#2462, this is a proposal and proof of concept for extending the Journalist API with hash-based versioning. I have no expectation that this pull request will ever be merged. It's a first draft for discussion: what more do we want to research, refine, and prototype?
- [x] First review: 8a9924ded7eda7da57ffdbee05f293a0d9964bc0
- [x] Second review: 23167852a0b1143a97feadd767828699584bed20
- [ ] Specification
I think this is a really great starting point; some asks for a refined proposal:
- Can we measure gzip-compressed bytes? JSON should compress pretty well so I think it'll give us a more realistic picture of what we're transmitting through Tor (I think just taking the raw JSON that requests transmits and running it through gzip will be good enough)
- Document that pagination can happen client side (but not planned for server side)
- I am curious if you considered using an etag instead of the global hash. I don't think it changes anything really in terms of sync itself, but that was my first thought when reading it, and if they are the same, then I think that strengthens the case for it, as it's an established pattern.
- We discussed a little bit about optimizing the API response for the GUI, by lazy-loading new/recently updated sources first. I don't think we should do that now, but I think mentioning it as a future possibility and double checking we're not locking ourselves out of that would be good.
- My understanding is that the client will track the global hash, plus the index (hashes of sources). If correct, that makes sense to me, I was wondering if we want the client to be able to independently calculate hashes of sources. I ask because I think that has slight implications on the persistence/storage side of things about storing the raw JSON output that the server gives.
- I'm not yet sold on having every operation be blocking, especially given Tor latencies. I like the principle you articulated of "Optimize for the steady state where there's nothing to do." Inspired by that, I would suggest, "Optimize for the non-error case", which would imply assuming that when the user sends a reply, the server will accept it. I think it is worth exploring/documenting what (non-UX) complications we have if we end up having a queue of things to send to the server.
- Regarding persistence, the Qt client has some sanitization passes over the server's response, plus converting it into its own models, plus some updates of seen state. Right now the proposal is just saving the server's JSON into SQLite and then triggering a UI refresh? Do we want/need to keep some kind of security/sanitization pass?
Responding to you fully but somewhat out of order, @legoktm:
- [x] Can we measure gzip-compressed bytes? JSON should compress pretty well so I think it'll give us a more realistic picture of what we're transmitting through Tor (I think just taking the raw JSON that requests transmits and running it through gzip will be good enough)
Yes: done in 4bb554f0628f0bc349b7aa1b9f6a1b7de83d09e3 and benchmarked in be04a5b6870770151eaab88a311a6763d6353ae9.
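For reference, the measurement is roughly this shape, assuming we just gzip the raw JSON body that `requests` receives (the endpoint URL below is a placeholder for a local dev server, not the real path; the actual benchmark is in the commits above):

```python
import gzip

import requests


def measure(url: str) -> tuple[int, int]:
    """Return (raw_bytes, gzipped_bytes) for a JSON response."""
    response = requests.get(url)
    raw = response.content             # the raw JSON body as transmitted
    compressed = gzip.compress(raw)    # approximates what gzip transfer encoding would send
    return len(raw), len(compressed)


# Placeholder URL; point it at whatever dev server is serving the endpoint.
raw_size, gz_size = measure("http://localhost:8081/api/v2/sources")
print(f"raw={raw_size} B, gzipped={gz_size} B ({gz_size / raw_size:.0%} of raw)")
```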
- [x] I am curious if you considered using an etag instead of the global hash. I don't think it changes anything really in terms of sync itself, but that was my first thought when reading it, and if they are the same, then I think that strengthens the case for it, as it's an established pattern.
Great idea: done in 63c4c8842fc5b2314f913a4da6716797e1b34bbb.
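For illustration only, the server side of the ETag round-trip looks roughly like this (Flask-style sketch; the endpoint path, helpers, and hash scheme are assumptions for this example, and the real change is in the commit above):

```python
import hashlib
import json

from flask import Flask, jsonify, request

app = Flask(__name__)


def load_sources() -> list[dict]:
    """Hypothetical stand-in for the real query; returns the full source resources."""
    return []


def global_etag(sources: list[dict]) -> str:
    # Assumption for this sketch: SHA-256 over the canonical JSON of the listing.
    # The reference implementation defines the actual versioning scheme.
    return hashlib.sha256(json.dumps(sources, sort_keys=True).encode()).hexdigest()


@app.get("/api/v2/sources")  # placeholder path
def sources():
    data = load_sources()
    etag = global_etag(data)
    if etag in request.if_none_match:
        return "", 304  # client already has this version; skip the heavy payload
    response = jsonify(data)
    response.set_etag(etag)
    return response
```

The client then replays the tag via `If-None-Match` on its next sync and only downloads the full listing when something has actually changed.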
- [x] Document that pagination can happen client side (but not planned for server side)
- [x] We discussed a little bit about optimizing the API response for the GUI, by lazy-loading new/recently updated sources first. I don't think we should do that now, but I think mentioning it as a future possibility and double checking we're not locking ourselves out of that would be good.
In response to @eloquence's urging of this possibility in Slack, I've sketched a different approach in 8dfbd225f52e69a210e374c7c1dd4e5c9d9db172.
- [x] My understanding is that the client will track the global hash, plus the index (hashes of sources). If correct, that makes sense to me, I was wondering if we want the client to be able to independently calculate hashes of sources. I ask because I think that has slight implications on the persistence/storage side of things about storing the raw JSON output that the server gives.
It's a good question. When I aggregated source-level versioning in 54e8de80514f37023b4393243fc88b5aea215468, I eliminated the version column, and now the client (in theory, per bdaf5a4524c181c3c42488d5c86fd80c3a90aa8c) calculates and versions a full index just like the Server's. This would allow or require (depending on your point of view) us to enforce the following properties:
- Client's `Source.data` strictly equals the Server's JSON response for that source.
- Client's `Item.data` strictly equals the Server's JSON response for that submission or reply.
- The Client is of course free to enrich its local data with SQLite columns outside the `data` JSON column.
The alternative is to go back to saving the Server's returned version along with the JSON response. But I think that would actually add implementation complexity, lose a nice reconciliation/recovery mechanism, and lose some conceptual clarity, for no real benefit I can see. If performance is a concern, the same optimizations are possible on the client as on the Server (sketched roughly after the list below), e.g.:
- Recalculate the index on start-up.
- Use SQLAlchemy event listeners to recalculate the index on writes (rare) and just use the cached version on reads (common).
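To make that concrete, here is a rough sketch of client-side hashing over the verbatim stored JSON (the hash algorithm and helper names are assumptions for illustration; the reference implementation in `api2.py` is authoritative):

```python
import hashlib
import json


def resource_hash(raw_json: str) -> str:
    """Hash exactly the bytes the Server returned, so Client and Server hashes agree."""
    return hashlib.sha256(raw_json.encode()).hexdigest()


def build_index(raw_sources: dict[str, str]) -> dict[str, str]:
    """Map source UUID -> hash, computed from the stored `data` column."""
    return {uuid: resource_hash(raw) for uuid, raw in raw_sources.items()}


def index_version(index: dict[str, str]) -> str:
    """One global version over the whole index, analogous to the Server's ETag."""
    return hashlib.sha256(json.dumps(index, sort_keys=True).encode()).hexdigest()
```

Because `Source.data` and `Item.data` are byte-for-byte what the Server sent, this index can be recomputed at any time, which is the reconciliation/recovery property mentioned above.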
- [ ] I'm not yet sold on having every operation be blocking, especially given Tor latencies. I like the principle you articulated of "Optimize for the steady state where there's nothing to do." Inspired by that, I would suggest, "Optimize for the non-error case", which would imply assuming that when the user sends a reply, the server will accept it. I think it is worth exploring/documenting what (non-UX) complications we have if we end up having a queue of things to send to the server.
I hear you. As usual, I'm starting from a somewhat extreme position for the purpose of discussion. :-) We do so little writing compared to reading that I want to keep this mechanism as simple as possible on both ends. And my hypothesis is that client writes will reach the server much faster, and therefore feel much more responsive, when the Tor circuit is less clogged with sync traffic.
In particular, I want to avoid recreating data races between incoming sync and outgoing writes. That suggests the batching mechanism I've proposed in dfbb7a651cf340b03ca564244aa4ba4f38c78cad.
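Purely to illustrate the shape of that batching (hypothetical names; the actual proposal is in the commit above), a sync cycle would flush queued writes before pulling new state:

```python
from queue import SimpleQueue

# Hypothetical sketch: outgoing writes (replies, deletions, seen-state changes)
# are queued and flushed at the start of each sync cycle, so incoming sync
# never races with an in-flight write.
pending_writes: SimpleQueue = SimpleQueue()


def sync_cycle(api) -> None:
    while not pending_writes.empty():
        write = pending_writes.get()
        write.send(api)   # blocking; per "optimize for the non-error case", assume success
    api.fetch_changes()   # hypothetical call that performs the hash-based sync
```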
- [x] Regarding persistence, the Qt client has some sanitization passes over the server's response, plus converting it into its own models, plus some updates of seen state. Right now the proposal is just saving the server's JSON into SQLite and then triggering a UI refresh? Do we want/need to keep some kind of security/sanitization pass?
That's the naïve implementation. Whether or not we define an OpenAPI specification for the Journalist API as a whole, JSON Schemas would let the client validate server-supplied JSON resources before we persist them to SQLite.
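As a sketch of that validation step, using the `jsonschema` library (the schema here is a placeholder for illustration, not one of the real schemas):

```python
import json

from jsonschema import validate

# Placeholder schema; the real schemas would describe the full source,
# submission, and reply resources.
SOURCE_SCHEMA = {
    "type": "object",
    "required": ["uuid", "journalist_designation"],
    "properties": {
        "uuid": {"type": "string"},
        "journalist_designation": {"type": "string"},
    },
}


def validate_source(raw_json: str) -> dict:
    """Reject malformed or unexpected Server output before it reaches SQLite."""
    data = json.loads(raw_json)
    validate(instance=data, schema=SOURCE_SCHEMA)  # raises jsonschema.ValidationError
    return data
```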
Done as of 3239a3027bc00506549863df7ec61742386f4bb2:
- [x] A semi-[literate] reference implementation in Python of the structures and algorithms for versioning and diffing resources
- [x] An initial set of test vectors
- [x] A scaffold (i.e., schemas and stubs) for the endpoints the new API provides
Remaining:
- [x] Merge `PROPOSAL.md` into `api2.py` so that the overall sync process as well as the server-side specification is self-contained and self-documenting
@legoktm, I'm still tinkering with the supporting changes I've proposed to the toolchain, but the specification in `api2.py` is ready for your review.
Closing in favor of #7604.