
More privacy-conscious URL analysis

Open jdormit opened this issue 7 years ago • 4 comments

Currently, the extension sends all visited URLs to hn.algolia.com to see how many HN comments/stories there are. This caused some privacy concerns: see this comment and this comment for well-reasoned examples.

This comment gives a possible solution: keep a local copy of all HN-submitted URLs and the number of comments/stories for each, avoiding the network request entirely. This would require serious optimizations to be feasible; storing a hash of each URL instead of the full URL would be a good start.

The extension could periodically run a script to download that information and update its database (probably stored in localStorage).
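
To make the idea concrete, the per-page check would then become a purely local lookup, something like the sketch below. The storage key, the map shape, and the use of SHA-256 are all assumptions, not decisions:

```javascript
// Minimal sketch of a local lookup (storage key and shape are hypothetical).
// The local copy is a map: { [urlHash]: { commentCount, storyId } }.
const STORAGE_KEY = 'hnLocalIndex'; // hypothetical key

async function sha256Hex(text) {
  const bytes = new TextEncoder().encode(text);
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
}

// Look up the current page without making any network request.
async function lookupUrl(url) {
  const index = JSON.parse(localStorage.getItem(STORAGE_KEY) || '{}');
  const hash = await sha256Hex(url);
  return index[hash] || null; // null => no known HN submission
}
```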

jdormit avatar Feb 06 '18 18:02 jdormit

And this comment gives a clear summary of the proposed solution and possible implementation.

jdormit avatar Feb 06 '18 18:02 jdormit

Unless I'm being dense, it looks like the Google BigQuery HN dataset has not been updated since 2015: https://bigquery.cloud.google.com/savedquery/440056610598:89ba3d2b9023407c8c440e93b8a9df6d. So that won't be a good data source to fetch all HN submitted stories.

I think the only feasible option is to use the Firebase API to do a one-time download of all HN stories and then subscribe to the real-time updates to keep that DB up to date. This would have to happen on a server since the extension is only running when the user's browser is open. The extension can then query the server over HTTP to download a local copy of all URL hashes/number of comments.

The one-time download algorithm is something like: use the maxitem Firebase endpoint to get the highest item ID, then request every lower item ID via the /item/{id} endpoint, counting backward from the max item. This should obviously be done concurrently.
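
Roughly, the backfill could look like the sketch below, hitting the public HN REST endpoints (maxitem.json and item/{id}.json) in fixed-size concurrent batches; the batch size and the handleItem callback are placeholders:

```javascript
// Sketch of the one-time backfill over the public HN REST API.
const HN_API = 'https://hacker-news.firebaseio.com/v0';
const CONCURRENCY = 100; // arbitrary batch size, tune as needed

async function fetchItem(id) {
  const res = await fetch(`${HN_API}/item/${id}.json`);
  return res.json(); // may be null for deleted/missing items
}

async function backfill(handleItem) {
  const maxId = await (await fetch(`${HN_API}/maxitem.json`)).json();
  for (let id = maxId; id > 0; id -= CONCURRENCY) {
    const batch = [];
    for (let i = id; i > Math.max(0, id - CONCURRENCY); i--) {
      batch.push(fetchItem(i));
    }
    const items = await Promise.all(batch);
    for (const item of items) {
      if (item && item.type === 'story' && item.url) {
        handleItem(item); // e.g. persist { id, url, descendants }
      }
    }
  }
}

// Usage: backfill((item) => console.log(item.id, item.url, item.descendants));
```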

The extension's local copy can be modeled as an append-only log keyed by HN item id. That way it only needs to request new data that it doesn't already have stored.
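
For illustration, the log might be shaped something like this (field names are hypothetical):

```javascript
// Hypothetical shape of the extension's local, append-only copy.
// Keyed by HN item id; maxId tracks the highest id already fetched.
const emptyLog = {
  maxId: 0,
  entries: {
    // [itemId]: { urlHash: '<sha-256 hex>', commentCount: 0 }
  },
};

// Appending never rewrites existing entries, so a sync only has to
// ask for items with id > log.maxId.
function appendEntries(log, newEntries) {
  for (const { id, urlHash, commentCount } of newEntries) {
    log.entries[id] = { urlHash, commentCount };
    if (id > log.maxId) log.maxId = id;
  }
  return log;
}
```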

So here are the steps to take:

  1. Write a server that performs a one-time download of all existing HN stories and then subscribes to the real-time updates.
  2. Store this data (specifically the item ID, number of comments, and the hash of the URL) in a database. SQLite would be a good choice.
  3. Expose the data via an HTTP endpoint. The endpoint should take a query parameter with the biggest item ID already possessed by the caller so that it can return only results with larger IDs. Depending on its size, it may be prudent to compress the response payload.
  4. Add functionality to the extension to use the new infrastructure. The extension needs to keep track of the biggest item ID it knows about, and every 5 minutes (and on startup) it should request all stories with larger IDs from the server (see the sketch below). This data should be stored in localStorage and can then be used to populate the browser action text. The Algolia API will still be used to fetch the actual contents of the submitted item, but only when the browser action button is clicked.
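
Here is a minimal sketch of that sync loop, assuming a hypothetical `/stories?after=<id>` endpoint on the server that returns `[{ id, urlHash, commentCount }, ...]`; the endpoint, query parameter, and storage key are placeholders:

```javascript
// Sketch of the extension's periodic sync against the (hypothetical) server.
const SERVER_URL = 'https://example.com/stories'; // placeholder
const STORAGE_KEY = 'hnLocalIndex';               // placeholder
const SYNC_INTERVAL_MS = 5 * 60 * 1000;           // every 5 minutes

async function sync() {
  const log = JSON.parse(
    localStorage.getItem(STORAGE_KEY) || '{"maxId":0,"entries":{}}'
  );
  const res = await fetch(`${SERVER_URL}?after=${log.maxId}`);
  const newEntries = await res.json(); // assumed [{ id, urlHash, commentCount }]
  for (const { id, urlHash, commentCount } of newEntries) {
    log.entries[id] = { urlHash, commentCount };
    if (id > log.maxId) log.maxId = id;
  }
  localStorage.setItem(STORAGE_KEY, JSON.stringify(log));
}

// Run on startup, then on a timer.
sync();
setInterval(sync, SYNC_INTERVAL_MS);
```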

jdormit avatar Feb 09 '18 12:02 jdormit

And one final note - all URLs should be canonicalized (i.e. remove any query parameters) before hashing.
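
For example, canonicalization plus hashing might look like the sketch below (whether to also strip the fragment, trailing slashes, etc. is still open):

```javascript
// Sketch: canonicalize a URL (drop query string and fragment), then hash it.
function canonicalize(rawUrl) {
  const url = new URL(rawUrl);
  url.search = ''; // drop query parameters
  url.hash = '';   // also drop the fragment (assumption)
  return url.toString();
}

async function hashUrl(rawUrl) {
  const bytes = new TextEncoder().encode(canonicalize(rawUrl));
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
}
```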

jdormit avatar Feb 09 '18 12:02 jdormit

So, I've been thinking about this more and I think it's actually feasible to do all this work in the extension without needing a server. The basic idea is to do a one-time background download of the HN data the first time the extension is loaded. Thereafter, the extension can use the maxitem Firebase endpoint and the JavaScript Firebase SDK to catch itself up to the latest data on load and to subscribe to real-time updates while it is loaded.

The data should still be stored in an append-only log keyed by the item id, and URLs should still be canonicalized and hashed. One possible optimization would be to determine an item's type without downloading the whole item. That way the extension would only do the full download for story-type items (or, even better, download just the number of comments and URL of story-type items).
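
The catch-up/subscription piece could be sketched like this with the Firebase JS SDK (v8-style API) pointed at the public HN database; lastSeenId would come from the local log, and processNewItem stands in for the hash-and-append logic above:

```javascript
// Sketch: subscribe to HN's maxitem via the Firebase JS SDK, then fetch any
// items between the locally known max id and the new one.
import firebase from 'firebase/app';
import 'firebase/database';

firebase.initializeApp({ databaseURL: 'https://hacker-news.firebaseio.com' });
const db = firebase.database();

// Would be read from the extension's local log (see earlier sketches).
let lastSeenId = 0;

// Placeholder: hash item.url and append { id, urlHash, commentCount } to the log.
function processNewItem(item) {
  console.log(item.id, item.url, item.descendants);
}

db.ref('v0/maxitem').on('value', async (snapshot) => {
  const maxId = snapshot.val();
  const pending = [];
  for (let id = lastSeenId + 1; id <= maxId; id++) {
    pending.push(db.ref(`v0/item/${id}`).once('value').then((s) => s.val()));
  }
  const items = await Promise.all(pending);
  for (const item of items) {
    if (item && item.type === 'story' && item.url) {
      processNewItem(item);
    }
  }
  lastSeenId = maxId;
});
```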

jdormit avatar Feb 14 '18 00:02 jdormit