Optimize database performance
Secure the News was very much designed as an "MVP," in the spirit of the mantra that "premature optimization is the root of all evil." As a result, querying Scans in the database is exceptionally slow. To see this slowness in action, load the site from the development environment and profile loading either the homepage or the leaderboard (both trigger expensive database operations).
The only reason why this slowness does not affect the production site is our extensive use of front-end caching (thanks, Cloudflare!).
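To put a number on the slowness, here's a minimal sketch that counts the queries a single page load issues, assuming a working dev environment; Django's `CaptureQueriesContext` does the bookkeeping. Run it in `./manage.py shell` (swap in the leaderboard path to profile that page instead):

```python
# Count the SQL queries issued by a single page load in the dev
# environment. CaptureQueriesContext forces query logging even when
# DEBUG is off, so this works in a plain `./manage.py shell`.
from django.db import connection
from django.test import Client
from django.test.utils import CaptureQueriesContext

client = Client()
with CaptureQueriesContext(connection) as ctx:
    client.get("/")  # swap in the leaderboard URL to profile that page

print(f"{len(ctx.captured_queries)} queries")
for query in ctx.captured_queries:
    # Each entry is a dict with the raw SQL and its execution time.
    print(query["time"], query["sql"][:100])
```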
There are some other issues with our database design as well. For example, we re-scan and store the full set of scan data for each site every 8 hours, even though the results of the scans change infrequently. It seems like it would be much more efficient to somehow only store the deltas of each scan, or only store a full set of scan data when the results change.
Since the majority of the app's data is serialized into a relational database from the JSON output of our scanning tool, pshtt, I've wondered in the past if using a database that is optimized for storing and querying JSON directly might be a better approach. During the initial development phase I wanted to avoid the complexity of using something other than the default Django ORM, so this idea was never explored in depth.
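For what it's worth, there may be a middle ground that wouldn't require leaving the ORM at all: on PostgreSQL, Django's `JSONField` (built in since Django 3.1, previously in `django.contrib.postgres.fields`) can store the raw pshtt output and still be queried and indexed. A hypothetical sketch; the model, field, and key names below are illustrative, not our current schema:

```python
from django.db import models


class RawScan(models.Model):
    """Hypothetical model that stores pshtt output verbatim as JSON."""

    site = models.ForeignKey("sites.Site", on_delete=models.CASCADE)
    timestamp = models.DateTimeField(auto_now_add=True)
    # Instead of flattening the JSON into dozens of columns, keep it whole.
    results = models.JSONField()


# PostgreSQL can filter on keys inside the JSON directly, e.g.
# (key name illustrative):
#   RawScan.objects.filter(results__valid_https=True)
```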
Finally: I am an amateur when it comes to database design, so I welcome any and all ideas here!
Here's one approach that would alleviate some of the sluggishness from storing the full scan results every time a site is scanned:
- In the `Scan` model, add a `last_scanned` column.
- When a site is scanned, first take the most recent scan for that site and see if the scan result has changed:
  - If the scan result has changed, insert a new row with `last_scanned` set to the timestamp of the scan.
  - If the scan result has not changed, do not insert a new row. Instead, update `last_scanned` on that row to the timestamp of the most recent scan.
This way you are tracking changes instead of recording every scan.
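Here's a rough sketch of that logic as a helper, assuming a `last_scanned` `DateTimeField` on `Scan`, a `scans` reverse relation on `Site`, and scan results that can be compared for equality; those names are assumptions, not the current schema:

```python
from django.utils import timezone


def record_scan(site, results):
    """Insert a new Scan row only if `results` differ from the latest scan."""
    now = timezone.now()
    latest = site.scans.order_by("-last_scanned").first()
    if latest is not None and latest.results == results:
        # Nothing changed: bump the timestamp on the existing row instead
        # of storing a duplicate copy of the full scan data.
        latest.last_scanned = now
        latest.save(update_fields=["last_scanned"])
        return latest
    # First scan for this site, or the results changed: record a new row.
    return site.scans.create(results=results, last_scanned=now)
```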
So FYI: I implemented this for the SecureDrop landing page scanner so I could scan all the landing pages once an hour. If at some future point the STN database gets too slow, or we decide we want to scan more often, we could use the same code here and do a migration.
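For reference, the schema side of that migration would be straightforward; a hedged sketch, with the app label, field names, and migration dependency all assumed rather than taken from the actual codebases:

```python
from django.db import migrations, models


def backfill_last_scanned(apps, schema_editor):
    # Use the historical model, and assume each existing scan already
    # has a `timestamp` field we can copy into `last_scanned`.
    Scan = apps.get_model("sites", "Scan")
    for scan in Scan.objects.all().iterator():
        scan.last_scanned = scan.timestamp
        scan.save(update_fields=["last_scanned"])


class Migration(migrations.Migration):
    dependencies = [("sites", "0001_initial")]

    operations = [
        migrations.AddField(
            model_name="scan",
            name="last_scanned",
            field=models.DateTimeField(null=True),
        ),
        migrations.RunPython(backfill_last_scanned, migrations.RunPython.noop),
    ]
```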