Change past_tallies to a sparse format
The past_tallies table stores historical round, user, and team statistics (current_tallies stores the current day's statistics) including the number of pages done that day (tally_delta). Once a day the site takes a snapshot of current_tallies, calculates the delta since the last snapshot, and adds a row in the past_tallies table for every [tally_type, tally_name, holder_id] tuple.
After a user has done a page in a round, they have an entry for that round in current_tallies and the site adds a new record in past_tallies for that user/round in every daily snapshot going forward. Said another way: we keep daily records of every user that has touched a round forever, even if they never proofread another page.
At pgdp.net, 24 years of this has caught up with us. The past_tallies table is 50GB in size and contains 598,611,362 rows. Of those 594,322,594 are records where tally_delta=0 -- 99%.
We need to convert the table to a sparse format where we only store rows where tally_delta <> 0 and adjust all of the code accordingly. This will drastically reduce the size of the table and improve our DB backup times considerably.
We use past_tallies in a few primary ways
- To get the sum of pages done over a specific time span eg: last week, last month
- A sparse table format where we do not store rows where
tally_delta== 0 isn't a problem for this one. We justsum(tally_delta)over a time range anyway -- we don't need the zeros.
- A sparse table format where we do not store rows where
- To get the number of pages done at a specific time (usually "yesterday")
- This one we could calculate from a sparse
past_talliespretty easily.
- This one we could calculate from a sparse
- To get the number of pages done every day over some span for a table or a chart
- Without some sort of secondary table that records all of the snapshot times we can't accurately generate the data because we don't know what days are in-between the records where
tally_delta<> 0.past_talliesitself isn't set up to query againsttimestampand there's nothing that guarantees that every snapshot will have at least one non-zerotally_delta. (This might be practically true on pgdp.net but isn't guaranteed to be true for all sites.)
- Without some sort of secondary table that records all of the snapshot times we can't accurately generate the data because we don't know what days are in-between the records where
- In concert with
current_talliesto determine how many pages they did between yesterday and today to add new entries into `past_tallies- We can programmatically determine this from a sparse
past_talliesbut we add a whole lot of computation for every snapshot in ways that just don't make sense.
- We can programmatically determine this from a sparse
I propose introducing 2 new tables:
CREATE TABLE `latest_snapshot_tallies` (
`tally_name` char(2) NOT NULL default '',
`holder_type` char(1) NOT NULL default '',
`holder_id` int(6) unsigned NOT NULL default '0',
`timestamp` int(10) unsigned NOT NULL default '0',
`tally_delta` int(8) NOT NULL default '0',
`tally_value` int(8) NOT NULL default '0',
PRIMARY KEY (`tally_name`,`holder_type`,`holder_id`)
);
CREATE TABLE `tally_snapshot_times` (
`timestamp` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`timestamp`)
);
latest_snapshot_tallies stores the most recent snapshot -- including tally_delta=0 entries -- for every holder. There is only one entry for every holder however, the timestamp changes at every snapshot. This table can be used with current_tallies to calculate new past_tallies records. It can also be used as a very fast (index-based) lookup for "yesterday's" metrics which we do very often for rounds and users. And finally it is also a fast lookup for the most recent snapshot timestamp.
tally_snapshot_times just stores a timestamp of every snapshot. That's it. This one is useful for recreating charts and tables that might be filled with zeros.
I've got a theoretically working (read: coded but untested) version of the above coded up and it looks like this will work for all of the main code use-cases as well as noncvs use-cases.
Resolved with https://github.com/DistributedProofreaders/dproofreaders/pull/1199
Follow-up: this decreased the number of rows in the past_tallies table at pgpd.net:
Before: 598,611,362 rows
Now: 4,320,420 rows
And after an optimize, the size of the tablespace file:
Before: 53,464MB (53GB)
After: 293MB