
Make past_tallies a sparse table

Open cpeel opened this issue 1 year ago • 2 comments

Background

Background pulled from #1190

The past_tallies table stores historical round, user, and team statistics (current_tallies stores the current day's statistics) including the number of pages done that day (tally_delta). Once a day the site takes a snapshot of current_tallies, calculates the delta since the last snapshot, and adds a row in the past_tallies table for every [tally_type, tally_name, holder_id] tuple.

After a user has done a page in a round, they have an entry for that round in current_tallies and the site adds a new record in past_tallies for that user/round in every daily snapshot going forward. Said another way: we keep daily records of every user that has touched a round forever, even if they never proofread another page.

At pgdp.net, 24 years of this has caught up with us. The past_tallies table is 50GB in size and contains 598,611,362 rows. Of those, 594,322,594 -- 99% -- are records where tally_delta = 0.

This PR

This PR converts past_tallies to a sparse table wherein we do not store entries where tally_delta==0. It includes an upgrade script to delete those records.
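The core of the upgrade is deleting the zero rows. A minimal sketch of the idea using Python and an in-memory SQLite table (the schema here is simplified and hypothetical; the real upgrade script runs against the site's MySQL database):

```python
import sqlite3

# Hypothetical, simplified schema mirroring the description above.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE past_tallies (
        tally_type TEXT, tally_name TEXT, holder_id INTEGER,
        timestamp INTEGER, tally_delta INTEGER
    )
""")
con.executemany(
    "INSERT INTO past_tallies VALUES (?, ?, ?, ?, ?)",
    [
        ("U", "R1", 1, 100, 5),   # user did 5 pages
        ("U", "R1", 1, 200, 0),   # idle day -- stored anyway pre-change
        ("U", "R1", 1, 300, 0),   # another idle day
        ("U", "R1", 2, 100, 2),
    ],
)

# The sparse conversion: drop every row where no pages were done.
con.execute("DELETE FROM past_tallies WHERE tally_delta = 0")
remaining = con.execute("SELECT COUNT(*) FROM past_tallies").fetchone()[0]
print(remaining)  # 2
```

On pgdp.net's numbers, the same statement would remove about 99% of the rows.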

We use past_tallies in a few primary ways:

  1. To get the sum of pages done over a specific time span, e.g. last week or last month
    • A sparse table format where we do not store rows where tally_delta == 0 isn't a problem for this one: we just sum(tally_delta) over a time range anyway, so we don't need the zeros.
  2. To get the number of pages done at a specific time (usually "yesterday")
    • We can calculate this pretty easily from a sparse past_tallies: a missing row for a snapshot simply means zero pages.
  3. To get the number of pages done every day over some span for a table or a chart
    • Without some sort of secondary table that records all of the snapshot times, we can't accurately generate this data: we don't know which days fall between the records where tally_delta <> 0, past_tallies itself isn't set up to be queried by timestamp, and nothing guarantees that every snapshot has at least one non-zero tally_delta. (That may be practically true on pgdp.net but isn't guaranteed for all sites.)
  4. In concert with current_tallies, to determine how many pages each holder did between yesterday and today and add new entries to past_tallies
    • We can programmatically determine this from a sparse past_tallies, but doing so adds a lot of computation to every snapshot for no good reason.
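Cases 1 and 2 above stay cheap on a sparse table because a missing row simply contributes zero. A small Python sketch of the idea (names are illustrative; the real code is PHP in TallyBoard):

```python
# Sparse history for one holder: {snapshot_timestamp: tally_delta};
# days with zero pages have no entry at all.
sparse_deltas = {100: 5, 400: 3, 900: 2}

def pages_done_at(ts):
    """Case 2: pages done in the snapshot at ts; a missing row means zero."""
    return sparse_deltas.get(ts, 0)

def pages_done_between(start, end):
    """Case 1: sum of deltas over a time range; zeros were never needed."""
    return sum(d for t, d in sparse_deltas.items() if start <= t < end)

print(pages_done_at(400))           # 3
print(pages_done_at(500))           # 0 -- no row stored, so no pages that day
print(pages_done_between(100, 900)) # 8
```

Cases 3 and 4 are the ones that need extra support, which is what the schema changes below provide.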

This PR introduces one new table:

  • tally_snapshot_times stores a timestamp of every snapshot. That's it. This one is useful for recreating charts and tables that might be filled with zeros.

And adds additional columns to another:

  • current_tallies now stores the most recent snapshot as well. This allows us to take a snapshot very quickly and then populate past_tallies. It is also used as a very fast (index-based) lookup for "yesterday's" metrics, which we do very often for rounds and users. And finally, it is a fast lookup for the most recent snapshot timestamp.
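Taken together, the snapshot-time table and the extra current_tallies columns let the nightly snapshot run roughly as follows. This is a hedged Python sketch with illustrative field names and in-memory stand-ins for the tables, not the PR's actual PHP code:

```python
# Hypothetical in-memory stand-ins for the tables described above.
current_tallies = {  # holder_id -> current total and value at last snapshot
    1: {"tally": 12, "last_snapshot": 10},
    2: {"tally": 7, "last_snapshot": 7},   # no pages since last snapshot
}
past_tallies = []          # sparse: only non-zero deltas get a row
tally_snapshot_times = []  # every snapshot time is recorded, zeros or not

def take_snapshot(now):
    # Record the time so charts can later re-insert the zero days.
    tally_snapshot_times.append(now)
    for holder_id, row in current_tallies.items():
        delta = row["tally"] - row["last_snapshot"]
        if delta != 0:  # sparse: skip rows where nothing happened
            past_tallies.append((holder_id, now, delta))
        # Fast baseline for the next snapshot, no past_tallies scan needed.
        row["last_snapshot"] = row["tally"]

take_snapshot(1000)
print(past_tallies)          # [(1, 1000, 2)] -- only holder 1 gets a row
print(tally_snapshot_times)  # [1000]
```

The point of storing last_snapshot in current_tallies is that the delta is a single indexed subtraction per holder rather than a query against the huge history table.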

past_tallies is now touched in only the following places:

  • TallyBoard->take_snapshot() -- updated to only insert records where tally_delta != 0
  • TallyBoard->get_info_re_largest_delta() -- unchanged, we still return info about the largest tally_delta. It already handles the case where the tally_holder does not exist in the table.
  • TallyBoard->get_deltas() -- we initialize all possible snapshot times with zeros and fill in any non-zero entries.
  • TallyBoard->get_delta_sum() -- unchanged
  • get_site_tally_grouped() -- we initialize the snapshot times to zeros and then fill in the non-zero values from past_tallies.
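The zero-filling described for get_deltas() and get_site_tally_grouped() amounts to seeding a series from tally_snapshot_times and overlaying the sparse rows. A small Python sketch of the idea (names are illustrative):

```python
# Every snapshot time, from the new tally_snapshot_times table.
snapshot_times = [100, 200, 300, 400, 500]
# Sparse past_tallies rows for one holder: only non-zero deltas survive.
sparse_rows = {200: 4, 500: 1}

# Initialize all snapshot times with zeros, then fill in non-zero entries.
deltas = {ts: 0 for ts in snapshot_times}
deltas.update(sparse_rows)

print(deltas)  # {100: 0, 200: 4, 300: 0, 400: 0, 500: 1}
```

Without the snapshot-time table there would be no way to know that days 100, 300, and 400 ever had snapshots at all.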

Testing

Testing/validating this is going to be a bit tricky because it involves changing data. I've created a separate database on TEST that is a copy of the current one, run the upgrade scripts on it, and pointed the sandbox below at it. A crontab refreshes the sparse DB just after UTC midnight, so the two will drift by at most a day.

The sandbox has a nightly cronjob to take snapshots of the database, but it is currently disabled. This means the database will drift throughout the day and catch up at midnight UTC.

I suggest we test this in two phases:

  1. Validate that the overall approach is sound, the non-snapshot-taking code works, and the sparse DB sandbox matches the main TEST sandbox. This means confirming that users' "yesterday" stats are correct and that the stats (graphs and tables) match across the rounds.
  2. Once we're happy with the code and the sparse-format functionality, I'll disable the sync, enable the snapshot crontab, and we can validate that the nightly snapshot code works as intended.

Sandbox: https://www.pgdp.org/~cpeel/c.branch/revamp-past-tallies-sandbox/

Current past_tallies sizes on TEST

# Main DB
mysql> select count(*) from past_tallies;
+----------+
| count(*) |
+----------+
|  2304294 |
+----------+
1 row in set (1.69 sec)

# Sparse DB
mysql> select count(*) from past_tallies;
+----------+
| count(*) |
+----------+
|     9027 |
+----------+
1 row in set (0.01 sec)

cpeel avatar Apr 29 '24 16:04 cpeel

I've created this testing code for a much more automated testing approach. The goal is to check that the data-gathering functions in the current code against the main TEST database produce the same output as the new code against the sparse DB:

<?php
$relPath = "./pinc/";
include_once($relPath."base.inc");

include($relPath."udb_user.php");
echo "DB: $db_name\n";

foreach (get_all_current_tallyboards() as $tallyboard) {
    if ($tallyboard->holder_type != 'S') {
        continue;
    }
    echo "tally_name: {$tallyboard->tally_name}, holder_type: {$tallyboard->holder_type}\n";
    $time_of_latest_snapshot = $tallyboard->get_time_of_latest_snapshot();
    echo "    get_time_of_latest_snapshot: $time_of_latest_snapshot\n";
    echo "    get_info_from_latest_snapshot:\n";
    dump_array($tallyboard->get_info_from_latest_snapshot(1));
    echo "    get_info_re_largest_delta:\n";
    dump_array($tallyboard->get_info_re_largest_delta(1));
    $timestamps = [
        [0, $time_of_latest_snapshot + 100],
        [1430438400, $time_of_latest_snapshot + 100],
        [1639958400, $time_of_latest_snapshot + 100],
        [1430438400, 1639958400],
    ];
    echo "    get_delta_sum:\n";
    foreach ($timestamps as [$start, $end]) {
        echo "        $start-$end: " . $tallyboard->get_delta_sum(1, $start, $end) . "\n";
    }
    echo "    get_deltas:\n";
    dump_array($tallyboard->get_deltas(1, 0));
}

function dump_array($array)
{
    echo preg_replace("/^/m", "        ", json_encode($array, JSON_PRETTY_PRINT | JSON_NUMERIC_CHECK));
    echo "\n";
}

I ran this against both and did a diff between them:

cpeel@ip-172-31-12-95:~/public_html/c.branch/revamp-past-tallies-sandbox$ php test-Tallyboard.php  > /tmp/sparse.txt
cpeel@ip-172-31-12-95:/data/htdocs/c$ php test-Tallyboard.php > /tmp/full.txt
cpeel@ip-172-31-12-95:/data/htdocs/c$ diff /tmp/full.txt /tmp/sparse.txt
1c1
< DB: dp_utf8test
---
> DB: dp_project_shrink

cpeel avatar Apr 29 '24 16:04 cpeel

I've rebased the branch to fix the merge conflict.

I've updated the cronjob on TEST so that just before every UTC midnight the sandbox DB gets updated from the main TEST DB and the migration scripts run on it. Then just past midnight the nightly snapshot happens for the snapshot DB.

cpeel avatar May 11 '24 14:05 cpeel

This is great feedback, Michael, thanks. I'll work through it tonight.

cpeel avatar May 20 '24 22:05 cpeel

Latest commit addresses the good feedback from @jmdyck.

The aforementioned validation script still shows no functional differences between the functions in master and those in this sandbox.

cpeel avatar May 22 '24 01:05 cpeel

My intent is to get this merged in before June 8th and rolled out in the June 12th deployment.

cpeel avatar May 29 '24 04:05 cpeel