
Decrease size of HistoricalLink table

Open rebeccacremona opened this issue 2 years ago • 1 comment

We use Django Simple History to track changes on a number of our models. The historical link table is huge:

+---------------------------------------+--------+-------+-------+------------+---------+
| CONCAT(table_schema, '.', table_name) | rows   | DATA  | idx   | total_size | idxfrac |
+---------------------------------------+--------+-------+-------+------------+---------+
| perma.perma_historicallink            | 18.76M | 5.66G | 2.30G | 7.96G      |    0.41 |
| perma.perma_link                      | 2.16M  | 0.81G | 0.92G | 1.73G      |    1.14 |
| perma.django_session                  | 2.83M  | 1.02G | 0.20G | 1.21G      |    0.20 |
| perma.perma_uncaughterror             | 0.81M  | 1.01G | 0.02G | 1.03G      |    0.02 |
| perma.perma_historicallinkuser        | 2.64M  | 0.54G | 0.45G | 1.00G      |    0.83 |
| perma.perma_capture                   | 5.34M  | 0.73G | 0.21G | 0.94G      |    0.28 |
| perma.perma_historicalregistrar       | 1.82M  | 0.44G | 0.08G | 0.53G      |    0.19 |
| perma.perma_capturejob                | 1.77M  | 0.31G | 0.21G | 0.52G      |    0.69 |
| perma.perma_historicalorganization    | 1.73M  | 0.15G | 0.15G | 0.30G      |    1.00 |
| perma.perma_link_folders              | 2.33M  | 0.14G | 0.14G | 0.28G      |    1.06 |
...
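
For context, these historical tables come from registering HistoricalRecords() on each tracked model, so every save() writes a full copy of the row. A rough sketch of the wiring (the fields below are illustrative, not our actual Link definition):

# Rough sketch of how django-simple-history is wired up -- every save() of a
# tracked model writes a full copy of the row to the historical table.
# Field names here are illustrative, not Perma's actual Link model.
from django.db import models
from simple_history.models import HistoricalRecords

class Link(models.Model):
    guid = models.CharField(max_length=255, primary_key=True)
    submitted_url = models.URLField(max_length=2100)
    creation_timestamp = models.DateTimeField(auto_now_add=True)
    history = HistoricalRecords()  # backs the perma_historicallink table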

Routine creation of a link writes 3 rows to the historical table, one of which is a duplicate that I think clean_duplicate_history will remove.

Maybe let's try running that and see how it goes? The table is so big that I'm not sure it will run painlessly, but we can hope. We probably want to run it regularly, maybe as an hourly or daily celerybeat task.
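
If the cleanup works out, the recurring version could probably just be a beat-scheduled task that shells out to the stock management command. A rough sketch, assuming simple_history's clean_duplicate_history command (task name, module path, and schedule are placeholders):

from celery import shared_task
from django.core.management import call_command

@shared_task
def clean_duplicate_history():
    # de-dupe history for all models registered with simple_history;
    # adding "--dry" would only report, not delete
    call_command("clean_duplicate_history", "--auto")

# in settings, something along these lines (exact setting name depends on
# the celery version/config; crontab comes from celery.schedules):
# CELERYBEAT_SCHEDULE = {
#     'clean-duplicate-history': {
#         'task': 'perma.celery_tasks.clean_duplicate_history',  # hypothetical path
#         'schedule': crontab(hour=3, minute=0),                 # daily at 3am
#     },
# }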

Also, Jack says, "seems totally viable to clean_old_history with some long time window, too."
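
That would be simple_history's other stock command; if I'm reading its docs right, pruning everything older than, say, a year would look roughly like this (the window is just an example):

from django.core.management import call_command

# prune historical records older than ~365 days for all tracked models;
# "--dry" previews what would be removed
call_command("clean_old_history", "--days", "365", "--auto")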

rebeccacremona • Jan 05 '22 19:01

We're trying a customized version of the duplicate cleanup command on a clone of the prod db where nothing else is happening, to see how much it reduces the final size and how long it takes. For this experiment we're running it as a fab task in 4 tmux sessions on one minion (sample invocations after the code below); in real life we would probably run it on the IA minions, maybe mediated by celery, maybe not? TBD.

@task
def clean_duplicate_link_history(start_date=None, end_date=None, batch_size=None, dry_run=False):
    """
    One-time task, to clean duplicate history in our 9.3 GB, never-before-de-duped historical links table.
    Based on the Django Simple History management command, but takes a start and end date, and should use
    less memory. Adapted from https://github.com/jazzband/django-simple-history/blob/master/simple_history/management/commands/clean_duplicate_history.py#L87

    start_date and end_date are in the format YYYY-MM-DD

    Arguments should be strings, e.g.
    fab dev.clean_duplicate_link_history:batch_size="10",dry_run="True"

    """
    from datetime import datetime
    from django.db import transaction
    from tqdm import tqdm
    from perma.models import Link
    import pytz
    import re
    from simple_history.utils import get_history_manager_for_model

    #
    # Format args
    #
    if not start_date:
        # use first archive date
        start_datetime = Link.objects.order_by('creation_timestamp')[0].creation_timestamp
    elif re.match(r'^\d\d\d\d-\d\d-\d\d$', start_date):
        start_datetime = pytz.utc.localize(datetime.strptime(start_date, "%Y-%m-%d"))
    else:
        print("start_date must be in the format YYYY-MM-DD")
        return

    if not end_date:
        end_datetime = pytz.utc.localize(datetime.now())
    elif re.match(r'^\d\d\d\d-\d\d-\d\d$', end_date):
        end_datetime = pytz.utc.localize(datetime.strptime(end_date, "%Y-%m-%d"))
    else:
        print("end_date must be in the format YYYY-MM-DD")
        return

    if batch_size:
        batch_size = int(batch_size)

    # fab passes arguments as strings, and bool("False") would be True:
    # treat anything other than an explicitly falsy value as a dry run
    dry_run = str(dry_run).lower() not in ("", "0", "false", "none")


    #
    # retrieve links and apply batch size
    #
    links = Link.objects.all_with_deleted().filter(
        creation_timestamp__gte=start_datetime,
        creation_timestamp__lt=end_datetime
    )
    if batch_size:
        links = links[:batch_size]

    to_dedupe = links.count()
    if not to_dedupe:
        print("No links in batch.")
        return

    #
    # de-dupe history
    #
    print(f"De-duping the history of {to_dedupe} links.")
    historical_link_manager = get_history_manager_for_model(Link)
    for guid in tqdm(links.values_list('guid', flat=True)):
        all_history = historical_link_manager.filter(guid=guid)
        first_record = all_history.first()
        if not first_record:
            print(f"No history for {guid}")
            continue

        with transaction.atomic():
            to_delete = []
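            # the historical manager returns records newest-first; walk consecutive
            # pairs and, when a pair shows no changed fields, mark the newer record
            # for deletion so the oldest copy of each state survives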
            f1 = first_record
            for f2 in all_history[1:]:
                delta = f1.diff_against(f2)
                if not delta.changed_fields:
                    to_delete.append(f1.pk)
                f1 = f2
            if not dry_run:
                historical_link_manager.filter(pk__in=to_delete).delete()
            print(f"{'Found' if dry_run else 'Deleted'} {len(to_delete)} duplicate entries for {guid}.")

rebeccacremona • Jan 12 '22 22:01