Decrease size of HistoricalLink table
We use Django Simple History to track changes on a number of our models. The historical link table is huge:
+---------------------------------------+--------+-------+-------+------------+---------+
| CONCAT(table_schema, '.', table_name) | rows   | DATA  | idx   | total_size | idxfrac |
+---------------------------------------+--------+-------+-------+------------+---------+
| perma.perma_historicallink            | 18.76M | 5.66G | 2.30G | 7.96G      |    0.41 |
| perma.perma_link                      |  2.16M | 0.81G | 0.92G | 1.73G      |    1.14 |
| perma.django_session                  |  2.83M | 1.02G | 0.20G | 1.21G      |    0.20 |
| perma.perma_uncaughterror             |  0.81M | 1.01G | 0.02G | 1.03G      |    0.02 |
| perma.perma_historicallinkuser        |  2.64M | 0.54G | 0.45G | 1.00G      |    0.83 |
| perma.perma_capture                   |  5.34M | 0.73G | 0.21G | 0.94G      |    0.28 |
| perma.perma_historicalregistrar       |  1.82M | 0.44G | 0.08G | 0.53G      |    0.19 |
| perma.perma_capturejob                |  1.77M | 0.31G | 0.21G | 0.52G      |    0.69 |
| perma.perma_historicalorganization    |  1.73M | 0.15G | 0.15G | 0.30G      |    1.00 |
| perma.perma_link_folders              |  2.33M | 0.14G | 0.14G | 0.28G      |    1.06 |
...
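(For context: the table exists because the Link model registers a HistoricalRecords manager, which writes a snapshot row on every save. A minimal sketch of the wiring, with field details illustrative rather than Perma's actual model:)

from django.db import models
from simple_history.models import HistoricalRecords

class Link(models.Model):
    guid = models.CharField(max_length=255, primary_key=True)
    creation_timestamp = models.DateTimeField(auto_now_add=True)
    # backs the perma_historicallink table; every save() adds a row there
    history = HistoricalRecords()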
The routine creation of a link writes three rows to this table, one of which is a duplicate that I think clean_duplicate_history will remove.
Let's try running that and see how it goes. The table is so big that I'm not sure it will run painlessly, but we can hope. We probably also want to run it regularly, maybe as an hourly/daily celerybeat task (see the sketch below).
Also, Jack says, "seems totally viable to clean_old_history with some long time window, too."
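A rough sketch of what the scheduled version might look like, wrapping the upstream simple-history management commands in celery tasks. This assumes the commands' --auto, -m/--minutes, and --days flags and celery's CELERYBEAT_SCHEDULE setting; the task paths, cadences, and windows are placeholders we'd want to tune, not a finished implementation:

from celery import shared_task
from celery.schedules import crontab
from django.core.management import call_command

@shared_task
def clean_duplicate_history():
    # only look at recent records, so the hourly run stays cheap
    call_command('clean_duplicate_history', '--auto', '-m', '120')

@shared_task
def clean_old_history():
    # Jack's idea: drop records older than some long window, e.g. a year
    call_command('clean_old_history', '--auto', '--days', '365')

# in settings, something like:
CELERYBEAT_SCHEDULE = {
    'clean-duplicate-history': {'task': 'perma.tasks.clean_duplicate_history', 'schedule': crontab(minute=0)},
    'clean-old-history': {'task': 'perma.tasks.clean_old_history', 'schedule': crontab(minute=30, hour=4)},
}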
We're trying a customized version of the duplicate cleanup command on a clone of the prod db, to see how much it reduces the final size and how long it takes on a db where nothing else is happening. For this experiment we're running it as a fab task in 4 tmux sessions on one minion; in real life we'd probably run it on the IA minions, maybe mediated by celery, maybe not? TBD.
@task
def clean_duplicate_link_history(start_date=None, end_date=None, batch_size=None, dry_run=False):
    """
    One-time task, to clean duplicate history in our 9.3 GB, never-before-de-duped historical links table.
    Based on the Django Simple History management command, but takes a start and end date, and should use
    less memory. Adapted from https://github.com/jazzband/django-simple-history/blob/master/simple_history/management/commands/clean_duplicate_history.py#L87

    start_date and end_date are in the format YYYY-MM-DD.

    Arguments should be strings, e.g.
    fab dev.clean_duplicate_link_history:batch_size="10",dry_run="True"
    """
    from datetime import datetime
    from django.db import transaction
    from tqdm import tqdm
    from perma.models import Link
    import pytz
    import re
    from simple_history.utils import get_history_manager_for_model

    #
    # Format args
    #
    if not start_date:
        # use first archive date
        start_datetime = Link.objects.order_by('creation_timestamp')[0].creation_timestamp
    elif re.match(r'^\d\d\d\d-\d\d-\d\d$', start_date):
        start_datetime = pytz.utc.localize(datetime.strptime(start_date, "%Y-%m-%d"))
    else:
        print("start_date must be in the format YYYY-MM-DD")
        return
    if not end_date:
        end_datetime = pytz.utc.localize(datetime.now())
    elif re.match(r'^\d\d\d\d-\d\d-\d\d$', end_date):
        end_datetime = pytz.utc.localize(datetime.strptime(end_date, "%Y-%m-%d"))
    else:
        print("end_date must be in the format YYYY-MM-DD")
        return
    if batch_size:
        batch_size = int(batch_size)
    # fab passes all arguments as strings, and bool("False") is True, so compare explicitly
    dry_run = dry_run in (True, "True")

    #
    # Retrieve link ids and batch size
    #
    links = Link.objects.all_with_deleted().filter(
        creation_timestamp__gte=start_datetime,
        creation_timestamp__lt=end_datetime
    )
    if batch_size:
        links = links[:batch_size]
    to_dedupe = links.count()
    if not to_dedupe:
        print("No links in batch.")
        return

    #
    # De-dupe history
    #
    print(f"De-duping the history of {to_dedupe} links.")
    historical_link_manager = get_history_manager_for_model(Link)
    for guid in tqdm(links.values_list('guid', flat=True)):
        first_record = historical_link_manager.filter(guid=guid).first()
        if not first_record:
            print(f"No history for {guid}")
            continue
        all_history = historical_link_manager.filter(guid=guid)
        with transaction.atomic():
            to_delete = []
            f1 = first_record
            # walk consecutive pairs of history records; if two adjacent records
            # are identical, mark the earlier of the pair for deletion
            # (same logic as the upstream management command)
            for f2 in all_history[1:]:
                delta = f1.diff_against(f2)
                if not delta.changed_fields:
                    to_delete.append(f1.pk)
                f1 = f2
            if not dry_run:
                historical_link_manager.filter(pk__in=to_delete).delete()
            print(f"{'Found' if dry_run else 'Deleted'} {len(to_delete)} duplicate entries for {guid}.")