Deduplicate Scratch.context
I asked our pal, ChatGPT, for a script to determine whether it makes sense to try de-duplicating contexts. TL;DR: yes, it'll save ~15GB of space (uncompressed).
Script
```python
import hashlib

from django.db import transaction

from coreapp.models.scratch import Scratch  # replace with your actual app/model name

# --- Config ---
context_field = 'context'  # the TextField name that holds your shared text
chunk_size = 1000          # how many rows to process per DB fetch
# ---------------

stats = {}
total_bytes = 0
row_count = 0

print("Scanning contexts...")

# use iterator() to stream rows without loading everything into memory
with transaction.atomic():
    for s in Scratch.objects.only(context_field).iterator(chunk_size=chunk_size):
        text = getattr(s, context_field, None)
        if not text:
            continue
        b = text.encode('utf-8')
        size = len(b)
        h = hashlib.sha256(b).hexdigest()
        total_bytes += size
        row_count += 1
        entry = stats.get(h)
        if entry:
            entry['count'] += 1
        else:
            stats[h] = {'size': size, 'count': 1}

# compute totals
unique_bytes = sum(v['size'] for v in stats.values())
dedup_savings = 1 - (unique_bytes / total_bytes) if total_bytes else 0

print("\n--- Deduplication Estimate ---")
print(f"Rows scanned: {row_count:,}")
print(f"Unique contexts: {len(stats):,}")
print(f"Total raw size: {total_bytes / 1_000_000:.2f} MB")
print(f"Unique text size: {unique_bytes / 1_000_000:.2f} MB")
print(f"Potential savings: {dedup_savings * 100:.2f}%")
```
Results (on my laptop, which has a slightly older version of the db: 158k scratches rather than the current 196k):
```
Rows scanned: 158,225
Unique contexts: 68,473
Total raw size: 33902.49 MB
Unique text size: 16119.61 MB
Potential savings: 52.45%
```
In terms of implementation, we should have a separate table for Contexts, with each scratch pointing to an entry in it. If a user modifies the context, we'll check whether the modified version already exists and reuse it; otherwise we'll create a new Context.
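Here's a rough sketch of what that could look like. The names (`Context`, `hash`, `for_text`) are hypothetical, not existing code; the key idea is a unique hash column so the exact-match lookup hits an index instead of comparing full text:

```python
import hashlib

from django.db import models


class Context(models.Model):
    """Deduplicated context text shared between scratches (hypothetical model)."""

    # sha256 of the UTF-8 bytes; unique=True also gives us the lookup index
    hash = models.CharField(max_length=64, unique=True)
    text = models.TextField()

    @classmethod
    def for_text(cls, text: str) -> "Context":
        # Reuse the existing row if this exact text is already stored,
        # otherwise create a new one.
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        obj, _created = cls.objects.get_or_create(hash=h, defaults={"text": text})
        return obj


class Scratch(models.Model):
    # ...existing fields...
    # PROTECT so a context can't be deleted out from under a scratch
    context = models.ForeignKey(Context, on_delete=models.PROTECT,
                                related_name="scratches")
```

When a user edits a context, the save path would call `Context.for_text(new_text)` and repoint the foreign key; the old row simply loses one reference.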
This means we should update the housekeeping script to clean up unreferenced contexts (the same way it clears out ownerless scratches and duff profiles).
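Assuming the `Context` model and `scratches` reverse relation from the sketch above, the cleanup step could be as simple as:

```python
# Delete contexts that no scratch references any more. The isnull filter
# only matches rows with zero references, so PROTECT on the FK never fires.
deleted, _ = Context.objects.filter(scratches__isnull=True).delete()
print(f"Removed {deleted} unreferenced contexts")
```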
I think @encounter had an idea or two a while back that would also help save disk space with contexts. It involved compression and/or versioning with protobuf, if I remember correctly? Anyway, that might be better as a later thing, once we move to a Rust backend. This seems like much lower-hanging fruit.
Yeh, doing something clever with diffs/deltas is beyond my IQ but definitely makes sense longer term.
Postgres compresses TEXT fields behind the scenes via TOAST (around 70% for the handful of contexts I looked at), but deduplication should still save a decent amount of space!
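If anyone wants to sanity-check that compression figure on more rows, here's one way to sample it from a Django shell. `coreapp_scratch` is my guess at the table name, so adjust as needed; `pg_column_size` reports the stored (possibly compressed) size and `octet_length` the raw byte length:

```python
from django.db import connection

with connection.cursor() as cur:
    cur.execute("""
        SELECT avg(pg_column_size(context)::float / octet_length(context))
        FROM (SELECT context FROM coreapp_scratch
              WHERE octet_length(context) > 0 LIMIT 1000) AS sample
    """)
    ratio = cur.fetchone()[0]
    print(f"Stored size is ~{ratio:.0%} of raw size on this sample")
```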