
Deduplicate Scratch.context

mkst opened this issue 1 month ago

I asked our pal, ChatGPT, for a script to determine whether it makes sense to de-duplicate contexts. TL;DR: yes, it would save ~15GB of space (uncompressed).

Script

import hashlib
from django.db import transaction
from coreapp.models.scratch import Scratch  # replace with your actual app/model name

# --- Config ---
context_field = 'context'  # the TextField name that holds your shared text
chunk_size = 1000          # how many rows to process per DB fetch
# ---------------

stats = {}
total_bytes = 0
row_count = 0

print("Scanning contexts...")

# use iterator() to stream rows without loading everything into memory
with transaction.atomic():
    for s in Scratch.objects.only(context_field).iterator(chunk_size=chunk_size):
        text = getattr(s, context_field, None)
        if not text:
            continue
        b = text.encode('utf-8')
        size = len(b)
        # identical contexts produce identical digests, so the hash keys the dedup stats
        h = hashlib.sha256(b).hexdigest()
        total_bytes += size
        row_count += 1
        entry = stats.get(h)
        if entry:
            entry['count'] += 1
        else:
            stats[h] = {'size': size, 'count': 1}

# compute totals
unique_bytes = sum(v['size'] for v in stats.values())
dedup_savings = 1 - (unique_bytes / total_bytes) if total_bytes else 0

print("\n--- Deduplication Estimate ---")
print(f"Rows scanned:       {row_count:,}")
print(f"Unique contexts:    {len(stats):,}")
print(f"Total raw size:     {total_bytes / 1_000_000:.2f} MB")
print(f"Unique text size:   {unique_bytes / 1_000_000:.2f} MB")
print(f"Potential savings:  {dedup_savings * 100:.2f}%")

Results (from my laptop, which has a slightly older copy of the db: 158k scratches rather than the current 196k):

Rows scanned:       158,225
Unique contexts:    68,473
Total raw size:     33902.49 MB
Unique text size:   16119.61 MB
Potential savings:  52.45%

In terms of implementation, we should have a separate table for Contexts. Each scratch will point to an entry in this table. If a user modifies the context, we will check whether that modified version already exists; otherwise we'll create a new Context.
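Rough sketch of what I have in mind (a sketch only; model and field names here are placeholders, not final):

import hashlib

from django.db import models

class Context(models.Model):
    # sha256 of the UTF-8 text; unique, so the exists-check on write is an indexed lookup
    hash = models.CharField(max_length=64, unique=True)
    text = models.TextField()

    @classmethod
    def for_text(cls, text):
        h = hashlib.sha256(text.encode('utf-8')).hexdigest()
        obj, _created = cls.objects.get_or_create(hash=h, defaults={'text': text})
        return obj

class Scratch(models.Model):
    # ...existing fields...
    # PROTECT so a referenced Context can't be deleted out from under a scratch;
    # the housekeeping job removes unreferenced rows instead
    context = models.ForeignKey(Context, on_delete=models.PROTECT, related_name='scratches')

Saving an edit then becomes scratch.context = Context.for_text(new_text), which either reuses an existing row or creates a new one.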

This means we should update the housekeeping script to clean up unreferenced contexts (the same way it clears out ownerless scratches and duff profiles).
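With the related_name from the sketch above, the cleanup itself is a one-liner:

# delete contexts that no scratch points at any more
Context.objects.filter(scratches__isnull=True).delete()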

mkst avatar Nov 11 '25 09:11 mkst

I think @encounter had an idea or two a while back that would also help save disk space with contexts. It involved compression and/or versioning with protobuf, if I remember correctly? Anyway, that might be better as a later thing, once we move to a Rust backend. This seems like much lower-hanging fruit.

ethteck avatar Nov 11 '25 10:11 ethteck

Yeh, doing something clever with diffs/deltas is beyond my IQ, but it definitely makes sense longer term.

Postgres already compresses TEXT fields behind the scenes via TOAST (roughly 70% on the handful of contexts I looked at), but deduplication should still save a decent amount of space!
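For anyone who wants to sanity-check that ratio, pg_column_size() reports the on-disk (post-TOAST) size while octet_length() gives the raw byte length. A sketch, assuming the default table name coreapp_scratch:

from django.db import connection

# compare on-disk (TOAST-compressed) size against raw byte length
with connection.cursor() as cur:
    cur.execute(
        'SELECT sum(pg_column_size(context)), sum(octet_length(context)) '
        'FROM coreapp_scratch'
    )
    on_disk, raw = cur.fetchone()
    print(f'on disk: {on_disk:,} B, raw: {raw:,} B ({1 - on_disk / raw:.0%} saved)')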

mkst avatar Nov 11 '25 10:11 mkst