
Evenly distribute enrollment/certificate/grade refreshes over time

Open rhysyngsun opened this issue 3 years ago • 0 comments

Currently we have a somewhat brittle setup for syncing these, and we recently saw major issues when upstream rate limiting was introduced. We sync these every 6 hours, but we perform the requests as quickly as we can, so we can easily (and unintentionally) overwhelm the upstream servers. Similar to how you would load balance requests over a set of servers, we should load balance these operations over time.

  • [ ] Add redbeat, so celery cron jobs run reliably
  • [ ] Distribute syncing operations over time
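As a starting point for the first checklist item, here's a minimal sketch of pointing Celery beat at redbeat's scheduler so the schedule persists in Redis across beat restarts. The app name and Redis URL are placeholders, not the project's actual settings:

```python
# Sketch: configure celery-redbeat as the beat scheduler.
# "micromasters" and the Redis URL are placeholder values.
from celery import Celery

app = Celery("micromasters")
app.conf.update(
    beat_scheduler="redbeat.RedBeatScheduler",  # store schedule state in Redis
    redbeat_redis_url="redis://localhost:6379/1",
    redbeat_lock_timeout=300,  # lock ensures only one beat instance schedules
)
```

Alternatively, the scheduler can be selected at the command line with `celery beat --scheduler redbeat.RedBeatScheduler`.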

Approach

It's probably not practical to litter time.sleep() everywhere, so a middle ground is to bucket users into smaller groups that are synced at set intervals. We'd run a task at intervals determined by the number of buckets, and that task would sync the users in the bucket for its time slot. We'd probably need to account for some kind of failure healing too. There are a few bucketing options I can see:
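The periodic-task shape described above could look something like this sketch, where `sync_user` stands in for the real per-user sync and the bucket count is an assumed example value:

```python
# Sketch of the bucketed scheduling idea. `sync_user` and NUM_BUCKETS
# are hypothetical placeholders, not existing project code.
NUM_BUCKETS = 96  # e.g. one bucket per 15-minute slot in a day
SECONDS_IN_DAY = 24 * 60 * 60

def sync_user(user_id):
    """Placeholder for the real enrollment/grade/certificate sync."""
    print(f"syncing user {user_id}")

def bucket_for_time(seconds_into_day):
    """Map the current time of day to the bucket index that is due."""
    slot_length = SECONDS_IN_DAY // NUM_BUCKETS
    return seconds_into_day // slot_length

def sync_current_bucket(users_by_bucket, seconds_into_day):
    """Periodic task body: sync only the users whose slot is due now."""
    for user_id in users_by_bucket.get(bucket_for_time(seconds_into_day), []):
        sync_user(user_id)
```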

Modulo Bucketing

A simple implementation would assign users to buckets with something like user.id % num_buckets. That would work, but whenever we change the number of buckets (increasing it reduces request density), every user's time slot gets shuffled, which means some users would temporarily get synced less frequently while others get synced more frequently.
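The reshuffling problem is easy to demonstrate: resizing from 24 to 32 buckets (arbitrary example sizes) moves most users to a different slot.

```python
# Demonstrates the modulo-bucketing reshuffle: after changing the bucket
# count, the majority of users land in a different time slot.
def modulo_bucket(user_id, num_buckets):
    return user_id % num_buckets

user_ids = range(1000)
before = {uid: modulo_bucket(uid, 24) for uid in user_ids}
after = {uid: modulo_bucket(uid, 32) for uid in user_ids}
moved = sum(1 for uid in user_ids if before[uid] != after[uid])
# Only ids congruent mod lcm(24, 32) = 96 to a value below 24 keep
# their slot, so roughly three quarters of users get reshuffled.
```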

Hash Bucketing

A better option is to keep the timing of syncs consistent even across bucket sizing changes, so each user's data syncs on a consistent interval. This is a method used in distributed systems to spread load, which is basically what we're doing, but over time. We'd do this by bucketing users with consistent hashing, which gives a fairly uniform distribution while giving each user a deterministic time slot for syncing. This works by hashing a value (user.id for us) and assigning that hash to a bucket; in our case, we'd take the integer value of the hash and normalize it to a time of day. Here's some Jupyter notebook code that does this for a sequence of 100k integer ids:

import hashlib
from matplotlib import pyplot as plt
import numpy as np

%matplotlib inline

hashes = np.array([
    int.from_bytes(hashlib.md5(str(x).encode("utf-8")).digest(), 'big')
    for x in range(100000)
], dtype=np.float64)  # np.float is deprecated; 128-bit values fit in float64's range

seconds_in_day = 24 * 60 * 60

hashes /= float(2**128) # md5 is 128-bit, normalize the values to a 0..1 range
hashes *= seconds_in_day

bucket_size_seconds = 60 * 15
num_buckets = seconds_in_day // bucket_size_seconds

plt.hist(hashes, bins=num_buckets, range=(0,seconds_in_day))

This plotted pretty uniformly (x-axis is seconds in the day).
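Applied to a single user, the same normalization yields a deterministic second-of-day at which that user should be synced. This is a sketch using the notebook's md5 approach; `sync_second_of_day` is a hypothetical helper name:

```python
# Sketch: derive a deterministic daily sync time from a user id,
# matching the hash normalization used in the notebook above.
import hashlib

SECONDS_IN_DAY = 24 * 60 * 60

def sync_second_of_day(user_id):
    """Hash the id and normalize the 128-bit value to a second of the day."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    return int(value / 2**128 * SECONDS_IN_DAY)
```

Because the slot depends only on the hash, resizing the buckets changes how finely slots are grouped but never moves a user's sync time.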

rhysyngsun · Sep 03 '20 13:09