
Delete nightly letters fixes

Open leohemsted opened this issue 1 year ago • 3 comments

Best reviewed commit-by-commit

The commits:

reduce query limit for deleting letters nightly 223f270e68611f4fb7e0e51fc344512750a6edb6

we're concerned that we can't delete 50,000 letters from s3, and move their notifications, within the two minute window during celery shutdown between a worker being asked to stop gracefully (SIGTERM) and being killed (SIGKILL).

reduce the limit to 5k for letters only, since letters are the only type where we have to fetch all the notifications into python to deal with their files in s3
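
As a rough sketch of the shape of that change (the constant names, values and helper below are illustrative, not lifted from the codebase):

```python
# Illustrative constants only - the real names and values live in the codebase.
DELETE_QUERY_LIMIT = 50_000         # emails/sms: a single set-based move, no per-row s3 work
LETTER_DELETE_QUERY_LIMIT = 5_000   # letters: each row's file also has to be deleted from s3


def query_limit_for(notification_type):
    # letters get a smaller chunk so a single task can finish (or be safely
    # re-queued) inside the SIGTERM-to-SIGKILL window during celery shutdown
    if notification_type == "letter":
        return LETTER_DELETE_QUERY_LIMIT
    return DELETE_QUERY_LIMIT
```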

trigger multiple tasks to delete large amounts rather than looping ce5c5e2015d03d16e32e79e87eb34742ccf80b42

this has the advantage that if the worker executing the task downscales, then we won't lose work - instead, we'll just queue up a new task to be executed by another worker.

only delete test notifications once c2e4fc2f7c2f4a06b3410bcf65fe35e6a9049679

after the refactor to multiple tasks we would have ended up deleting test notifications every single time we hit the query limit.

now we just do it when the move task returns 0 - ie all the real notifications have already been processed by earlier tasks
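
Putting the last two commits together, the task might look roughly like this - a sketch only, assuming a celery app called `notify_celery` and helpers `move_notifications_to_history` / `delete_test_notifications` (plus `query_limit_for` from the sketch above), none of which are the real names:

```python
@notify_celery.task(name="delete-notifications-for-service-and-type")
def delete_notifications_for_service_and_type(service_id, notification_type):
    # move (at most) one chunk of real notifications to notification_history,
    # deleting letter files from s3 first; returns how many rows were handled
    num_moved = move_notifications_to_history(
        service_id, notification_type, limit=query_limit_for(notification_type)
    )

    if num_moved > 0:
        # there may be more to do: queue a fresh task rather than looping here,
        # so a worker being downscaled only ever loses the chunk it was on
        delete_notifications_for_service_and_type.apply_async(
            args=(service_id, notification_type)
        )
    else:
        # nothing left to move - earlier tasks have handled all the real
        # notifications, so delete the test notifications exactly once
        delete_test_notifications(service_id, notification_type)
```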

ensure we delete the same files from s3 as we do from the database 25ee022308b71299d12c6982426f6081a0d7fb8e

we were not ordering our queries previously - this meant that when we fetched 5000 letters to delete from s3, and then later in the task fetched 5000 letters to move from notifications to notification_history, we wouldn't be guaranteed the same 5000. With all the thrashing of the db as we're deleting across every service all at once, it's hard to reason about what postgres may or may not cache.

if these two queries returned different things, we'd end up in a state where we might delete rows from the database before we've deleted the corresponding files from s3 - but without them in the notifications table we've got no way of finding them in s3! There's a 90 day aws lifecycle policy preventing anything from sticking around for too long, but it's still not great practice.

order both by created_at (asc) to ensure we always delete the oldest first. this also means that if the task fails partway through, it's the oldest things that have already gone, which seems like a sensible way round to do it.
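
Something like the following shape, where the delete-from-s3 select and the move-to-history select share the same WHERE / ORDER BY / LIMIT (table and column names simplified; the retention cutoff is assumed to be computed elsewhere):

```python
from sqlalchemy import text

# Both "which letters to delete from s3" and "which rows to move to
# notification_history" share this WHERE / ORDER BY / LIMIT, so the two
# queries agree on exactly which (oldest) 5,000 rows make up the chunk.
OLDEST_FIRST_CHUNK = text("""
    SELECT id, reference
    FROM notifications
    WHERE service_id = :service_id
      AND notification_type = :notification_type
      AND created_at < :cutoff      -- end of the service's retention period
    ORDER BY created_at ASC         -- deterministic: always the oldest first
    LIMIT :limit
""")
```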

leohemsted avatar Oct 04 '24 12:10 leohemsted

Looking at the existing code comment, it seems like we intentionally didn't order the notifications when getting 50,000 to delete. It doesn't say why, but I was wondering if there is any risk that parallel tasks could be trying to get the same notifications to delete if the task always gets the oldest

klssmith avatar Oct 04 '24 15:10 klssmith

> I was wondering if there is any risk that parallel tasks could be trying to get the same notifications to delete if the task always gets the oldest

It would make sense - I think we ended up doing something similar when deleting old rows from the notification_history table late last year/early this year?

CrystalPea avatar Oct 04 '24 16:10 CrystalPea

quoth the postgres docs:

If ORDER BY is not given, the rows are returned in whatever order the system finds fastest to produce

this will likely be "whichever notifications were last read", and is almost certainly going to be exactly the same for two queries running concurrently, since one will read them into postgres's cache and the second will just get the same data


so i don't think not ordering will protect us from anything. but could anything go wrong right now even with the order?

this is only an issue if there are duplicate tasks running for the same service and type - this shouldn't happen by default as we only trigger each task at the end of the previous task.

Below is a real deep dive into me staring at postgres lock descriptions. this is theoretical, describing what could go wrong if we ran the same task twice, regardless of whether we get the same dataset or not


if there are duplicate tasks running, i could see the following happening:

  • both tasks grab the same 5k letters
  • both tasks try to delete those letters from s3
    • lots of warnings here across both tasks, as files may not be found because they've just been deleted by the other task

then both tasks move on

  • both tasks select into their own temp table. they'll both grab the same rows happily
  • both tasks insert into notification_history from their own temp table
    • we have ON CONFLICT DO NOTHING here, so where they conflict they'll quietly do nothing
  • both tasks try to delete from notifications - they'll take row-level exclusive locks on each row they want to delete - whichever one wins will do the delete, and the other one will just move on, as the delete statement is just DELETE FROM notifications WHERE id IN (...) so it's fine if an id isn't in the table at all. (the whole sequence is sketched below)
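
For reference, roughly the sequence each task runs, as described in the list above - written here as raw SQL in a Python sketch with simplified table/column handling, not the real code:

```python
from sqlalchemy import text


def _move_chunk_to_history(session, service_id, notification_type, cutoff, limit):
    # 1. snapshot the chunk this task is working on into a per-session temp table
    session.execute(
        text("""
            CREATE TEMP TABLE notifications_to_move ON COMMIT DROP AS
            SELECT * FROM notifications
            WHERE service_id = :service_id
              AND notification_type = :notification_type
              AND created_at < :cutoff
            ORDER BY created_at ASC
            LIMIT :limit
        """),
        {"service_id": service_id, "notification_type": notification_type,
         "cutoff": cutoff, "limit": limit},
    )

    # 2. copy the chunk into notification_history; ON CONFLICT DO NOTHING means
    #    a duplicate task inserting the same ids is harmless
    #    (column lists elided - the real insert names its columns explicitly)
    session.execute(
        text("""
            INSERT INTO notification_history
            SELECT * FROM notifications_to_move
            ON CONFLICT (id) DO NOTHING
        """)
    )

    # 3. delete the originals; the WHERE id IN (...) form is a no-op for any id
    #    another task has already deleted
    result = session.execute(
        text("""
            DELETE FROM notifications
            WHERE id IN (SELECT id FROM notifications_to_move)
        """)
    )
    return result.rowcount
```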

so basically, i think there'll be a lot of churn but there shouldn't be any issues. we shouldn't have duplicate tasks anyway 🤔. i don't think it'll result in anything being deleted that shouldn't be, but there are probably weird edge cases i'm missing, especially if the temp table selects return different values.

The other option is to SELECT FOR UPDATE on that first delete-from-s3 select - so those rows are locked and the second task simply gets a different five thousand (and deletes those files). however i'm not entirely sure what the second select, the one that creates the temporary table, would then do. from testing it looks like it'll get the same five thousand - but what if older notifications suddenly appeared (perhaps because another task for that same service/type had an error and rolled back midway)? then the second query would get the oldest five thousand, regardless of whether those rows are covered by the select for update or not.
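
For what it's worth, that first select with the lock added might look something like this (a sketch only, with simplified names). One postgres detail worth flagging: with plain FOR UPDATE a concurrent task waits on the row locks before it can pick up different rows; it's the SKIP LOCKED option (not discussed above) that would let it jump straight to a different, unlocked five thousand.

```python
from sqlalchemy import text

# Sketch: lock the letters we're about to delete from s3 so a duplicate task
# can't pick up the same rows. NB: plain FOR UPDATE makes a concurrent query
# wait for these locks; adding SKIP LOCKED would let it skip to unlocked rows.
LOCKED_LETTER_SELECT = text("""
    SELECT id, reference
    FROM notifications
    WHERE service_id = :service_id
      AND notification_type = 'letter'
      AND created_at < :cutoff
    ORDER BY created_at ASC
    LIMIT :limit
    FOR UPDATE
""")


def select_letters_for_deletion(session, service_id, cutoff, limit=5_000):
    return session.execute(
        LOCKED_LETTER_SELECT,
        {"service_id": service_id, "cutoff": cutoff, "limit": limit},
    ).fetchall()
```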


really, i think the actual solution is to split the letter and non-letter flow completely:

for letters:

  • save the list of notifications from the letter_delete select, with for update
  • create the temp table from a "where id in [list of notifications from before]" query

for emails/sms:

  • create the temp table from the current select, but with for update included (see the sketch below)
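
A sketch of that split, reusing `select_letters_for_deletion` from the earlier sketch - again, every name here is illustrative, and the exact SQL (particularly FOR UPDATE inside CREATE TABLE AS) would need testing against the real schema:

```python
from sqlalchemy import bindparam, text


def create_temp_table_of_rows_to_move(session, service_id, notification_type, cutoff, limit):
    if notification_type == "letter":
        # letters: lock and remember exactly which rows this task owns, delete
        # their files from s3, then build the temp table from that id list
        rows = select_letters_for_deletion(session, service_id, cutoff, limit)
        delete_letter_files_from_s3(rows)  # hypothetical helper
        session.execute(
            text("""
                CREATE TEMP TABLE notifications_to_move ON COMMIT DROP AS
                SELECT * FROM notifications WHERE id IN :ids
            """).bindparams(bindparam("ids", expanding=True)),
            {"ids": [row.id for row in rows]},
        )
    else:
        # emails/sms: no s3 step, so the temp table comes straight from the
        # current select, just with for update added
        session.execute(
            text("""
                CREATE TEMP TABLE notifications_to_move ON COMMIT DROP AS
                SELECT * FROM notifications
                WHERE service_id = :service_id
                  AND notification_type = :notification_type
                  AND created_at < :cutoff
                ORDER BY created_at ASC
                LIMIT :limit
                FOR UPDATE
            """),
            {"service_id": service_id, "notification_type": notification_type,
             "cutoff": cutoff, "limit": limit},
        )
```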

leohemsted avatar Oct 08 '24 13:10 leohemsted