Pause Deadlocks/database timeouts
I was able to reproduce the pause deadlocks and maybe the db timeouts on a pause docker image.
I believe the core problem is this:
Inside of a transaction (that's maaaaaaaaaaaybe a bit too long running) we do:
INSERT INTO primeur ...
And another process does this (outside of a transaction):
DELETE FROM perms WHERE LOWER(package) = LOWER(?)
The LOWER(package) prevents mysql from using an index lookup, which means it scans the entire table/index and locks way more than it needs to, and gets stuck behind the transaction with the INSERT.
I think to solve this we need to either...
- Get rid of all duplicate rows due to historical bad data, then modify the column to have default mysql collation which will mean 'X' = 'x' in searches so we can do plain index lookups
- Add another column "package_lower" which is the same as package but always lowercase, and use that
- Do a select before any update and delete to retrieve the correct case package names and then do the updates/deletes with that data
Option 1 is probably best, 2 is okay, 3 sucks
Along with this... we should:
- find a way to shorten the transactions so the mysql modifying bits only happen when necessary
- Move
$dio->normalize_package_casing;into the transaction in lib/PAUSE/mldistwatch.pm
It not being in the transaction means when a delete fails on primeur, the delete that removes perms happened already, and that's what leads to https://github.com/andk/pause/issues/372
We are going with option 1 which should negate the deadlocks almost entirely (except when folks updoad dists that share package names around the same time)
https://github.com/wolfsage/pause/tree/no-more-deadlocks (need to clean up and write a deployment plan but tests all pass)
MR opened!
I'm working through the duplicates, contacting authors to explain the problem and asking if we can remove old releases and do the other things needed. This is part of @wolfsage's option 1 above.
Andreas and I have worked through all the duplicates, and by either deleting old releases (with the author's permission), or deleting the old entry from the index, we are now duplicate-free.
What's the next step here?
We believe this should now be fixed and the issue can be closed. Mixed deployment of #440 and #424 has happened just a few minutes ago.
Hooray!