fix(mysql): RFC Handle missing server-side prepared statements
Does your PR solve an issue?
First of all, this is not merge-able as-is, it's a request for feedback/advice.
We're using sqlx with AWS Aurora, and have noticed an issue with Aurora's "zero-downtime" restarts, which wipe server-side prepared statements but preserve TCP connections. In this scenario, every existing connection is effectively poisoned without notice, breaking queries until the connections eventually age out of the pool. This has forced us to disable query caching, at a significant performance cost.
I've had a look at how ActiveRecord, our other main stack, handles this: it wipes the client-side cache and retries when the server reports an unknown prepared statement. While retrying right away would be nice in order to rescue the in-flight query, getting the connection back into a functional state for future queries is the next best thing, and rather easy.
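To make the shape of the fix concrete, here's a minimal, self-contained sketch (not sqlx's actual internals; `StmtCache` and `handle_error` are hypothetical names). MySQL reports error 1243, `ER_UNKNOWN_STMT_HANDLER`, when a client reuses a statement id the server no longer knows about, which is exactly what happens after the Aurora restart wipes the session's statements:

```rust
use std::collections::HashMap;

/// Hypothetical client-side statement cache: SQL text -> server statement id.
struct StmtCache(HashMap<String, u32>);

impl StmtCache {
    fn new() -> Self {
        StmtCache(HashMap::new())
    }

    /// Called when a query fails; returns true if the cache was wiped.
    fn handle_error(&mut self, mysql_error_code: u16) -> bool {
        if mysql_error_code == 1243 {
            // ER_UNKNOWN_STMT_HANDLER: the server forgot our prepared
            // statements (e.g. Aurora zero-downtime restart), so every
            // cached id is now dangling. Wipe the client-side cache so
            // subsequent queries re-prepare instead of failing.
            self.0.clear();
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut cache = StmtCache::new();
    cache.0.insert("SELECT 1".to_string(), 42);

    // An ordinary error (1064 = syntax error) leaves the cache intact.
    assert!(!cache.handle_error(1064));
    assert_eq!(cache.0.len(), 1);

    // ER_UNKNOWN_STMT_HANDLER wipes it so the next query re-prepares.
    assert!(cache.handle_error(1243));
    assert!(cache.0.is_empty());

    println!("ok");
}
```

Whether to additionally retry the failed query in place (as ActiveRecord does) is a separate decision; the cache wipe alone is enough to un-poison the connection for future queries.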
The main problem is testing the change. I've included a second commit with a working but flawed integration-test setup. It's flawed because it needs to expose additional public interfaces so the test can delete the server-side prepared statements while leaving the client-side cache intact, something the library API should not allow, for consistency's sake. I feel an integration test is the better way to cover this scenario, but neither sqlx nor MySQL exposes a way to set up the required conditions. Specifically, statements prepared via the binary protocol are unnamed, and thus cannot be deallocated via DEALLOCATE PREPARE.
I would like to invite feedback & advice on:
- the fix in the first place: I think it's a valid fix, albeit for a niche situation
- a better way to test this
- assuming the above are sorted, specifics of the fix
If this or a similar approach gets accepted, the same change might also be needed for Postgres, which I suspect exhibits the same issue.
Is this a breaking change?
Assuming all the issues get resolved, no, I don't think it would be breaking.
The worst-case outcome I can think of is a false-positive match on the error, which would needlessly throw away the statement cache and cause re-preparation. It would be nice to match more closely on the specific error code, but as it stands this is a functional proof of concept.
I'm not hostile to this change, but I would consider the wiping of prepared statements to be a huge bug on Aurora's part. We've run into issues with "compatible" providers not handling prepared statements correctly before, e.g. PgBouncer, and in many cases they've actually agreed it was a bug/misfeature and fixed it.
We can merge a fix for this, but I'd like to see at least a little pressure to get this fixed upstream.
I agree, it's very surprising behaviour, and there seems to be no way of opting out whatsoever if you're using Aurora. I'm happy to open a support ticket with AWS on our organisation's behalf/complain to our account manager, but I'm afraid we're not significant enough to be in a position to actually exert any meaningful pressure on AWS.
FWIW, in the meantime we're using a cheap query in before_acquire which detects the same issue by virtue of being prepared itself: when the zero-downtime patch occurs, that probe query fails and the poisoned connections are incrementally evicted from the pool. The nice part is that this actually rescues the user's query without any need for a retry, but at the cost of an extra round-trip on every acquire.
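For anyone else hitting this, the workaround wires up roughly like the following sketch (hedged: callback signature per sqlx 0.7's `PoolOptions::before_acquire`; the DSN is a placeholder):

```rust
use sqlx::mysql::MySqlPoolOptions;

async fn make_pool() -> Result<sqlx::MySqlPool, sqlx::Error> {
    MySqlPoolOptions::new()
        // Run a cheap probe through the prepared-statement (binary)
        // protocol on every acquire. If the server has dropped its
        // prepared statements (e.g. an Aurora zero-downtime patch), the
        // probe fails; returning Ok(false) discards the poisoned
        // connection so the caller gets a fresh one instead.
        // Cost: one extra round-trip per acquire.
        .before_acquire(|conn, _meta| {
            Box::pin(async move {
                Ok(sqlx::query("SELECT 1").execute(&mut *conn).await.is_ok())
            })
        })
        .connect("mysql://user:pass@host/db") // placeholder DSN
        .await
}
```

`sqlx::query` goes through the binary protocol, so the probe exercises the same cached-statement path as real queries; a text-protocol ping would not detect the wipe.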
> I'm happy to open a support ticket with AWS on our organisation's behalf/complain to our account manager, but I'm afraid we're not significant enough to be in a position to actually exert any meaningful pressure on AWS.
That's fine, it generally takes many people complaining over time. As long as they're made aware of it. It's just become my pet peeve that all these third-party implementations advertise themselves as "compatible" with a database's wire protocol but then break spectacularly when you try to use any feature that isn't part of the basic query flow.