pgcat icon indicating copy to clipboard operation
pgcat copied to clipboard

Cancellations seem to have bugs

Open JelteF opened this issue 2 years ago • 4 comments
trafficstars

Hi, I'm one of the PgBouncer maintainers and this looks like a really cool project!

Since I recently solved some really confusing and hard to find bug in PgBouncer I wanted to check if this same bug was solved here too. But after taking a quick look at the code, it seems like it has the same bug that PgBouncer had. The pgbouncer change (with description of the issue) I'm talking about is this one: https://github.com/pgbouncer/pgbouncer/pull/717

The regression test that PgBouncer has can be found here (I recently converted the tests to Python): https://github.com/pgbouncer/pgbouncer/blob/master/test/test_cancel.py#L85

PS. I have another cancellation related PR open on PgBouncer: https://github.com/pgbouncer/pgbouncer/pull/666 with an accompanying conference talk: https://www.youtube.com/watch?v=M585FfbboNA Load balancing across multiple pgcat instances would have this same issue that PgBouncer currently still has. It's less of an issue for you since your multicore strategy doesn't rely on multiple instances, but still if you have dedicated pgcat nodes with a TCP load balancer in front you will run into this issue.

JelteF avatar Jan 25 '23 08:01 JelteF

Hey, thanks for reporting. That's a great find!

I looked at the code and I think Pgcat handles it correctly. In this section https://github.com/levkk/pgcat/blob/main/src/client.rs#L578-L603 , Pgcat sends the cancel request directly to a specific database (bypassing the pools and the load balancing logic).

Each time a client is matched with a server, we update the mapping between the secret_key and the client here (https://github.com/levkk/pgcat/blob/main/src/client.rs#L799).

Putting these two pieces together means a client should always be able to send the correct secret key to the correct server connection.

This is just based on looking at the code so I could be wrong, It should be easy to reproduce this bug using a couple of replicas and Pgcat. I will try to reproduce the bug on Pgcat.

drdrsh avatar Jan 25 '23 15:01 drdrsh

Distributed Pgcat setups with multiple instances behind a load balancer are definitely affected by this bug because the secret keys are not shared between Pgcat instances.

drdrsh avatar Jan 25 '23 15:01 drdrsh

I don't think I explained the first issue clear enough (https://github.com/pgbouncer/pgbouncer/pull/717). It's a race condition where pgcat sends the the cancellation to the database but doesn't wait until it is handled by the database before allowing another client to use the database connection that is being cancelled:

  1. client1 sends q1
  2. pgcat sends q1 to the database
  3. client1 sends cancel
  4. pgcat receives cancel and forwards to database (but is not yet processed by database)
  5. pgcat receives response to q1
  6. client2 sends q2
  7. pgcat sends q2 over same connection as q1
  8. database processes cancel and q2 is cancelled

JelteF avatar Jan 25 '23 15:01 JelteF

I apologize, I was responding to the issue you described in the video 😓

Yep. that race condition definitely exists in Pgcat.

drdrsh avatar Jan 25 '23 15:01 drdrsh