
storage/sources/postgres: expected slot materialize_% to be inactive, is active

Open def- opened this issue 1 year ago • 8 comments

What version of Materialize are you using?

5ec8eb4d8d53

What is the issue?

Seen in Postgres CDC tests in https://buildkite.com/materialize/tests/builds/72831#018cf6cd-5a22-4c36-8192-cf50fa7eccd7

^^^ +++
pg-cdc.td:152:1: error: expected slot materialize_% to be inactive, is active
     |
  14 | > CREATE SECRET pgpa ... [rest of line truncated for security]
  20 |     PASSWORD SECRET  ... [rest of line truncated for security]
  29 |     PASSWORD SECRET  ... [rest of line truncated for security]
  34 | $ postgres-execute c ... [rest of line truncated for security]
 151 | #
 152 | $ postgres-verify-sl ... [rest of line truncated for security]
     | ^

I haven't seen this failure before.

ci-regexp: to be inactive, is active
ci-apply-to: pg-cdc

def- avatar Jan 11 '24 08:01 def-

Further occurrence: https://buildkite.com/materialize/tests/builds/72839#018cf7a5-a7f1-4ea7-940e-f1285977da2c

nrainer-materialize avatar Jan 11 '24 08:01 nrainer-materialize

I think this might come from https://github.com/MaterializeInc/materialize/pull/24161

def- avatar Jan 11 '24 08:01 def-

The issue here is that sources running on the default cluster don't signal to the upstream PG cluster that the replication slot is inactive as quickly as sources whose clusters get dropped. This means we are more likely to leak replication slots for dropped sources if they run on the default cluster. I need some time to see whether there's a better way to solve this (e.g. shortening the timeout within which PG expects to hear from us) rather than just lengthening the test timeout and the time we spend trying to drop the replication slot.
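A minimal sketch of the timeout idea, assuming the relevant knob is Postgres's wal_sender_timeout (the upstream terminates replication connections it hasn't heard from within that window); which setting Materialize would actually tune is not confirmed here:

```sql
-- Hedged sketch: lower the upstream's tolerance for silent replication
-- connections so dead walsenders (and their "active" slots) are noticed sooner.
-- wal_sender_timeout defaults to 60s and can be changed without a restart.
ALTER SYSTEM SET wal_sender_timeout = '10s';
SELECT pg_reload_conf();
```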

sploiselle avatar Jan 11 '24 19:01 sploiselle

This happened again: https://buildkite.com/materialize/tests/builds/73204#018d148c-a028-46b6-9d36-896597bebe0b

guswynn avatar Jan 16 '24 23:01 guswynn

Is it possible that this is a race like: https://github.com/MaterializeInc/materialize/issues/24396 ?

guswynn avatar Jan 17 '24 00:01 guswynn

@guswynn I don't think so, because the prior test waits for the replication slot to be created, so it's very unlikely we're allowing ourselves to recreate it after dropping it.

sploiselle avatar Jan 17 '24 13:01 sploiselle

I'm still seeing this happen fairly frequently on main. The last three builds in a row all hit it.

https://buildkite.com/materialize/tests/builds/73413#018d1de1-75f0-4fdf-85f4-12efc7f92378
https://buildkite.com/materialize/tests/builds/73406#018d1dd2-fe65-4891-84e9-90f6abc62c09
https://buildkite.com/materialize/tests/builds/73404#018d1db7-9b8d-44ac-bb1b-04520514b634

danhhz avatar Jan 18 '24 19:01 danhhz

The cause of this issue is a well-known and well-understood race condition in source rendering. @petrosagg has designed a solution to fix it, though I'm not sure there's anywhere on GitHub to link to.

Here's the history where this occurs:

  1. Testdrive executes test/pg-cdc/alter-source.td
  2. Creates mz_source_too, which sends a command to create an ingestion.
  3. Perform some other operations that succeed very quickly.
  4. Drop mz_source_too, which issues a command to drop its replication slot, but the replication slot does not exist yet.
  5. The command to create the source gets scheduled on the Timely cluster.
  6. When rendering the dataflow, it creates the replication slot.
  7. Tests continue to execute until we check for active replication slots in pg-cdc.td and fail.
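To make the end state concrete, here is a hedged illustration of what the upstream Postgres reports after step 6: the source is gone in Materialize, yet the late-created slot lingers and shows as active. The slot name in the cleanup statement is hypothetical.

```sql
-- The failing check looks at slot activity; a slot leaked by the race above
-- shows up like this even though the source has already been dropped.
SELECT slot_name, active, active_pid
FROM pg_replication_slots
WHERE slot_name LIKE 'materialize_%';

-- Manual cleanup only works once the slot is inactive;
-- pg_drop_replication_slot errors on an active slot.
SELECT pg_drop_replication_slot('materialize_abc123');  -- hypothetical slot name
```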

@guswynn was right that this is the same issue as #24396; I had originally, mistakenly, assumed it was a test that had executed more recently.

To resolve the CI flakes, I'm just going to disable this check for now.

sploiselle avatar Jan 21 '24 22:01 sploiselle