storage/sources/postgres: expected slot materialize_% to be inactive, is active
What version of Materialize are you using?
5ec8eb4d8d53
What is the issue?
Seen in Postgres CDC tests in https://buildkite.com/materialize/tests/builds/72831#018cf6cd-5a22-4c36-8192-cf50fa7eccd7
pg-cdc.td:152:1: error: expected slot materialize_% to be inactive, is active
|
14 | > CREATE SECRET pgpa ... [rest of line truncated for security]
20 | PASSWORD SECRET ... [rest of line truncated for security]
29 | PASSWORD SECRET ... [rest of line truncated for security]
34 | $ postgres-execute c ... [rest of line truncated for security]
151 | #
152 | $ postgres-verify-sl ... [rest of line truncated for security]
| ^
I haven't seen this failure before.
ci-regexp: to be inactive, is active
ci-apply-to: pg-cdc
Further occurrence: https://buildkite.com/materialize/tests/builds/72839#018cf7a5-a7f1-4ea7-940e-f1285977da2c
I think this might come from https://github.com/MaterializeInc/materialize/pull/24161
The issue here is that sources running on the default cluster don't signal to the upstream PG cluster that the replication slot is inactive as quickly as sources whose clusters are dropped. This means we are more likely to leak replication slots for dropped sources if they run on the default cluster. I need some time to see if there's a better way to solve this (e.g. shortening the timeout within which PG expects to hear from us) rather than just lengthening the test timeout and the time we spend trying to drop the replication slot.
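For context, the failing check boils down to asking the upstream Postgres whether any of our replication slots are still marked active. A minimal sketch of that query (the actual check is testdrive's `postgres-verify-slot` action; `pg_replication_slots` and its `active`/`active_pid` columns are standard Postgres):

```sql
-- List materialize_% replication slots that the upstream Postgres still
-- considers active; the test expects zero rows once the source is dropped.
SELECT slot_name, active, active_pid
FROM pg_replication_slots
WHERE slot_name LIKE 'materialize_%'
  AND active;
```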
This happened again: https://buildkite.com/materialize/tests/builds/73204#018d148c-a028-46b6-9d36-896597bebe0b
Is it possible that this is a race like https://github.com/MaterializeInc/materialize/issues/24396?
@guswynn I don't think so because the prior test awaits the replication slot to be created, so it's very unlikely we're allowing ourselves to recreate it after dropping it.
I'm still seeing this happen fairly frequently on main. The last three builds in a row all hit it.
https://buildkite.com/materialize/tests/builds/73413#018d1de1-75f0-4fdf-85f4-12efc7f92378
https://buildkite.com/materialize/tests/builds/73406#018d1dd2-fe65-4891-84e9-90f6abc62c09
https://buildkite.com/materialize/tests/builds/73404#018d1db7-9b8d-44ac-bb1b-04520514b634
The cause of this issue is a well-known and understood race condition in source rendering. @petrosagg has designed a solution to fix the issue here, though I'm not sure if there's anywhere on GitHub to link to.
Here's the history where this occurs (a rough SQL sketch of the sequence follows the list):
- Testdrive executes `test/pg-cdc/alter-source.td`.
- Creates `mz_source_too`, which sends a command to create an ingestion.
- Performs some other operations that succeed very quickly.
- Drops `mz_source_too`, which drops its replication slot, except the replication slot does not exist yet.
- The command to create the source gets scheduled on the Timely cluster.
- When rendering the dataflow, it creates the replication slot.
- Tests continue to execute until we check for active replication slots in `pg-cdc.td` and fail.
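To make that sequence concrete, here is a hypothetical sketch of the racing statements in Materialize SQL; the connection and publication names are placeholders, and the real statements live in `test/pg-cdc/alter-source.td`:

```sql
-- Placeholder names: pg_conn and the 'mz_source' publication are not the
-- identifiers the test actually uses.
CREATE SOURCE mz_source_too
  FROM POSTGRES CONNECTION pg_conn (PUBLICATION 'mz_source')
  FOR ALL TABLES;

-- Dropped almost immediately. The ingestion dataflow has not been rendered
-- yet, so there is no replication slot upstream to clean up.
DROP SOURCE mz_source_too CASCADE;

-- The create-ingestion command is only scheduled on the Timely cluster
-- afterwards; rendering the dataflow then creates the replication slot,
-- which the later active-slot check in pg-cdc.td trips over.
```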
@guswynn was right that this is the same issue as #24396; I had originally assumed, mistakenly, that it was a test that had executed more recently.
To resolve the CI flakes, I'm just going to disable this check for now.