panic at 'state diff should apply cleanly: "last_gc_req update didn't match: SeqNo(0) vs SeqNo(3)"'
What version of Materialize are you using?
main
How did you install Materialize?
Built from source
What is the issue?
Working locally, just screwing around with some LIR planning, one rebuild resulted in what appears to be a wedged state, in which environmentd cannot boot. I'm afraid I haven't captured the state to reproduce this (reportedly it would span postgres state as well as MZ and persist state).
I haven't seen this before, and --reset returns it to a booting state.
Relevant log output
mcsherry@gallustrate materialize % bin/environmentd --release
$ cargo build --release --bin storaged --bin computed --bin environmentd
Finished release [optimized + debuginfo] target(s) in 1.54s
warning: Existing computed process (PID 46040) will be reused
$ psql -At postgres://mcsherry@%2Ftmp -c 'CREATE SCHEMA IF NOT EXISTS consensus'
NOTICE: schema "consensus" already exists, skipping
CREATE SCHEMA
$ psql -At postgres://mcsherry@%2Ftmp -c 'CREATE SCHEMA IF NOT EXISTS adapter'
NOTICE: schema "adapter" already exists, skipping
CREATE SCHEMA
$ psql -At postgres://mcsherry@%2Ftmp -c 'CREATE SCHEMA IF NOT EXISTS storage'
NOTICE: schema "storage" already exists, skipping
CREATE SCHEMA
$ /Users/mcsherry/Projects/materialize/target/release/environmentd --persist-consensus-url=postgres://mcsherry@%2Ftmp?options=--search_path=consensus --persist-blob-url=file:///Users/mcsherry/Projects/materialize/mzdata/persist/blob --adapter-stash-url=postgres://mcsherry@%2Ftmp?options=--search_path=adapter --storage-stash-url=postgres://mcsherry@%2Ftmp?options=--search_path=storage --environment-id=local-az1-15aaa5c5-83bb-452f-a335-f519a493b6a8-0
booting server
environmentd v0.27.0-dev (5d62a210e)
OpenSSL 1.1.1q 5 Jul 2022
librdkafka v1.8.2
invoked as: MZ_ROOT=/Users/mcsherry/Projects/materialize /Users/mcsherry/Projects/materialize/target/release/environmentd '--persist-consensus-url=postgres://mcsherry@%2Ftmp?options=--search_path=consensus' '--persist-blob-url=file:///Users/mcsherry/Projects/materialize/mzdata/persist/blob' '--adapter-stash-url=postgres://mcsherry@%2Ftmp?options=--search_path=adapter' '--storage-stash-url=postgres://mcsherry@%2Ftmp?options=--search_path=storage' '--environment-id=local-az1-15aaa5c5-83bb-452f-a335-f519a493b6a8-0'
os: Mac OS 12.2.1 [64-bit]
cpus: 8 logical, 4 physical, 4 useful
cpu0: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz 0MHz
memory: 34359738368KB total, 26505273344KB used
swap: 2147483648KB total, 1059061760KB used
max log level: info
environmentd: 2022-09-20T19:19:03.433733Z INFO mz_adapter::catalog: create table materialize.public.part (u1)
environmentd: 2022-09-20T19:19:03.433833Z INFO mz_adapter::catalog: create table materialize.public.supplier (u2)
environmentd: 2022-09-20T19:19:03.433879Z INFO mz_adapter::catalog: create table materialize.public.partsupp (u3)
environmentd: 2022-09-20T19:19:03.439111Z INFO mz_adapter::catalog: create view materialize.public.q16 (u4)
environmentd: 2022-09-20T19:19:03.439271Z INFO mz_adapter::catalog: create index materialize.public.pk_part (u5)
environmentd: 2022-09-20T19:19:03.439312Z INFO mz_adapter::catalog: create index materialize.public.pk_partsupp (u6)
environmentd: 2022-09-20T19:19:03.439338Z INFO mz_adapter::catalog: create index materialize.public.pk_supplier (u7)
environmentd: 2022-09-20T19:19:03.439362Z INFO mz_adapter::catalog: create index materialize.public.fk_partsupp_partkey (u8)
environmentd: 2022-09-20T19:19:03.439395Z INFO mz_adapter::catalog: create index materialize.public.fk_partsupp_suppkey (u9)
environmentd: 2022-09-20T19:19:03.439421Z INFO mz_adapter::catalog: create index materialize.public.fk_supplier_nationkey (u10)
environmentd: 2022-09-20T19:19:03.439454Z INFO mz_adapter::catalog: create index materialize.public.q16_primary_idx (u11)
environmentd: 2022-09-20T19:19:03.450934Z INFO mz_adapter::catalog: create table materialize.public.part (u1)
environmentd: 2022-09-20T19:19:03.451042Z INFO mz_adapter::catalog: create table materialize.public.supplier (u2)
environmentd: 2022-09-20T19:19:03.451095Z INFO mz_adapter::catalog: create table materialize.public.partsupp (u3)
environmentd: 2022-09-20T19:19:03.457199Z INFO mz_adapter::catalog: create view materialize.public.q16 (u4)
environmentd: 2022-09-20T19:19:03.457370Z INFO mz_adapter::catalog: create index materialize.public.pk_part (u5)
environmentd: 2022-09-20T19:19:03.457407Z INFO mz_adapter::catalog: create index materialize.public.pk_partsupp (u6)
environmentd: 2022-09-20T19:19:03.457433Z INFO mz_adapter::catalog: create index materialize.public.pk_supplier (u7)
environmentd: 2022-09-20T19:19:03.457477Z INFO mz_adapter::catalog: create index materialize.public.fk_partsupp_partkey (u8)
environmentd: 2022-09-20T19:19:03.457536Z INFO mz_adapter::catalog: create index materialize.public.fk_partsupp_suppkey (u9)
environmentd: 2022-09-20T19:19:03.457570Z INFO mz_adapter::catalog: create index materialize.public.fk_supplier_nationkey (u10)
environmentd: 2022-09-20T19:19:03.457617Z INFO mz_adapter::catalog: create index materialize.public.q16_primary_idx (u11)
environmentd: 2022-09-20T19:19:03.488166Z INFO tokio_postgres::connection: NOTICE: relation "consensus" already exists, skipping
environmentd: 2022-09-20T19:19:03.666676Z INFO mz_compute_client::controller::replicated: starting replica task for 1
thread 'coordinator' panicked at 'state diff should apply cleanly: "last_gc_req update didn't match: SeqNo(0) vs SeqNo(3)"', /Users/mcsherry/Projects/materialize/src/persist-client/src/internal/state_diff.rs:205:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
zsh: abort bin/environmentd --release
mcsherry@gallustrate materialize %
Slack convo: https://materializeinc.slack.com/archives/CM7ATT65S/p1663701013693369
The repro got wiped, so the only thing we can do here is read through the code and think. My first guess is that somehow https://github.com/MaterializeInc/materialize/pull/14806/commits/508ed5271256991c46d590fed9a59da8691ca717 is accidentally not backward compatible. It's certainly suspicious that the field this happened to was last_gc_req. I haven't yet actually put together a concrete series of steps that gets us to this panic, though.
Okay, with an assist from @pH14, I think we've got this figured out. The summary is indeed exactly what I said above. Here are the details of how the bad case works:
- Some version of materialize before 508ed52 writes a rollup R1 with last_gc_req = 0
- It then makes some changes to the shard and eventually writes out a rollup R2 with last_gc_req = 3
- The diffs written out here do not include changes to last_gc_req (this is exactly the bug fixed by 508ed52)
- A persist machine in a new version of materialize (after 508ed52) starts up, begins at R2, and creates a diff with (last_gc_req = 3, -1), (last_gc_req = X, 1)
- Another machine starts reading the shard at R1 and progresses to the same SeqNo as R2, but via diffs (which, as mentioned above, are missing the last_gc_req changes). It now has the same state as what's written in R2, except that it thinks last_gc_req is 0 instead of 3.
- It gets that diff, (last_gc_req = 3, -1), (last_gc_req = X, 1), tries to apply it, and boom
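To make the failure mode concrete, here's a minimal sketch (not the actual persist-client code; `SeqNo`, `LastGcReqDiff`, and `apply_diff` are simplified stand-ins) of why a retraction/addition pair fails to apply when a reader's state has silently drifted from the writer's:

```rust
// A state diff records both the expected old value (the retraction) and the
// new value (the addition). Applying it is only sound if the reader's current
// value matches the retraction; otherwise the diff was built against a state
// the reader never reached.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SeqNo(u64);

/// Simplified stand-in for the (last_gc_req = old, -1), (last_gc_req = new, 1) pair.
struct LastGcReqDiff {
    old: SeqNo,
    new: SeqNo,
}

fn apply_diff(current: SeqNo, diff: &LastGcReqDiff) -> Result<SeqNo, String> {
    if current != diff.old {
        // Mirrors the panic message from state_diff.rs in the log above.
        return Err(format!(
            "last_gc_req update didn't match: {:?} vs {:?}",
            current, diff.old
        ));
    }
    Ok(diff.new)
}

fn main() {
    let diff = LastGcReqDiff { old: SeqNo(3), new: SeqNo(4) };

    // A reader that caught up via rollup R2 agrees with the retraction:
    assert_eq!(apply_diff(SeqNo(3), &diff), Ok(SeqNo(4)));

    // A reader that caught up from R1 via the incomplete diffs still thinks
    // last_gc_req is 0, so the same diff fails to apply:
    let err = apply_diff(SeqNo(0), &diff).unwrap_err();
    println!("{err}");
}
```

The real code panics rather than returning an error, since a mismatch here means the in-memory state is unrecoverably inconsistent with what's durably written.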
This history isn't possible without an old version of persist state. Luckily, we have a backward incompatible release next week, so we don't have to figure out a migration. We'll just revert 508ed52 for this week's release.
Thank you for debugging this. Let's close this issue after this week's release is out (with 508ed52 reverted).
"this week's release" has long since gone out :)