materialize icon indicating copy to clipboard operation
materialize copied to clipboard

panic at 'state diff should apply cleanly: "last_gc_req update didn't match: SeqNo(0) vs SeqNo(3)"'

Open frankmcsherry opened this issue 3 years ago • 3 comments

What version of Materialize are you using?

main

How did you install Materialize?

Built from source

What is the issue?

Working locally, just screwing around with some LIR planning, one rebuild resulted in what appears to be a wedged state, in which environmentd cannot boot. I'm afraid I haven't captured the state to reproduce this (reportedly it would span postgres state as well as MZ and persist state).

I haven't seen this before, and --reset returns it to a booting state.

Relevant log output

mcsherry@gallustrate materialize % bin/environmentd --release
$ cargo build --release --bin storaged --bin computed --bin environmentd
    Finished release [optimized + debuginfo] target(s) in 1.54s
warning: Existing computed process (PID 46040) will be reused
$ psql -At postgres://mcsherry@%2Ftmp -c 'CREATE SCHEMA IF NOT EXISTS consensus'
NOTICE:  schema "consensus" already exists, skipping
CREATE SCHEMA
$ psql -At postgres://mcsherry@%2Ftmp -c 'CREATE SCHEMA IF NOT EXISTS adapter'
NOTICE:  schema "adapter" already exists, skipping
CREATE SCHEMA
$ psql -At postgres://mcsherry@%2Ftmp -c 'CREATE SCHEMA IF NOT EXISTS storage'
NOTICE:  schema "storage" already exists, skipping
CREATE SCHEMA
$ /Users/mcsherry/Projects/materialize/target/release/environmentd --persist-consensus-url=postgres://mcsherry@%2Ftmp?options=--search_path=consensus --persist-blob-url=file:///Users/mcsherry/Projects/materialize/mzdata/persist/blob --adapter-stash-url=postgres://mcsherry@%2Ftmp?options=--search_path=adapter --storage-stash-url=postgres://mcsherry@%2Ftmp?options=--search_path=storage --environment-id=local-az1-15aaa5c5-83bb-452f-a335-f519a493b6a8-0
booting server
environmentd v0.27.0-dev (5d62a210e)
OpenSSL 1.1.1q  5 Jul 2022
librdkafka v1.8.2
invoked as: MZ_ROOT=/Users/mcsherry/Projects/materialize /Users/mcsherry/Projects/materialize/target/release/environmentd '--persist-consensus-url=postgres://mcsherry@%2Ftmp?options=--search_path=consensus' '--persist-blob-url=file:///Users/mcsherry/Projects/materialize/mzdata/persist/blob' '--adapter-stash-url=postgres://mcsherry@%2Ftmp?options=--search_path=adapter' '--storage-stash-url=postgres://mcsherry@%2Ftmp?options=--search_path=storage' '--environment-id=local-az1-15aaa5c5-83bb-452f-a335-f519a493b6a8-0'
os: Mac OS 12.2.1 [64-bit]
cpus: 8 logical, 4 physical, 4 useful
cpu0: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz 0MHz
memory: 34359738368KB total, 26505273344KB used
swap: 2147483648KB total, 1059061760KB used
max log level: info
environmentd: 2022-09-20T19:19:03.433733Z  INFO mz_adapter::catalog: create table materialize.public.part (u1)
environmentd: 2022-09-20T19:19:03.433833Z  INFO mz_adapter::catalog: create table materialize.public.supplier (u2)
environmentd: 2022-09-20T19:19:03.433879Z  INFO mz_adapter::catalog: create table materialize.public.partsupp (u3)
environmentd: 2022-09-20T19:19:03.439111Z  INFO mz_adapter::catalog: create view materialize.public.q16 (u4)
environmentd: 2022-09-20T19:19:03.439271Z  INFO mz_adapter::catalog: create index materialize.public.pk_part (u5)
environmentd: 2022-09-20T19:19:03.439312Z  INFO mz_adapter::catalog: create index materialize.public.pk_partsupp (u6)
environmentd: 2022-09-20T19:19:03.439338Z  INFO mz_adapter::catalog: create index materialize.public.pk_supplier (u7)
environmentd: 2022-09-20T19:19:03.439362Z  INFO mz_adapter::catalog: create index materialize.public.fk_partsupp_partkey (u8)
environmentd: 2022-09-20T19:19:03.439395Z  INFO mz_adapter::catalog: create index materialize.public.fk_partsupp_suppkey (u9)
environmentd: 2022-09-20T19:19:03.439421Z  INFO mz_adapter::catalog: create index materialize.public.fk_supplier_nationkey (u10)
environmentd: 2022-09-20T19:19:03.439454Z  INFO mz_adapter::catalog: create index materialize.public.q16_primary_idx (u11)
environmentd: 2022-09-20T19:19:03.450934Z  INFO mz_adapter::catalog: create table materialize.public.part (u1)
environmentd: 2022-09-20T19:19:03.451042Z  INFO mz_adapter::catalog: create table materialize.public.supplier (u2)
environmentd: 2022-09-20T19:19:03.451095Z  INFO mz_adapter::catalog: create table materialize.public.partsupp (u3)
environmentd: 2022-09-20T19:19:03.457199Z  INFO mz_adapter::catalog: create view materialize.public.q16 (u4)
environmentd: 2022-09-20T19:19:03.457370Z  INFO mz_adapter::catalog: create index materialize.public.pk_part (u5)
environmentd: 2022-09-20T19:19:03.457407Z  INFO mz_adapter::catalog: create index materialize.public.pk_partsupp (u6)
environmentd: 2022-09-20T19:19:03.457433Z  INFO mz_adapter::catalog: create index materialize.public.pk_supplier (u7)
environmentd: 2022-09-20T19:19:03.457477Z  INFO mz_adapter::catalog: create index materialize.public.fk_partsupp_partkey (u8)
environmentd: 2022-09-20T19:19:03.457536Z  INFO mz_adapter::catalog: create index materialize.public.fk_partsupp_suppkey (u9)
environmentd: 2022-09-20T19:19:03.457570Z  INFO mz_adapter::catalog: create index materialize.public.fk_supplier_nationkey (u10)
environmentd: 2022-09-20T19:19:03.457617Z  INFO mz_adapter::catalog: create index materialize.public.q16_primary_idx (u11)
environmentd: 2022-09-20T19:19:03.488166Z  INFO tokio_postgres::connection: NOTICE: relation "consensus" already exists, skipping    
environmentd: 2022-09-20T19:19:03.666676Z  INFO mz_compute_client::controller::replicated: starting replica task for 1
thread 'coordinator' panicked at 'state diff should apply cleanly: "last_gc_req update didn't match: SeqNo(0) vs SeqNo(3)"', /Users/mcsherry/Projects/materialize/src/persist-client/src/internal/state_diff.rs:205:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
zsh: abort      bin/environmentd --release
mcsherry@gallustrate materialize % 

Slack convo: https://materializeinc.slack.com/archives/CM7ATT65S/p1663701013693369

frankmcsherry avatar Sep 20 '22 19:09 frankmcsherry

The repro got wiped, so the only thing we can do here is read through the code and think. My first guess is that somehow https://github.com/MaterializeInc/materialize/pull/14806/commits/508ed5271256991c46d590fed9a59da8691ca717 is accidentally not backward compatible. It's certainly suspicious that the field this happened to was last_gc_req. I haven't yet actually put together a concrete series of steps that gets us to this panic, though.

danhhz avatar Sep 20 '22 20:09 danhhz

Okay, with an assist from @pH14, I think we've got this figured out. The summary is indeed exactly what I said above. Here's the details for how the bad case works:

  • Some version of materialize before 508ed52 writes a rollup R1 with last_gc_req = 0
  • It then makes some changes to the shard and eventually writes out a rollup R2 with last_gc_req = 3
    • The diffs written out here do not include changes to last_gc_req (this is exactly the bug fixed by 508ed52)
  • A persist machine in a new version of materialize (after 508ed52) starts up, begins at R2 and creates a diff with (last_gc_req = 3, -1), (last_gc_req = X, 1)
  • Another machine starts reading the shard at R1 and progresses to the same SeqNo as R2 but via diffs (as mentioned above, they're missing the last_gc_req changes). It now has the same state as what's written in R2 except that it thinks last_gc_req is 0 instead of 3.
    • It gets this (last_gc_req = 3, -1), (last_gc_req = X, 1), tries to apply it, and boom

This history isn't possible without an old version of persist state. Luckily, we have a backward incompatible release next week, so we don't have to figure out a migration. We'll just revert 508ed52 for this week's release.

danhhz avatar Sep 20 '22 20:09 danhhz

Thank you for debugging this. Let's close this issue after this week's release is out (with 508ed52 reverted).

uce avatar Sep 20 '22 21:09 uce

"this week's release" has long since gone out :)

danhhz avatar Oct 19 '22 14:10 danhhz