Source with multiple partitions doesn't come online after restart

Open def- opened this issue 1 year ago • 4 comments

What version of Materialize are you using?

c32ddfb9b71e

What is the issue?

Seen in https://buildkite.com/materialize/tests/builds/77698#018e1115-407e-4a9a-b4a3-8eb134a6dac2

2024-03-06 00:40:27 UTC	Docker compose failed: docker compose -f/dev/fd/3 --project-directory /var/lib/buildkite-agent/builds/buildkite-aarch64-b9f4a11-i-09b21bfcdd8bae702-1/materialize/tests/test/platform-checks exec -T testdrive testdrive --kafka-addr=kafka:9092 --schema-registry-url=http://schema-registry:8081 --materialize-url=postgres://materialize@materialized:6875 --materialize-internal-url=postgres://materialize@materialized:6877 --aws-endpoint=http://localstack:4566 --no-reset --materialize-param=statement_timeout='300s' --default-timeout=300s --seed=1 --persist-blob-url=file:///mzdata/persist/blob --persist-consensus-url=postgres://root@materialized:26257?options=--search_path=consensus --var=replicas=1 --var=default-replica-size=4-4 --var=default-storage-size=4-1 --source=/var/lib/buildkite-agent/builds/buildkite-aarch64-b9f4a11-i-09b21bfcdd8bae702-1/materialize/tests/misc/python/materialize/checks/all_checks/multiple_partitions.py:109
2024-03-06 00:40:27 UTC	13:1: error: non-matching rows: expected:
2024-03-06 00:40:27 UTC	[["running"]]
2024-03-06 00:40:27 UTC	got:
2024-03-06 00:40:27 UTC	[["starting"]]
2024-03-06 00:40:27 UTC	Poor diff:
2024-03-06 00:40:27 UTC	- running
2024-03-06 00:40:27 UTC	+ starting

I'll check whether it reproduces with bin/mzcompose --find platform-checks run default --scenario=RestartEnvironmentdClusterdStorage --check=MultiplePartitions, but it is probably flaky. Edit: I couldn't reproduce the issue.

CC @nrainer-materialize since you wrote the MultiplePartitions check. I'll disable the check for now since it is pretty flaky on the tests pipeline.

def- (Mar 06 '24 09:03)

I think this is not a regression, but a flake that got exposed by having faster testdrive via https://github.com/MaterializeInc/materialize/pull/25731

def- (Mar 06 '24 10:03)

> I think this is not a regression, but a flake that got exposed by having faster testdrive via #25731

Would a simple sleep help to get this fixed?

nrainer-materialize (Mar 06 '24 11:03)

I'm a bit confused because I thought testdrive would already wait here for a bit until the result becomes correct.
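
For illustration, the "wait until the result becomes correct" behavior can be sketched as a poll-until-match loop with a deadline. This is only a minimal sketch of the general technique, not testdrive's actual implementation; `wait_for_rows`, the `query` callable, and the polling interval are hypothetical names and values. The 300-second default mirrors the `--default-timeout=300s` flag in the log above.

```python
import time
from typing import Callable, List

Rows = List[List[str]]


def wait_for_rows(
    query: Callable[[], Rows],   # hypothetical: runs the status query and returns rows
    expected: Rows,
    timeout: float = 300.0,      # seconds, mirroring --default-timeout=300s
    interval: float = 0.1,       # assumed pause between attempts
) -> Rows:
    """Re-run `query` until it returns `expected` or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        rows = query()
        if rows == expected:
            return rows
        if time.monotonic() >= deadline:
            # Report the last observed rows, similar in spirit to the log above.
            raise TimeoutError(
                f"non-matching rows: expected {expected}, got {rows}"
            )
        time.sleep(interval)
```

Compared with a fixed sleep, a loop like this returns as soon as the source reports `running` and fails only if the status never changes within the timeout, so a failure after the full timeout suggests the source really stayed in `starting`.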

def- (Mar 06 '24 12:03)

Thanks for filing this! Added to the storage mega tracker as a p1.

benesch (Mar 07 '24 03:03)