Source with multiple partitions doesn't come online after restart
What version of Materialize are you using?
c32ddfb9b71e
What is the issue?
Seen in https://buildkite.com/materialize/tests/builds/77698#018e1115-407e-4a9a-b4a3-8eb134a6dac2
2024-03-06 00:40:27 UTC Docker compose failed: docker compose -f/dev/fd/3 --project-directory /var/lib/buildkite-agent/builds/buildkite-aarch64-b9f4a11-i-09b21bfcdd8bae702-1/materialize/tests/test/platform-checks exec -T testdrive testdrive --kafka-addr=kafka:9092 --schema-registry-url=http://schema-registry:8081 --materialize-url=postgres://materialize@materialized:6875 --materialize-internal-url=postgres://materialize@materialized:6877 --aws-endpoint=http://localstack:4566 --no-reset --materialize-param=statement_timeout='300s' --default-timeout=300s --seed=1 --persist-blob-url=file:///mzdata/persist/blob --persist-consensus-url=postgres://root@materialized:26257?options=--search_path=consensus --var=replicas=1 --var=default-replica-size=4-4 --var=default-storage-size=4-1 --source=/var/lib/buildkite-agent/builds/buildkite-aarch64-b9f4a11-i-09b21bfcdd8bae702-1/materialize/tests/misc/python/materialize/checks/all_checks/multiple_partitions.py:109
2024-03-06 00:40:27 UTC 13:1: error: non-matching rows: expected:
2024-03-06 00:40:27 UTC [["running"]]
2024-03-06 00:40:27 UTC got:
2024-03-06 00:40:27 UTC [["starting"]]
2024-03-06 00:40:27 UTC Poor diff:
2024-03-06 00:40:27 UTC - running
2024-03-06 00:40:27 UTC + starting
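For context, the "running"/"starting" strings in the diff are source statuses as reported by Materialize's catalog. A minimal sketch of the kind of query the check presumably runs (the mz_internal.mz_source_statuses table and the source name below are assumptions for illustration; the actual query lives in multiple_partitions.py):

```python
# Sketch only: the table and source name below are assumptions, not what
# multiple_partitions.py literally runs. It illustrates where the
# "starting" / "running" strings in the diff come from.
import psycopg2

conn = psycopg2.connect("postgres://materialize@materialized:6875/materialize")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        "SELECT status FROM mz_internal.mz_source_statuses "
        "WHERE name = 'multiple_partitions_source'"  # illustrative source name
    )
    print(cur.fetchall())  # e.g. [('starting',)] right after restart, [('running',)] once the source is back up
conn.close()
```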
I'll check if it reproduces with bin/mzcompose --find platform-checks run default --scenario=RestartEnvironmentdClusterdStorage --check=MultiplePartitions, but it's probably a flake. Edit: I couldn't reproduce the issue.
CC @nrainer-materialize since you wrote the MultiplePartitions check. I'll disable the check for now since this is pretty flaky in the tests pipeline.
I think this is not a regression, but a flake that was exposed by testdrive getting faster via https://github.com/MaterializeInc/materialize/pull/25731
Would a simple sleep help to get this fixed?
I'm a bit confused because I thought testdrive would already wait here for a bit until the result becomes correct.
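testdrive does retry queries until the result matches or the default timeout elapses (here --default-timeout=300s), so a fixed sleep shouldn't normally be needed; the question is whether the source ever reaches "running" within that window. A minimal sketch of that retry-until-timeout behavior, reusing the assumed status query from above (this is not testdrive's actual implementation, just the behavior being discussed):

```python
# Sketch of retry-until-timeout polling, assuming the same status query as
# above; illustrates why a single fixed sleep adds little on top of retries.
import time

import psycopg2


def wait_for_source_running(dsn: str, source_name: str, timeout_s: float = 300.0) -> None:
    """Re-run the status query until it returns 'running' or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            while True:
                cur.execute(
                    "SELECT status FROM mz_internal.mz_source_statuses WHERE name = %s",
                    (source_name,),
                )
                row = cur.fetchone()
                if row is not None and row[0] == "running":
                    return
                if time.monotonic() > deadline:
                    status = row[0] if row else "<absent>"
                    raise TimeoutError(
                        f"source {source_name!r} still {status} after {timeout_s}s"
                    )
                time.sleep(1.0)  # retry instead of a single fixed sleep
    finally:
        conn.close()
```

If the failure happens even though the query is retried for the full timeout, the source genuinely never left "starting" after the restart, and an extra sleep would not have helped either.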
Thanks for filing this! Added to the storage mega tracker as a p1.