
Load generator source crashed and failed to restart

Open benesch opened this issue 2 years ago • 9 comments

What version of Materialize are you using?

v0.27.0-alpha.8

How did you install Materialize?

Materialize Cloud

What is the issue?

I created an auction load generator source:

CREATE SOURCE lg FROM LOAD GENERATOR auction;

I let it run for about 30m. It crashed. I don’t have the logs from the first crash, unfortunately. Kubernetes doesn’t save them and I can’t find them in Honeycomb. Upon restart the storaged process entered a crash loop:

2022-07-30T06:43:34.749299Z  INFO storaged: about to bind to 0.0.0.0:2100
2022-07-30T06:43:34.749324Z  INFO storaged: serving internal HTTP server on 0.0.0.0:6878
2022-07-30T06:43:34.752799Z  INFO mz_service::grpc: Starting to listen on 0.0.0.0:2100
2022-07-30T06:43:56.166688Z  INFO mz_service::grpc: GrpcServer: remote client connected
2022-07-30T06:43:56.186778Z  INFO send_operation{operation="GetObject" service="s3"}:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: provider in chain did not provide credentials provider=Environment context=environment variable not set
2022-07-30T06:43:56.186824Z  INFO send_operation{operation="GetObject" service="s3"}:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: provider in chain did not provide credentials provider=Profile context=No profiles were defined
2022-07-30T06:43:56.187093Z  INFO send_operation{operation="GetObject" service="s3"}:provide_credentials{provider=default_chain}:send_operation{operation="AssumeRoleWithWebIdentity" service="sts"}: aws_http::auth: provider returned CredentialsNotLoaded, ignoring
2022-07-30T06:43:56.413516Z  INFO send_operation{operation="GetObject" service="s3"}:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: loaded credentials provider=WebIdentityToken
2022-07-30T06:43:56.811398Z  INFO tokio_postgres::connection: NOTICE: relation "consensus" already exists, skipping    
thread 'timely:work-0' panicked at 'invalid as_of: as_of(Antichain { elements: [0] }) < since(Antichain { elements: [1659163020001] })', src/storage/src/source/reclock.rs:92:9
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at ./rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/panicking.rs:142:14
   2: mz_storage::source::reclock::ReclockOperator::new::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0dc80afdafce99961-1/materialize/tests/src/storage/src/source/reclock.rs:92:9
   3: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at ./rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/future/mod.rs:91:19
   4: mz_storage::source::create_raw_source::{{closure}}::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0dc80afdafce99961-1/materialize/tests/src/storage/src/source/mod.rs:742:41
   5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at ./rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/future/mod.rs:91:19
   6: <async_stream::async_stream::AsyncStream<T,U> as futures_core::stream::Stream>::poll_next
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/async-stream-0.3.3/src/async_stream.rs:53:17
   7: mz_storage::source::create_raw_source::{{closure}}::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0dc80afdafce99961-1/materialize/tests/src/storage/src/source/mod.rs:890:50
   8: mz_storage::source::util::source::{{closure}}::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0dc80afdafce99961-1/materialize/tests/src/storage/src/source/util.rs:103:17
   9: timely::dataflow::operators::generic::builder_rc::OperatorBuilder<G>::build::{{closure}}::{{closure}}
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/dataflow/operators/generic/builder_rc.rs:123:31
  10: timely::dataflow::operators::generic::builder_rc::OperatorBuilder<G>::build_reschedule::{{closure}}
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/dataflow/operators/generic/builder_rc.rs:162:26
  11: <timely::dataflow::operators::generic::builder_raw::OperatorCore<T,L> as timely::scheduling::Schedule>::schedule
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/dataflow/operators/generic/builder_raw.rs:204:9
  12: timely::progress::subgraph::PerOperatorState<T>::schedule
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/progress/subgraph.rs:684:30
  13: timely::progress::subgraph::Subgraph<TOuter,TInner>::activate_child
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/progress/subgraph.rs:344:26
  14: <timely::progress::subgraph::Subgraph<TOuter,TInner> as timely::scheduling::Schedule>::schedule
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/progress/subgraph.rs:310:17
  15: timely::progress::subgraph::PerOperatorState<T>::schedule
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/progress/subgraph.rs:684:30
  16: timely::progress::subgraph::Subgraph<TOuter,TInner>::activate_child
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/progress/subgraph.rs:344:26
  17: <timely::progress::subgraph::Subgraph<TOuter,TInner> as timely::scheduling::Schedule>::schedule
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/progress/subgraph.rs:310:17
  18: timely::worker::Wrapper::step::{{closure}}
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/worker.rs:754:57
  19: core::option::Option<T>::map
             at ./rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/option.rs:909:29
  20: timely::worker::Wrapper::step
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/worker.rs:754:26
  21: timely::worker::Worker<A>::step_or_park
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/worker.rs:396:38
  22: mz_storage::storage_state::Worker<A>::run_client
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0dc80afdafce99961-1/materialize/tests/src/storage/src/storage_state.rs:109:13
  23: mz_storage::storage_state::Worker<A>::run
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0dc80afdafce99961-1/materialize/tests/src/storage/src/storage_state.rs:93:33
  24: mz_storage::protocol::server::serve::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0dc80afdafce99961-1/materialize/tests/src/storage/src/protocol/server.rs:87:9
  25: timely::execute::execute::{{closure}}
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/timely/src/execute.rs:287:22
  26: timely_communication::initialize::initialize_from::{{closure}}
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/ee65cdc/communication/src/initialize.rs:316:33
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Stream closed EOF for environment-4777bfc1-2bd8-4a13-a5dd-a44928652335-0/storage-u1-0 (default)

I suspect that load generator sources do not restart correctly. Hopefully this readily repros locally.

cc @mjibson

Relevant log output

No response

benesch avatar Jul 30 '22 16:07 benesch

This doesn't repro locally with the process orchestrator. I then started a local kind cluster, brought Materialize up in it, and ran a COPY TAIL in a kubectl exec psql terminal. In k9s I killed the environmentd pod, which disconnected my session, but on reconnect the tail was fine. Then I killed the storaged pod. This caused the tail output to stop producing new rows for a few seconds, and then the rows resumed exactly as expected.

Unless we have a better way to repro this, I'm not sure there's much else I can do. Is it possible this wasn't a load generator bug, but an as_of/since problem? We've had some changes there recently.
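
For context, the panic in the original report is the reclocking code enforcing that a source can only be read at an as_of that is at least its since. A minimal sketch of that invariant using timely's Antichain and PartialOrder (the function name and the plain u64 timestamp type are assumptions, not the actual ReclockOperator code):

use timely::progress::Antichain;
use timely::PartialOrder;

// Hypothetical check mirroring the panic message at reclock.rs:92:
// reading at `as_of` is only valid if `since <= as_of`.
fn validate_as_of(as_of: &Antichain<u64>, since: &Antichain<u64>) {
    assert!(
        PartialOrder::less_equal(since, as_of),
        "invalid as_of: as_of({:?}) < since({:?})",
        as_of,
        since
    );
}

fn main() {
    // Values from the crash: as_of = [0], since = [1659163020001].
    let as_of = Antichain::from_elem(0u64);
    let since = Antichain::from_elem(1_659_163_020_001u64);
    validate_as_of(&as_of, &since); // panics with the same shape of message
}

An as_of of 0 on restart suggests the source came back asking to read from the beginning of time even though the shard's since had already advanced to a recent wall-clock timestamp.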

maddyblue avatar Aug 02 '22 05:08 maddyblue

Let me give a repro a shot. Cockroach started applying a maintenance update at the exact moment this crashed, so I'm going to try restarting PostgreSQL locally and seeing if that triggers the same crash.

benesch avatar Aug 02 '22 16:08 benesch

Ok, here's something interesting:

$ bin/environmentd --reset
$ psql -h localhost -p 6875 -U materialize materialize
> CREATE SOURCE lg FROM LOAD GENERATOR auction
> SELECT * FROM lg
# Wait a minute.
$ killall storaged
$ psql -h localhost -p 6875 -U materialize materialize
> SELECT * FROM lg
# This hangs for however long it takes storaged to catch back up. If storaged was up for a minute before being killed, it seems to take about a minute to catch back up.

I'm going to try running these steps again but waiting 10m to kill storaged. That might expose the crash loop I was seeing in cloud.

benesch avatar Aug 02 '22 16:08 benesch

I did observe that wait-as-long-as-it-was-previously-up behavior. I had assumed it was just storaged startup time, but it did feel longer than I expected. https://github.com/MaterializeInc/materialize/blob/9850d790a6e9f694bc6c04097eef1bc65a1f4752/src/storage/src/source/generator.rs#L84 is supposed to handle this, but it assumes the offset is set to the number of Finalized messages sent back in all previous executions. Is that not the case?
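
In other words, the resumption logic is roughly this shape: regenerate the stream from scratch and skip ahead by the resume offset. A minimal sketch (MzOffset and fast_forward here are simplified stand-ins, not the actual generator.rs code):

struct MzOffset {
    offset: u64,
}

// On restart, skip as many rows as were already emitted in previous runs.
// Every skipped row still has to be regenerated, so the cost of this loop
// grows with how long the source had been running before it was killed.
fn fast_forward<I: Iterator>(rows: &mut I, resume: &MzOffset) {
    for _ in 0..resume.offset {
        let _ = rows.next();
    }
}

fn main() {
    // Stand-in for the counter generator, which starts at 1.
    let mut rows = 1u64..;
    let resume = MzOffset { offset: 457 };
    fast_forward(&mut rows, &resume);
    assert_eq!(rows.next(), Some(458)); // picks up right after the last emitted row
}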

maddyblue avatar Aug 02 '22 17:08 maddyblue

Is that not the case?

I don't know! But I assume not. I just did an experiment where I waited 18m between starting and killing storaged. Didn't cause the crash I saw in cloud, but my SELECT against the source is now quite hung. EXPLAIN TIMESTAMP shows an 18m skew between the current time and the source upper.

benesch avatar Aug 02 '22 17:08 benesch

I'll look into this.

maddyblue avatar Aug 02 '22 17:08 maddyblue

Adding some debugs:

        dbg!(&offset);
        let start = Instant::now();
        for _ in 0..offset.offset {
            dbg!(rows.next());
        }
        dbg!(start.elapsed());

After letting the counter load generator run into the 400s at a 1-second tick interval, this prints:

[src/storage/src/source/generator.rs:84] &offset = MzOffset {
    offset: 457,
}
...
[src/storage/src/source/generator.rs:87] rows.next() = Some(
    Row{[
        Int64(
            456,
        ),
    ]},
)
[src/storage/src/source/generator.rs:87] rows.next() = Some(
    Row{[
        Int64(
            457,
        ),
    ]},
)
[src/storage/src/source/generator.rs:89] start.elapsed() = 37.418ms

And running a COPY TAIL of that source starts producing results again only a few seconds after startup, so this is not obviously the generator's problem. Repeating this a few times, though, the restart was sometimes quite slow. I noticed this in the logs:

compute-cluster-1-replica-1: 2022-08-02T20:33:14.636149Z  INFO mz_persist_client::r#impl::machine: snapshot s20d52d65-76ba-4ff8-a719-5fd042a18949 as of Antichain { elements: [1659472403778] } not yet available for upper Antichain { elements: [1659472394001] } retrying in 64ms
compute-cluster-1-replica-1: 2022-08-02T20:33:14.708200Z  INFO mz_persist_client::r#impl::machine: snapshot s20d52d65-76ba-4ff8-a719-5fd042a18949 as of Antichain { elements: [1659472403778] } not yet available for upper Antichain { elements: [1659472394001] } retrying in 128ms
compute-cluster-1-replica-1: 2022-08-02T20:33:14.848384Z  INFO mz_persist_client::r#impl::machine: snapshot s20d52d65-76ba-4ff8-a719-5fd042a18949 as of Antichain { elements: [1659472403778] } not yet available for upper Antichain { elements: [1659472395001] } retrying in 256ms
compute-cluster-1-replica-1: 2022-08-02T20:33:15.129814Z  INFO mz_persist_client::r#impl::machine: snapshot s20d52d65-76ba-4ff8-a719-5fd042a18949 as of Antichain { elements: [1659472403778] } not yet available for upper Antichain { elements: [1659472395001] } retrying in 512ms
compute-cluster-1-replica-1: 2022-08-02T20:33:15.626943Z  INFO mz_persist_client::r#impl::machine: snapshot s20d52d65-76ba-4ff8-a719-5fd042a18949 as of Antichain { elements: [1659472403778] } not yet available for upper Antichain { elements: [1659472395001] } retrying in 1.024s
compute-cluster-1-replica-1: 2022-08-02T20:33:16.735004Z  INFO mz_persist_client::r#impl::machine: snapshot s20d52d65-76ba-4ff8-a719-5fd042a18949 as of Antichain { elements: [1659472403778] } not yet available for upper Antichain { elements: [1659472396001] } retrying in 2.048s
compute-cluster-1-replica-1: 2022-08-02T20:33:18.608365Z  INFO mz_persist_client::r#impl::machine: snapshot s20d52d65-76ba-4ff8-a719-5fd042a18949 as of Antichain { elements: [1659472403778] } not yet available for upper Antichain { elements: [1659472398001] } retrying in 4.096s
compute-cluster-1-replica-1: 2022-08-02T20:33:22.956492Z  INFO mz_persist_client::r#impl::machine: snapshot s20d52d65-76ba-4ff8-a719-5fd042a18949 as of Antichain { elements: [1659472403778] } not yet available for upper Antichain { elements: [1659472403001] } retrying in 8.192s

Seems that computed is doing an exponential retry backoff, which I do think is responsible for this.
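
For reference, the retry pattern in those log lines is a standard doubling backoff, roughly the following shape (a sketch only; wait_for_upper and the plain u64 timestamps are assumptions, not the persist client's actual API):

use std::time::Duration;

// Poll until the shard's upper has advanced past the requested as_of,
// doubling the sleep between attempts (64ms, 128ms, 256ms, ... as in the logs).
fn wait_for_upper(as_of: u64, mut current_upper: impl FnMut() -> u64) {
    let mut backoff = Duration::from_millis(64);
    while current_upper() <= as_of {
        println!("snapshot as of {as_of} not yet available, retrying in {backoff:?}");
        std::thread::sleep(backoff);
        backoff *= 2;
    }
}

Because the sleep doubles on each attempt, once the upper is close to the as_of the next retry can overshoot by several seconds, which would add delay beyond the catch-up time itself.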

maddyblue avatar Aug 02 '22 20:08 maddyblue

Seems that computed is doing an exponential retry backoff, which I do think is responsible for this.

@mjibson would you like the compute team to take a look at this? (cc @chaas)

JLDLaughlin avatar Aug 04 '22 15:08 JLDLaughlin

Restarting cockroach after creating the source spits out a bunch of "environmentd: 2022-08-08T23:36:37.849813Z WARN mz_storage::controller::persist_write_handles: Retrying indeterminate table write: Indeterminate { inner: error connecting to server: Connection refused (os error 111)" warnings, but then resumes normally. It's possible this got fixed or is triggered in some other way. I'm not sure if the compute backoff is a real bug or not. Someone from that team can decide!

maddyblue avatar Aug 08 '22 23:08 maddyblue

I made an auction source, then ran COPY (TAIL demo) TO STDOUT in a terminal and got some rows. Then I ran pkill storaged, which paused the tail; I can see in the logs that storaged restarts after some seconds, and the tail stream resumes as expected. Then pkill computed and the same thing: the tail pauses, computed restarts, and the tail resumes as normal. I'm currently out of ideas for how to repro this locally.

maddyblue avatar Aug 10 '22 22:08 maddyblue

One thing we haven't tried (I think) is killing Cockroach, waiting a few minutes, and then turning it back on. Maybe something goes wrong like that.

benesch avatar Aug 11 '22 04:08 benesch

I killed Cockroach for 4 minutes and then restarted it, with the load generator running at a 50ms tick. The tail resumed a few seconds later, and I didn't see any crashes in the console. One thing that did happen, though, is that the tail had queued all of the data generated over those minutes. But even at a 50ms interval that shouldn't be enough to OOM anything that's queuing it.
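
A quick back-of-the-envelope check of that last claim (assuming roughly one generated row per 50ms tick; the exact per-tick row count depends on the generator):

fn main() {
    let downtime_secs = 4 * 60; // Cockroach was down for about 4 minutes
    let tick_ms = 50; // load generator tick interval
    let rows_queued = downtime_secs * 1000 / tick_ms;
    println!("~{rows_queued} rows queued"); // ~4800 rows, far too little to OOM anything
}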

maddyblue avatar Aug 11 '22 16:08 maddyblue

Removed the triage label as this is on the project backlog and discussions are happening.

uce avatar Sep 27 '22 10:09 uce

I want to re-raise this, as I just ran into this bug following the getting started guide, and I think many (all, even?) people will run into it, @uce.

I can also confirm it still happens with v0.28.0. The error message looks slightly different now, but I'm quite sure it's the same bug:

> kubectl logs --context mzcloud-staging-eu-west-1-0 -n environment-da15e202-1c10-4bf8-8a84-a6994358457d-0 storage-u6-0 --previous
2022-10-19T09:17:51.591939Z  INFO storaged: about to bind to 0.0.0.0:2100
2022-10-19T09:17:51.591984Z  INFO storaged: serving internal HTTP server on 0.0.0.0:6878
2022-10-19T09:17:51.597586Z  INFO mz_service::grpc: Starting to listen on 0.0.0.0:2100
2022-10-19T09:18:04.958248Z  INFO mz_service::grpc: GrpcServer: remote client connected
2022-10-19T09:18:05.725990Z  INFO provide_credentials{provider=default_chain}: aws_http::auth: provider returned CredentialsNotLoaded, ignoring
2022-10-19T09:18:06.287845Z  INFO tokio_postgres::connection: NOTICE: relation "consensus" already exists, skipping    
2022-10-19T09:18:07.712344Z  INFO mz_persist_client::internal::maintenance: Force expiring reader (r485af5b0-4531-416a-b005-9eca1b2a9fc9) of shard (sd7a4d70f-644b-4160-8a52-fc37a63f5def) due to inactivity
2022-10-19T09:18:07.712455Z  INFO mz_persist_client::internal::maintenance: Force expiring reader (r536ad54b-0d29-4b98-8766-d61efb0b0190) of shard (sd7a4d70f-644b-4160-8a52-fc37a63f5def) due to inactivity
2022-10-19T09:18:07.712532Z  INFO mz_persist_client::internal::maintenance: Force expiring writer (w9203d424-52b1-451e-b965-0ed9e1b83112) of shard (sd7a4d70f-644b-4160-8a52-fc37a63f5def) due to inactivity
2022-10-19T09:18:07.712609Z  INFO mz_persist_client::internal::maintenance: Force expiring writer (wfed998fe-ccdb-4d3a-aa53-75985fc407eb) of shard (sd7a4d70f-644b-4160-8a52-fc37a63f5def) due to inactivity
thread 'timely:work-0' panicked at 'assertion failed: PartialOrder::less_equal(&self.since, &new_since)', src/storage/src/source/reclock.rs:537:9
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at ./rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/panicking.rs:142:14
   2: core::panicking::panic
             at ./rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/panicking.rs:48:5
   3: mz_storage::source::reclock::ReclockOperator::compact::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-01110d61e963703c7-1/materialize/tests/src/storage/src/source/reclock.rs:537:9
   4: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at ./rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/future/mod.rs:91:19
   5: mz_storage::source::source_reader_pipeline::remap_operator::{{closure}}::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-01110d61e963703c7-1/materialize/tests/src/storage/src/source/source_reader_pipeline.rs:966:70
   6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at ./rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/future/mod.rs:91:19
   7: <futures_util::future::future::fuse::Fuse<Fut> as core::future::future::Future>::poll
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.24/src/future/future/fuse.rs:86:37
   8: <core::pin::Pin<P> as core::future::future::Future>::poll
             at ./rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/future/future.rs:124:9
   9: <timely::dataflow::operators::generic::builder_rc::OperatorBuilder<G> as mz_timely_util::operators_async_ext::OperatorBuilderExt<G>>::build_async::{{closure}}::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-01110d61e963703c7-1/materialize/tests/src/timely-util/src/operators_async_ext.rs:165:43
  10: timely::dataflow::operators::generic::builder_rc::OperatorBuilder<G>::build_reschedule::{{closure}}
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/timely/src/dataflow/operators/generic/builder_rc.rs:162:26
  11: <timely::dataflow::operators::generic::builder_raw::OperatorCore<T,L> as timely::scheduling::Schedule>::schedule
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/timely/src/dataflow/operators/generic/builder_raw.rs:204:9
  12: timely::progress::subgraph::PerOperatorState<T>::schedule
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/timely/src/progress/subgraph.rs:684:30
  13: timely::progress::subgraph::Subgraph<TOuter,TInner>::activate_child
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/timely/src/progress/subgraph.rs:344:26
  14: <timely::progress::subgraph::Subgraph<TOuter,TInner> as timely::scheduling::Schedule>::schedule
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/timely/src/progress/subgraph.rs:310:17
  15: timely::progress::subgraph::PerOperatorState<T>::schedule
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/timely/src/progress/subgraph.rs:684:30
  16: timely::progress::subgraph::Subgraph<TOuter,TInner>::activate_child
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/timely/src/progress/subgraph.rs:344:26
  17: <timely::progress::subgraph::Subgraph<TOuter,TInner> as timely::scheduling::Schedule>::schedule
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/timely/src/progress/subgraph.rs:310:17
  18: timely::worker::Wrapper::step::{{closure}}
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/timely/src/worker.rs:754:57
  19: core::option::Option<T>::map
             at ./rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/option.rs:929:29
  20: timely::worker::Wrapper::step
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/timely/src/worker.rs:754:26
  21: timely::worker::Worker<A>::step_or_park
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/timely/src/worker.rs:396:38
  22: mz_storage::storage_state::Worker<A>::run_client
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-01110d61e963703c7-1/materialize/tests/src/storage/src/storage_state.rs:134:17
  23: mz_storage::storage_state::Worker<A>::run
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-01110d61e963703c7-1/materialize/tests/src/storage/src/storage_state.rs:110:33
  24: mz_storage::protocol::server::serve::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-01110d61e963703c7-1/materialize/tests/src/storage/src/protocol/server.rs:94:9
  25: timely::execute::execute::{{closure}}
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/timely/src/execute.rs:287:22
  26: timely_communication::initialize::initialize_from::{{closure}}
             at ./cargo/git/checkouts/timely-dataflow-4c0cc365061cd263/9548bf9/communication/src/initialize.rs:316:33
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

lluki avatar Oct 19 '22 09:10 lluki

Ha! This bug is different from the originally reported one. We changed how far we allow compaction in v0.28.0, and my gut feeling is that on restart we pick a "new since" that is slightly further in the past when just starting up. As long as the assertion as_of >= since on creation is upheld, we should be golden though. @guswynn wdyt, could we just turn this into a no-op when new_since is less than since?
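
For concreteness, that suggestion would look roughly like the following (a sketch with assumed types, not the actual ReclockOperator::compact):

use timely::progress::Antichain;
use timely::PartialOrder;

struct ReclockState {
    since: Antichain<u64>,
}

impl ReclockState {
    // Today this site asserts PartialOrder::less_equal(&self.since, &new_since)
    // and panics on regression; the proposal is to ignore such requests instead.
    fn compact(&mut self, new_since: Antichain<u64>) {
        if !PartialOrder::less_equal(&self.since, &new_since) {
            // The requested since is behind what we already hold: treat as a no-op.
            return;
        }
        self.since = new_since;
    }
}

fn main() {
    let mut state = ReclockState { since: Antichain::from_elem(100u64) };
    state.compact(Antichain::from_elem(50)); // would previously have hit the assert; now ignored
    state.compact(Antichain::from_elem(200)); // normal forward compaction still applies
}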

aljoscha avatar Oct 19 '22 10:10 aljoscha

also, cc @petrosagg wdyt?

aljoscha avatar Oct 19 '22 10:10 aljoscha

I commented in the linked PR, but I don't think we should remove the assertion, as well-behaved code would never trigger it. I would leave the assertion in place until we understand what's going on.

petrosagg avatar Oct 19 '22 14:10 petrosagg