WIP: suss out problems arising from storaged parallelism

Open aljoscha opened this issue 2 years ago • 4 comments

For running CI.

aljoscha avatar Sep 20 '22 13:09 aljoscha

@philip-stoev The failures where we have too many true results come from the fact that mz_materializations (which these queries use) has an entry per worker. For example, on a 4-worker cluster you will get:

materialize=> select * from mz_materializations;
 global_id | worker 
-----------+--------
 u2        |      0
 u2        |      1
 u2        |      2
 u2        |      3
(4 rows)
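
To make the over-counting concrete, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for mz_materializations. The two-column schema is an assumption taken from the output above, and SQLite is only a substitute for a real cluster. It shows why a query that expects one row per object sees four, and how counting distinct ids collapses the per-worker rows.

```python
import sqlite3

# Toy stand-in for mz_materializations on a 4-worker cluster.
# Assumption: only the two columns shown in the psql output above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mz_materializations (global_id TEXT, worker INTEGER)")
conn.executemany(
    "INSERT INTO mz_materializations VALUES (?, ?)",
    [("u2", worker) for worker in range(4)],
)

# A naive count sees one row per worker.
naive = conn.execute(
    "SELECT count(*) FROM mz_materializations WHERE global_id = 'u2'"
).fetchone()[0]

# Counting distinct ids collapses the per-worker rows into one.
distinct = conn.execute(
    "SELECT count(DISTINCT global_id) FROM mz_materializations "
    "WHERE global_id = 'u2'"
).fetchone()[0]

print(naive, distinct)
```

Whether the test queries should deduplicate like this or filter on a single worker is a judgment call; the sketch only illustrates the row multiplication.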

There is an actual bug in upsert that I have a fix for. There is another bug in Debezium that I didn't yet fix.

And I think the "Cluster smoke test" might be failing because of this known bug/flake: https://github.com/MaterializeInc/materialize/issues/14533. But I'm not yet 100% sure.

aljoscha avatar Sep 21 '22 14:09 aljoscha

Yes, I have a fix for the mz_materializations failures, so please disregard those for the time being.

philip-stoev avatar Sep 21 '22 14:09 philip-stoev

When ready, please push only your fix and none of the changes needed to get the --workers 4 test running. I have a separate branch that achieves that, but in a different way.

philip-stoev avatar Sep 21 '22 14:09 philip-stoev

I pushed one fix in https://github.com/MaterializeInc/materialize/pull/14917. I'm afraid the tests that use ENVELOPE DEBEZIUM (without UPSERT) are harder to fix. I think our current approach for ENVELOPE DEBEZIUM doesn't work with multiple workers because we don't maintain the order of the messages that we read from Kafka. There are some decoding/exchange steps in between, but our logic somewhat relies on the order being preserved. I don't want to spend much more time on this because we don't offer DEBEZIUM (without UPSERT) to users yet.
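
The ordering problem can be illustrated with a toy model. This is not the actual storaged dataflow or the real Debezium decoding logic: `exchange` and `last_value` are hypothetical helpers, the round-robin distribution is an assumption, and sorting by offset is shown only as one conceivable remedy, not the fix discussed here.

```python
def exchange(events, workers):
    """Round-robin events across workers, then concatenate per-worker batches.

    Hypothetical stand-in for a decode/exchange step between the Kafka
    reader and the envelope logic.
    """
    batches = [events[i::workers] for i in range(workers)]
    return [event for batch in batches for event in batch]


def last_value(events):
    """Order-sensitive logic: the last event seen wins."""
    state = None
    for _offset, value in events:
        state = value
    return state


# (offset, value) pairs for a single key, in Kafka offset order.
events = [(0, "a"), (1, "b"), (2, "c")]

single = last_value(exchange(events, 1))            # order preserved
multi = last_value(exchange(events, 2))             # arrives as a, c, b
reordered = last_value(sorted(exchange(events, 2)))  # re-sorted by offset

print(single, multi, reordered)
```

With one worker the final state matches the latest offset; with two workers the interleaving makes an older update win, which is the kind of failure an order-dependent envelope would show.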

aljoscha avatar Sep 21 '22 18:09 aljoscha