materialize icon indicating copy to clipboard operation
materialize copied to clipboard

testdrive: Try to make skipping consistency check be an error

Open def- opened this issue 1 year ago • 3 comments
trafficstars

@ParkMyCar I think something is wrong here, most testdrive-based tests I checked are spamming this warning (except the Testdrive test itself). Is it maybe necessary to have volumes_extra=["mzdata:/mzdata"], in every test? I'll try defaulting it to that.

Checklist

  • [ ] This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • [ ] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • [ ] If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • [ ] If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

def- avatar Oct 14 '24 07:10 def-

I'll try defaulting it to that.

It didn't work, so I'm out of ideas for why this isn't working. Do you have any idea?

def- avatar Oct 14 '24 12:10 def-

Interesting! Which warning is getting spammed, the "system_parameter_defaults" one or the "Catalog state on disk"?

For "system_parameter_defaults" it's possible the format of the Catalog's state dump has changed and this code needs to be updated. For the "Catalog state on disk" I'm going to guess the catalog_config option isn't getting set. Maybe we need to thread through some new CLI args or plumb existing ones to this config.

ParkMyCar avatar Oct 15 '24 16:10 ParkMyCar

@ParkMyCar It's the "Catalog state on disk". I'm now enabling the config to always exist, but that is causing a lot of strange looking timeouts: https://buildkite.com/materialize/test/builds/92560

def- avatar Oct 16 '24 22:10 def-

Rebased, new run: https://buildkite.com/materialize/test/builds/93532 Edit: Still seeing timeouts. There is a very interesting panic:

thread 'main' panicked at src/catalog/src/durable/persist.rs:1194:63:
kind decoding error: "missing field `extra_versions`"

I guess we'll need some help from the Persist team, I'll ask in their Slack channel.

def- avatar Oct 29 '24 09:10 def-

There is a very interesting panic:

thread 'main' panicked at src/catalog/src/durable/persist.rs:1194:63:
kind decoding error: "missing field `extra_versions`"

Ahhh this is interesting, it looks like the panic is coming from the Legacy Upgrade tests? I think what's happening here is we start an older version of Materialize, run some workflow, then before upgrading to a newer version of MZ we run the consistency checks. But because the consistency checks are run via the testdrive runner we end up using the latest version of the Catalog protobufs which fails because we only run the Catalog migrations if we're not in read-only.

We could either make the consistency checks version aware, and skip if the the current MZ version != the version of the testdrive runner, or just skip them for the legacy upgrade tests

ParkMyCar avatar Oct 29 '24 13:10 ParkMyCar

Ok, totally fine to skip them for the upgrade tests then. I'm more worried about the timeouts since I have no workaround for them.

def- avatar Oct 29 '24 14:10 def-