persist: thousands of blobs in environments left over the weekend
These were running v0.27.0-alpha.3 (I think), and so should have included the gc work that merged on Monday of last week.
Some initial questions:
- Can we confirm what version they were running? The version used to be exposed via a Prometheus `mz_server_metadata_seconds` counter, but that metric seems to be missing from the platform binaries.
- What even is the expectation here? I could imagine an individual shard having hundreds of blobs.
- There are 22 system tables, each with a shard. The system tables are pretty much just written once at startup, after which the traffic is just empty `compare_and_append`s, so hopefully we manage to compact those down to a small number of blobs.
- I wonder how many shards each environment has. I also wonder how large each environment's state is (and perhaps how many batches).
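To make the system-table write pattern from the list above concrete (one real write at startup, then a steady stream of empty `compare_and_append`s that only advance the upper), here is a toy sketch. The `Shard` class and method names are hypothetical, not persist's actual API:

```python
class Shard:
    """Toy model of a persist shard: an upper frontier plus appended batches.
    Hypothetical sketch, not the real persist API."""

    def __init__(self):
        self.upper = 0
        self.batches = []  # (lower, upper, updates)

    def compare_and_append(self, updates, expected_upper, new_upper):
        # Succeeds only if the caller's view of the upper is current.
        if expected_upper != self.upper:
            return False
        self.batches.append((expected_upper, new_upper, updates))
        self.upper = new_upper
        return True


shard = Shard()
# One real write at startup...
shard.compare_and_append([("row", 0, 1)], expected_upper=0, new_upper=1)
# ...then steady traffic of empty appends that only advance the upper.
for t in range(1, 10):
    shard.compare_and_append([], expected_upper=t, new_upper=t + 1)
# After this, only 1 of the 10 batches carries data; compaction should be
# able to merge the empty ones away, leaving a small number of blobs.
```

The hope expressed above is exactly that the nine empty batches here compact down to nearly nothing.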
More questions:
- What is the `since` of each shard?
- What are the seqno and seqno capability of each shard (and, more interestingly, when was each of those created)?
cc @aljoscha @pH14 nothing to do here for y'all, just an FYI on my debugging
Nikhil mentioned that this issue seemed to occur both for prod and for staging. I can confirm via staging Prometheus metrics that gc is happening and that blob deletion calls are returning as successful (deleting is idempotent, so they'd still report success even if they were no-ops). The metric for how many blob bytes are deleted is currently a TODO.
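For context on why "successful" deletes don't prove anything here: blob deletion is idempotent, so a successful delete call can't distinguish a real removal from a no-op. A minimal sketch of that behavior (hypothetical helper, not persist's actual blob API):

```python
def blob_delete(store: dict, key: str) -> bool:
    """Idempotent delete: report success whether or not the key existed.
    Hypothetical sketch of the behavior described above."""
    store.pop(key, None)  # no KeyError if the key is already gone
    return True


store = {"blob-1": b"data"}
assert blob_delete(store, "blob-1")  # real deletion
assert blob_delete(store, "blob-1")  # no-op, but still "successful"
```

This is why the count of successful delete calls alone can't confirm blobs are being reclaimed; a deleted-bytes metric (the TODO mentioned above) would.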
Oh, that all started very recently. If I go back 2 days, then of the 3 clusters I see in metrics, only one seems to have existed over the weekend (the other two look to have started within the last two hours). That one seems to have suddenly started gc'ing today. I wish I knew for sure what version it was running over the weekend.
Nikhil and I poked around at the devex cluster (01d730c7-*-0) and saw at least one shard that seemed to have a bunch of equally sized blobs, all written on the 21st (maybe the shard got deleted? cleanup on shard deletion has yet to be implemented). We also found a shard where the no-gc problem seems to have persisted for multiple days.
I just got prod metrics access. A quick scan shows some envs in prod are gc'ing just fine. The devex cluster seems to not be gc'ing at all. (It also has a number of other problems persist needs to look into.) My first guess is that I messed up the backward-compatibility migration for the last gc PR (#13656). There is a blip right at the very beginning of the metrics where we seem to attempt one gc and then never again (was this the beginning of the cluster, or just when the change landed that hooked Prometheus up to persist metrics?)
Okay, I think I fixed the other problems (maybe via #13735). The devex cluster is still not gc'ing, though.
Possibly some progress! I started up a new env on staging and made a PubNub source and a simple materialized view. environmentd appears to be gc'ing, but storaged and computed aren't: they are "skipping" every gc request. I suspect this means some seqno capability is not being downgraded. Should be easy to see what's happening with some local printlns tomorrow. (Sudden idea: I wonder if these are the remap shards?)
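A guess at the skip mechanics (hypothetical names and logic, not the actual persist code): gc can only truncate state versions below the minimum outstanding seqno capability, so if some reader in storaged/computed never downgrades its capability, every gc request is a no-op and gets skipped:

```python
def gc_would_skip(req_truncate_to: int, outstanding_caps: list,
                  already_truncated: int) -> bool:
    """Return True if a gc request has nothing to do.
    Hypothetical sketch of the skip condition described above."""
    # Can't truncate past the earliest seqno any reader might still need.
    truncate_to = min([req_truncate_to, *outstanding_caps])
    return truncate_to <= already_truncated


# A reader stuck holding seqno 5 forces every later request to be skipped...
assert gc_would_skip(req_truncate_to=100, outstanding_caps=[5], already_truncated=5)
# ...but once the capability is downgraded, gc has work to do again.
assert not gc_would_skip(req_truncate_to=100, outstanding_caps=[90], already_truncated=5)
```

Under this model, a process that holds a seqno capability and never downgrades it would produce exactly the "skipping every gc request" symptom seen in storaged and computed.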
New metric ideas (assuming it's reasonable to have a few select ones be per-shard):
- upper
- since
- timestamp of latest state? (maybe wouldn't pull its weight)
- timestamp of earliest state with some capability on it
- size of serialized state
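As a rough sketch of what a few select per-shard gauges like the ones listed above could look like in the Prometheus text exposition format (the metric names here are made up, not actual mz metrics):

```python
def render_shard_gauges(shards: dict) -> str:
    """Render hypothetical per-shard gauges in Prometheus text exposition
    format, one `name{shard="..."} value` line per gauge."""
    lines = []
    for shard_id, gauges in sorted(shards.items()):
        for name, value in sorted(gauges.items()):
            lines.append(f'{name}{{shard="{shard_id}"}} {value}')
    return "\n".join(lines)


sample = {
    "s1234": {
        "mz_persist_shard_upper": 42,
        "mz_persist_shard_since": 40,
        "mz_persist_shard_state_bytes": 1024,
    }
}
print(render_shard_gauges(sample))
```

The caveat in the parenthetical above is the usual one for per-shard labels: each labeled gauge multiplies series cardinality by the number of shards, which is why only a few select metrics should get the `shard` label.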
Also partially fixed by #13953; we think there's still at least one more issue to work through.
https://github.com/MaterializeInc/materialize/pull/14945 adds some extra metric coverage here, but it can be misleading, since we haven't yet solved freeing blobs after a source is dropped (https://github.com/MaterializeInc/materialize/issues/8185) or periodic cleanup of unreferenced blobs. I think we'll want to tackle both of those items before we can rely on the metric to feel confident we aren't leaking blobs unnecessarily.
I've been looking into this, and it's a little tricky to infer blob cleanup behavior because we have known leaks (a source/table being dropped, blobs written but the process terminating before linking them in), but this issue was filed when the leaks were egregious and very obviously wrong.
We can approximate the number of untracked blobs by subtracting the sum total of tracked batch parts from our blob audit count. With our existing known leaks, we'd expect to see step-function-like increases (from dropped sources/tables plus large snapshots written without being linked in) and a slow, gradual increase (small batches written to blob without being linked into state, plus long seqno holds). We would not expect to see a steady, rapid increase over time as before.
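The approximation described above is simple arithmetic; as a sketch (the function name is hypothetical), assuming the blob audit counts everything present in blob storage while shard state references only the tracked batch parts:

```python
def approx_untracked_blobs(blob_audit_count: int, tracked_batch_parts: int) -> int:
    """Estimate untracked blobs: everything the audit sees in blob storage,
    minus the batch parts referenced by some shard's state.
    Hypothetical sketch of the approximation described above."""
    return blob_audit_count - tracked_batch_parts


# e.g. an audit that sees 1_000 blobs while shard states reference 950 parts
# suggests ~50 untracked blobs (dropped sources, unlinked writes, seqno holds).
assert approx_untracked_blobs(1_000, 950) == 50
```

Watching this quantity over time is what distinguishes the expected shapes (step functions and slow drift from the known leaks) from an unknown leak (steady, rapid growth).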
Having looked at many environments, I'm pretty confident we don't have any more (egregious) blob leaks. In a sampling of environments, we see step-function-like movement in one env and relatively flat increases across the others. We additionally have several low-activity environments with close to zero (approximated) blob leaks, which I don't believe was the case before.
I'm comfortable closing this one out for now: at the very least, there's no unknown blob leak we need to focus on beyond what we're already aware of.