persist: blobs in S3 larger than expected
When investigating the Devex environment a few weeks ago, we found many blobs in S3 that were well beyond the expected 128MiB target: 10+ at around 300MiB, and a few dozen above 200MiB.
I haven't dug deep into this, but my guess is this is an artifact of how we calculate columnar part sizes here: https://github.com/MaterializeInc/materialize/blob/cb9e7cf31648a5f18fd5eab77546ea3d9cce1c9f/src/persist/src/indexed/columnar.rs#L446-L454
Since the 128MiB limit is applied separately to keys and values, it seems like we could blow past 128MiB if we insert a lot of updates with similarly sized keys and values. For example, if we had updates where the keys are 32 bytes, the values are 32 bytes, and 16 bytes in total are used for T and D, the keys and values would each hit 128MiB after about 4.2 million updates, by which point T and D add another 64MiB. We'd arrive at roughly 320MiB (330+MB) before the limit kicks in.
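To make the arithmetic concrete, here is a rough sketch of the estimate. It assumes (as guessed above, not verified against the code) that the limit is checked against key bytes and value bytes independently, that T and D together take 16 bytes per update, and that offsets and other per-part overhead can be ignored. The function name and parameters are made up for illustration.

```python
MIB = 1024 * 1024
LIMIT_BYTES = 128 * MIB  # per-column target, assumed to apply to keys and values separately


def estimated_part_size(key_len, val_len, td_len=16, limit=LIMIT_BYTES):
    """Estimate total part bytes at the point the per-column limit first trips.

    Hypothetical helper: ignores offsets and any other per-part overhead.
    """
    # The limit trips once the larger of the two columns reaches it.
    updates = limit // max(key_len, val_len)
    # Every update contributes its key, its value, and its T and D bytes.
    return updates * (key_len + val_len + td_len)


size = estimated_part_size(key_len=32, val_len=32)
print(f"{size / MIB:.0f} MiB, {size / 1e6:.1f} MB")  # prints "320 MiB, 335.5 MB"
```

If the guess is right, the worst case is when keys and values are the same size, since both columns fill up to the limit at the same time.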
For better or for worse, since our last breaking change, we don't have any outliers to examine on prod right now. The top 10 largest blobs can be found via:
aws s3api list-objects-v2 --bucket <prod bucket> --query "sort_by(Contents, &Size)[-10:]" --profile <aws profile>
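If outliers show up again, a small script along these lines could flag which of the listed blobs exceed the target. This is only a sketch: it assumes the command's JSON output (a list of objects with `Key` and `Size` fields) has been saved to a string, and the helper name and the sample keys are made up.

```python
import json

TARGET_BYTES = 128 * 1024 * 1024  # expected blob size target from this issue


def oversized_blobs(cli_json, target=TARGET_BYTES):
    """Return (key, size) pairs for blobs above target, largest first.

    Hypothetical helper: cli_json is the JSON text printed by the
    list-objects-v2 command with the sort_by query.
    """
    contents = json.loads(cli_json) or []  # tolerate a null result
    big = [(obj["Key"], obj["Size"]) for obj in contents if obj["Size"] > target]
    return sorted(big, key=lambda kv: kv[1], reverse=True)


# Made-up sample data in the shape the CLI returns.
sample = json.dumps([
    {"Key": "blob-a", "Size": 100 * 1024 * 1024},
    {"Key": "blob-b", "Size": 300 * 1024 * 1024},
    {"Key": "blob-c", "Size": 200 * 1024 * 1024},
])
for key, size in oversized_blobs(sample):
    print(key, size // (1024 * 1024), "MiB")
```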