
Document compaction

philrz opened this issue 3 years ago

The following text was present in a retired "lake design" document (see #3803). It has been established that this was really a pending to-do, so this issue tracks its ultimate implementation and corresponding updates to docs.

### Compaction

To perform an LSM rollup, the `compact` command (implementation tracked
via [zed/2977](https://github.com/brimdata/zed/issues/2977))
acts like a "squash" but performs an LSM-like compaction, e.g.,

```
zed compact <id> [<id> ...]
(merged commit <id> printed to stdout)
```

Here, the objects from the given commit IDs are read and compacted into
a new commit; after compaction, all of the objects comprising the new
commit are sorted and non-overlapping.  Again, until the data is actually
committed, no readers will see any change.
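The effect of compaction on the underlying objects can be sketched in Python (the key values below are hypothetical; the core operation is just a k-way merge of sorted runs):

```python
import heapq

# Three hypothetical data objects, each a sorted run over the pool key.
# Their key ranges overlap, so a scan must merge all three on every read.
obj_a = [1, 4, 9]
obj_b = [2, 4, 7]
obj_c = [3, 8]

# Compaction reads the overlapping objects and writes a new set of
# sorted, non-overlapping objects; readers see no change until the
# resulting commit lands.
compacted = list(heapq.merge(obj_a, obj_b, obj_c))

print(compacted)  # [1, 2, 3, 4, 4, 7, 8, 9]
```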

Unlike other systems based on LSM, the rollups here are envisioned to be
run by orchestration agents operating on the Zed lake API.  Using
meta-queries, an agent can introspect the layout of data, perform
some computational geometry, and decide how and what to compact.
The nature of this orchestration is highly workload dependent so we plan
to develop a family of data-management orchestration agents optimized
for various use cases (e.g., continuously ingested logs vs. collections of
metrics that should be optimized with columnar form vs. slowly-changing
dimensional datasets like threat intel tables).
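As a rough sketch of what such an agent might do (the object IDs and key ranges below are hypothetical, not actual meta-query output), the "computational geometry" step can be as simple as grouping objects whose key ranges overlap and nominating each multi-object group for compaction:

```python
# Hypothetical (object_id, min_key, max_key) tuples, as an agent might
# derive from a meta-query over the pool's object layout.
objects = [
    ("obj-1", 0, 50),
    ("obj-2", 40, 90),    # overlaps obj-1
    ("obj-3", 100, 150),  # disjoint from the first group
    ("obj-4", 140, 200),  # overlaps obj-3
    ("obj-5", 300, 400),  # disjoint singleton: no compaction needed
]

def overlap_groups(objs):
    """Group objects whose key ranges overlap; each multi-object
    group is a candidate for a single compact invocation."""
    groups, current, current_max = [], [], None
    for oid, lo, hi in sorted(objs, key=lambda o: o[1]):
        if current and lo <= current_max:
            current.append(oid)
            current_max = max(current_max, hi)
        else:
            if current:
                groups.append(current)
            current, current_max = [oid], hi
    if current:
        groups.append(current)
    return groups

for group in overlap_groups(objects):
    if len(group) > 1:
        print("compact", " ".join(group))
# compact obj-1 obj-2
# compact obj-3 obj-4
```

A real agent would of course weigh object sizes, recency, and workload before issuing any compaction, but the key-range geometry is the starting point.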

An orchestration layer outside of the Zed lake is responsible for defining
policy for how data is ingested, committed, and rolled up.  Depending on the
use case and workflow, we envision that some amount of overlapping data objects
would persist at small scale and always be "mixed in" with other overlapping
data during any key-range scan.

> Note: since this style of data organization follows the LSM pattern,
> how data is rolled up (or not) can control the degree of LSM write
> amplification that occurs for a given workload.  There is an explicit
> tradeoff here between overhead of merging overlapping objects on read
> and LSM write amplification to organize the data to avoid such overlaps.
>
> Note: we are showing here manual, CLI-driven steps to accomplish these tasks
> but a live data pipeline would automate all of this with orchestration that
> performs these functions via a service API, i.e., the same service API
> used by the CLI operators.
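To make the tradeoff above concrete, here is a back-of-the-envelope sketch (the numbers are purely illustrative): write amplification is the total bytes physically written divided by the bytes logically ingested, so each additional rollup pass over the same data adds one full rewrite to the numerator.

```python
# Illustrative only: 10 GB ingested, then rolled up twice by compaction.
ingested_gb = 10
rollup_passes = 2

# Each rollup pass rewrites the data once, so total physical writes are
# the original ingest plus one full rewrite per pass.
total_written_gb = ingested_gb * (1 + rollup_passes)
write_amplification = total_written_gb / ingested_gb

print(write_amplification)  # 3.0
```

Fewer rollups mean lower write amplification but more overlapping objects to merge at read time; more rollups invert that balance.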

Note that the referenced issue #2977 has been completed, so this is at least partially implemented. However, the docs at https://zed.brimdata.io/docs/next/commands/zed#28-load still describe compaction as a future enhancement, and much of the functionality described in the text above is expected to be covered by the "lake manager" tracked in #3923. This issue can therefore track when we've completed enough to reveal compaction as a feature we want users to start using, and to make sure we've updated the docs with enough guidance that they'll be successful.

philrz Sep 27 '22 23:09