# Make issues from lake design notes
The text below was removed from the Zed lake specification as these are really design notes for pending items. We should create issues from these notes and update the docs as the pieces are completed.
## Continuous Ingest
While the description above is batch oriented, the Zed lake design is
intended to scale well for continuous streaming applications. In this
approach, many small commits may be executed continuously as data arrives,
and after each commit the data is immediately readable.
To handle this use case, the _journal_ of branch commits is designed
to scale to arbitrarily large footprints as described earlier.
## Derived Analytics
To improve the performance of predictable workloads, many Zed lake use
cases pre-compute _derived analytics_ or a particular set of _partial
aggregations_.
For example, the Brim app displays a histogram of event counts grouped by
a category over time. The partial aggregation for such a computation can be
configured to run automatically and store the result in a pool designed to
hold such results. Then, when a scan is run, the Zed analytics engine
recognizes when the DAG of a query can be rewritten to assemble the
partial results instead of deriving the answers from scratch.
When and how such partial aggregations are performed is simply a matter of
writing Zed queries that take the raw data and produce the derived analytics
while conforming to a naming model that allows the Zed lake to recognize
the relationship between the raw data and the derived data.
> TBD: Work out these details which are reminiscent of the analytics cache
> developed in our earlier prototype.
## Keyless Data
This is TBD. Data without a key should be accepted in one way or another.
One approach is simply to assign the "zero-value" as the pool key; another
is to use a configured default value. Either choice would complicate
key-based retention policies.
Another approach would be to create a sub-pool on demand when the first
keyless data is encountered, e.g., `pool-name.$nokey`, where the pool key
is configured to be "this". An app or user could then scan keyless data
by querying the pool under this name.
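To make the sub-pool idea concrete, here is a minimal ingest-time routing sketch in Python (not Zed code; `route_record` and the handling of the `.$nokey` suffix are hypothetical illustrations of the naming scheme above):

```python
# Sketch (hypothetical): route each incoming record either to the main
# pool or to an on-demand keyless sub-pool named `<pool>.$nokey`.
# Records are modeled as plain dicts and the pool key as a field name.

NOKEY_SUFFIX = ".$nokey"

def route_record(pool_name: str, pool_key: str, record: dict) -> str:
    """Return the name of the pool this record belongs in."""
    if record.get(pool_key) is not None:
        return pool_name
    # The first keyless record would trigger creation of the sub-pool,
    # whose pool key is configured to be "this" (the whole value).
    return pool_name + NOKEY_SUFFIX

print(route_record("logs", "ts", {"ts": 1}))      # -> logs
print(route_record("logs", "ts", {"msg": "hi"}))  # -> logs.$nokey
```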
## Relational Model
Since a Zed lake can provide strong consistency, workflows that manipulate
data in a lake can utilize a model where updates are made to the data
in place. Such updates involve creating new commits from the old data,
where the new data is a modified form of the old data. This emulates
row-based updates and deletes.
If the pool key is chosen to be "this" for such a use case, then unique
rows can be maintained trivially: any duplicate rows are adjacent when
sorted by "this", so they are easily detected and dropped.
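Because duplicates are adjacent in "this" order, deduplication needs only a one-row lookback. A minimal Python sketch of the idea (not Zed code):

```python
def dedup_sorted(rows):
    """Drop duplicates from a stream already sorted by the whole value
    ("this"): every duplicate is adjacent to its twin, so comparing each
    row to its predecessor suffices."""
    prev = object()  # sentinel that compares unequal to any row
    for row in rows:
        if row != prev:
            yield row
        prev = row

print(list(dedup_sorted([1, 1, 2, 3, 3, 3])))  # -> [1, 2, 3]
```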
Efficient upserts can be accomplished because each data object is sorted by the
pool key. Thus, an upsert can be sorted and then merge-joined with each
overlapping object. Where data objects produce changes and additions, they can
be forwarded to a streaming add operator and the list of modified objects
accumulated. At the end of the operation, the new commit(s) along with
the to-be-deleted objects can be added to the journal in a single atomic
operation. A write conflict occurs if any other deletes have been added to
the list of to-be-deleted objects; when this occurs, the transaction can
simply be restarted. To avoid the inefficiency of many restarts, an upsert can
be partitioned into smaller objects if the use case allows for it.
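The sort-then-merge-join step can be sketched by modeling a data object and an upsert batch as lists of (key, value) rows sorted by key. This is a Python illustration under those assumptions, not the actual implementation; a real version would stream rather than accumulate a dict in memory:

```python
import heapq

def upsert_merge(obj_rows, upsert_rows):
    """Merge-join a key-sorted data object with a key-sorted upsert batch.
    An upsert with a matching key replaces the existing row; otherwise it
    is inserted in key order. Returns the rows of the replacement object,
    which would go into a new commit while the old object is added to the
    to-be-deleted list."""
    merged = {}
    # heapq.merge is stable, so for equal keys the object's row arrives
    # first and the upsert's row overwrites it below.
    for key, value in heapq.merge(obj_rows, upsert_rows, key=lambda r: r[0]):
        merged[key] = value
    return sorted(merged.items())

obj = [(1, "a"), (3, "c"), (5, "e")]
ups = [(2, "b"), (3, "C")]
print(upsert_merge(obj, ups))  # -> [(1, 'a'), (2, 'b'), (3, 'C'), (5, 'e')]
```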
> TBD: Finish working through this use case, its requirements, and the
> mechanisms needed to implement it. Write conflicts will need to be
> managed at a layer above the journal or the journal extended with the
> needed functionality.
## Type Rule
A type rule indicates that all values of any field of a specified type
be indexed where the type signature uses Zed type syntax.
For example,
```
zed index create IndexGroupEx type ip
```
creates a rule that indexes all IP addresses appearing in fields of type `ip`
in the index group `IndexGroupEx`.
## Aggregation Rule
An aggregation rule allows the creation of any index keyed by one or more fields
(primary, secondary, etc.), typically the result of an aggregation.
The aggregation is specified as a Zed query.
For example,
```
zed index create IndexGroupEx agg "count() by field"
```
creates a rule that creates an index keyed by the group-by keys whose
values are the partial-result aggregation given by the Zed expression.
> This is not yet implemented. The query planner would replace any full object
> scan that requires the aggregation with the result stored in the index.
> Where a filter is applied to match one row of the index, that result could
> likewise be extracted instead of scanning the entire object.
> This capability is not generally useful for interactive search and analytics
> (except for optimizations that suit the interactive app) but rather is a powerful
> capability for application-specific workflows that know the pre-computed
> aggregations that they will use ahead of time, e.g., beacon analysis
> of network security logs.
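The "assemble the partial results" idea can be pictured with `count() by` partials: per-object group counts combine by summation, so a query can be answered from the index instead of rescanning raw data. A hedged Python sketch (not the Zed engine):

```python
from collections import Counter

def combine_count_partials(partials):
    """Combine per-object partial results of a `count() by key` query by
    summing counts per group, rather than rescanning the raw data."""
    total = Counter()
    for partial in partials:  # each partial maps group key -> count
        total.update(partial)
    return dict(total)

p1 = {"conn": 10, "dns": 4}  # partial result from object 1
p2 = {"conn": 7, "http": 2}  # partial result from object 2
print(combine_count_partials([p1, p2]))  # -> {'conn': 17, 'dns': 4, 'http': 2}
```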
## Vacuum Support
While data objects currently can be deleted from a lake, the underlying data
is retained to support time travel.
The system must also support purging of old data so that retention policies
can be implemented.
This could be supported with the DANGER-ZONE command `zed vacuum`
(implementation tracked in [zed/2545](https://github.com/brimdata/zed/issues/2545)).
The commits still appear in the log, but scans at any time-travel point
where the commit is present will fail to scan the deleted data.
In this case, perhaps we should emit a structured Zed error describing
the metadata of the object that was unavailable.
Alternatively, old data can be removed from the system using a safer
command (but still in the DANGER-ZONE), `zed vacate` (also
[zed/2545](https://github.com/brimdata/zed/issues/2545)) which moves
the tail of the commit journal forward and removes any data no longer
accessible through the modified commit journal.
### Scaling a Journal
When a journal's snapshot files exceed a certain size
(and thus become too large to conveniently handle in memory),
the snapshots can be converted to and stored
in an internal sub-pool called the "snapshot pool". The snapshot pool's
pool key is the "from" value (of its parent pool key) from each commit action.
In this case, commits to the parent pool are made in the same fashion,
but instead of snapshotting updates into a snapshot ZNG file,
the snapshots are committed to the snapshot sub-pool. In this way, commit histories
can be rolled up and organized by the pool key. Likewise, retention policies
based on the pool key can remove not just data objects from the main pool but
also data objects in the snapshot pool comprising committed data that falls
outside of the retention boundary.
> Note we currently record a delete using only the object ID. In order to
> organize add and delete actions around key spans, we need to add the span
> metadata to the delete action just as it exists in the add action.
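The need for span metadata on deletes can be seen in a span-based retention filter: without a key span, a delete action cannot be classified against the retention boundary. A Python sketch with hypothetical structures (not the actual journal format):

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # "add" or "delete"
    object_id: str
    span: tuple      # (from_key, to_key) covered by the object, inclusive

def retain(actions, boundary):
    """Keep only journal actions whose key span reaches the retention
    boundary; actions entirely below it can be dropped. This is why
    delete actions need span metadata just as add actions have it."""
    return [a for a in actions if a.span[1] >= boundary]

log = [Action("add", "A", (0, 50)), Action("delete", "B", (60, 90))]
print([a.object_id for a in retain(log, 55)])  # -> ['B']
```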
### Compaction
To perform an LSM rollup, the `compact` command (implementation tracked
via [zed/2977](https://github.com/brimdata/zed/issues/2977))
works like a "squash" that performs an LSM-like compaction, e.g.,
```
zed compact <id> [<id> ...]
```
(the merged commit <id> is printed to stdout)
After compaction, all of the objects comprising the new commit are sorted
and non-overlapping.
Here, the objects from the given commit IDs are read and compacted into
a new commit. Again, until the data is actually committed,
no readers will see any change.
Unlike other systems based on LSM, the rollups here are envisioned to be
run by orchestration agents operating on the Zed lake API. Using
meta-queries, an agent can introspect the layout of data, perform
some computational geometry, and decide how and what to compact.
The nature of this orchestration is highly workload dependent so we plan
to develop a family of data-management orchestration agents optimized
for various use cases (e.g., continuously ingested logs vs. collections of
metrics that should be optimized with columnar form vs. slowly-changing
dimensional datasets like threat intel tables).
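The "computational geometry" an agent might perform can be as simple as grouping data objects whose pool-key spans overlap; any group with two or more objects is a candidate for `zed compact`. A Python sketch under the assumption that meta-queries yield objects as `(id, lo, hi)` span tuples (hypothetical shape, not the actual API):

```python
def overlapping_groups(objects):
    """Group data objects whose key spans overlap; each group of two or
    more objects is a compaction candidate. Objects are (id, lo, hi)
    tuples with inclusive spans."""
    groups, current, current_hi = [], [], None
    for obj in sorted(objects, key=lambda o: o[1]):
        if current and obj[1] <= current_hi:
            current.append(obj)                 # extends the current overlap
            current_hi = max(current_hi, obj[2])
        else:
            if current:
                groups.append(current)
            current, current_hi = [obj], obj[2]  # start a new group
    if current:
        groups.append(current)
    return groups

objs = [("a", 0, 10), ("b", 5, 15), ("c", 20, 30)]
print(overlapping_groups(objs))
# -> [[('a', 0, 10), ('b', 5, 15)], [('c', 20, 30)]]
```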
An orchestration layer outside of the Zed lake is responsible for defining
policy over how data is ingested, committed, and rolled up. Depending on the
use case and workflow, we envision that some amount of overlapping data objects
would persist at small scale and always be "mixed in" with other overlapping
data during any key-range scan.
> Note: since this style of data organization follows the LSM pattern,
> how data is rolled up (or not) can control the degree of LSM write
> amplification that occurs for a given workload. There is an explicit
> tradeoff here between overhead of merging overlapping objects on read
> and LSM write amplification to organize the data to avoid such overlaps.
>
> Note: we are showing here manual, CLI-driven steps to accomplish these tasks
> but a live data pipeline would automate all of this with orchestration that
> performs these functions via a service API, i.e., the same service API
> used by the CLI operators.