lake manager for time-based log workloads

Open mccanne opened this issue 3 years ago • 0 comments

The Zed lake has been designed so that different workloads can manage lake data structures in different ways. For example, when and how data is compacted, converted to columnar form, indexed, etc., can all be driven by external entities via the API.

As an example, to create search indexes today, index rules must be defined and the indexing of data objects must be run manually with the zed index command.

In this task, we will implement an initial agent for managing a lake for workloads consisting of time-based event logs that are to be searched.

Here, a pool can be managed with a zed manage process. This will create a process that runs continuously and watches for commits on one or more pools in order to schedule activities like compaction and indexing. The manager process will be configured with a YAML file that sets the policies for compaction and the index rules.

For example, when deciding whether to compact data, the manager would wait until recently arrived data "settles down" before trying to compact any overlapping data.
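One way to picture the "settles down" policy is the rough sketch below. It is not the zed manage implementation: the DataObject fields, the settle_seconds parameter, and both function names are hypothetical, chosen only to illustrate the idea. It groups data objects whose time spans overlap and offers a group for compaction only once every object in it has gone without a new commit for the settle window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataObject:
    # All fields are hypothetical; timestamps are plain seconds for simplicity.
    id: str
    first: int      # earliest event timestamp in the object
    last: int       # latest event timestamp in the object
    committed: int  # when the object was committed to the pool


def overlapping_runs(objects):
    """Group objects whose [first, last] spans overlap, directly or transitively."""
    runs, current, run_end = [], [], None
    for obj in sorted(objects, key=lambda o: o.first):
        if current and obj.first <= run_end:
            # Span starts inside the current run, so it belongs to that run.
            current.append(obj)
            run_end = max(run_end, obj.last)
        else:
            if current:
                runs.append(current)
            current, run_end = [obj], obj.last
    if current:
        runs.append(current)
    return runs


def compaction_candidates(objects, now, settle_seconds):
    """Return runs of two or more overlapping objects that have all settled."""
    return [
        run
        for run in overlapping_runs(objects)
        if len(run) > 1
        and all(now - o.committed >= settle_seconds for o in run)
    ]
```

With this shape, a late arrival that overlaps an otherwise quiet run holds the whole run back until the late object has itself aged past the settle window, which avoids compacting the same span twice in quick succession.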

Also, when a compaction event occurs, the assumption for this workload is that the data is unlikely to see more overlapping arrivals in the future, so this would also be a good time to schedule indexing, columnar conversion, and construction of metadata for the DAG optimizer.
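As a minimal sketch of that idea, assuming a simple in-memory work queue, a compaction event could fan out into the follow-up tasks. The task names, the queue, and the on_compaction function are all hypothetical and are not part of any existing Zed API.

```python
from collections import deque

# Hypothetical task names for the follow-up work described above.
FOLLOW_UPS = ("index", "columnar", "optimizer_metadata")


def on_compaction(queue, compacted_object_id, follow_ups=FOLLOW_UPS):
    """Enqueue one follow-up task per kind for a freshly compacted object."""
    for task in follow_ups:
        queue.append((task, compacted_object_id))
    return queue


# Example: a single compaction event schedules three follow-up tasks.
pending = on_compaction(deque(), "compacted-object-1")
```

Tying the follow-ups to the compaction event, rather than to every commit, means the more expensive derived structures are built once per settled object instead of being rebuilt as overlapping data keeps arriving.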

May 29 '22 22:05 mccanne