
[Epic] Region-allocated data on dataflow edges

Open uce opened this issue 3 years ago • 0 comments

Goals

Exchange region-allocated data on dataflow edges.

At the moment, most dataflow edges in Materialize transfer collections of heap-allocated rows. Each row has limited inline space and spills to the heap once it is asked to store more than its inline capacity. This is a performance trade-off: it simplifies data representation (one can use Vec to carry many rows around), but it increases chatter with the allocator.
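
To make the inline-versus-heap behavior concrete, here is a minimal sketch. It is not Materialize's Row type: `ToyRow`, `INLINE_CAPACITY`, and `spilled_rows` are hypothetical names, and the capacity of 16 bytes is made up for illustration.

```rust
/// Hypothetical inline capacity; the real limit is not stated in this issue.
const INLINE_CAPACITY: usize = 16;

/// Toy row: short payloads stay inline, longer ones spill to the heap.
#[derive(Debug)]
enum ToyRow {
    Inline { len: usize, bytes: [u8; INLINE_CAPACITY] },
    Heap(Vec<u8>),
}

impl ToyRow {
    /// Copies `data` inline if it fits, otherwise into a fresh heap allocation.
    fn from_slice(data: &[u8]) -> Self {
        if data.len() <= INLINE_CAPACITY {
            let mut bytes = [0u8; INLINE_CAPACITY];
            bytes[..data.len()].copy_from_slice(data);
            ToyRow::Inline { len: data.len(), bytes }
        } else {
            ToyRow::Heap(data.to_vec())
        }
    }

    /// Returns the row's payload regardless of where it is stored.
    fn as_slice(&self) -> &[u8] {
        match self {
            ToyRow::Inline { len, bytes } => &bytes[..*len],
            ToyRow::Heap(vec) => vec.as_slice(),
        }
    }

    /// True if this row owns its own heap allocation.
    fn is_spilled(&self) -> bool {
        matches!(self, ToyRow::Heap(_))
    }
}

/// Counts how many rows in a batch own a separate heap allocation.
fn spilled_rows(batch: &[ToyRow]) -> usize {
    batch.iter().filter(|row| row.is_spilled()).count()
}
```

In this sketch, a `Vec<ToyRow>` stays easy to pass around, but every spilled row is one more allocation that the allocator has to track and eventually free.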

This causes at least two problems. The first is that we're transferring allocations between threads. We're using jemalloc, which tries to localize allocations per thread to avoid synchronization. Moving allocations between threads eventually requires synchronization to consolidate them, at a cost roughly proportional to the number of allocations. The second is that vectors offer good sequential iteration performance over their own contents, but this doesn't carry over to dereferencing the pointers to heap-allocated rows. Ensuring that the row data itself is sequential in memory can yield faster iteration, although we can't say how much of this will be visible in Materialize.
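
As a rough illustration of what region allocation buys here, the sketch below stores all row bytes in one buffer and addresses rows by offset. `FlatRows` is a hypothetical type written for this issue; it is not the flatcontainer or columnation API.

```rust
/// Hypothetical region: all row bytes live in one buffer, addressed by offsets.
struct FlatRows {
    /// Concatenated payloads of all rows.
    data: Vec<u8>,
    /// Row `i` occupies `data[bounds[i]..bounds[i + 1]]`; always starts with 0.
    bounds: Vec<usize>,
}

impl FlatRows {
    fn new() -> Self {
        FlatRows { data: Vec::new(), bounds: vec![0] }
    }

    /// Appends a row by copying its bytes into the shared buffer.
    fn push(&mut self, row: &[u8]) {
        self.data.extend_from_slice(row);
        self.bounds.push(self.data.len());
    }

    /// Number of rows stored.
    fn len(&self) -> usize {
        self.bounds.len() - 1
    }

    /// Borrows row `index` from the shared buffer; panics if out of range.
    fn get(&self, index: usize) -> &[u8] {
        &self.data[self.bounds[index]..self.bounds[index + 1]]
    }

    /// Iterates rows in insertion order, walking `data` front to back.
    fn iter(&self) -> impl Iterator<Item = &[u8]> + '_ {
        self.bounds.windows(2).map(move |w| &self.data[w[0]..w[1]])
    }
}
```

Under these assumptions, sending a `FlatRows` to another thread moves two allocations no matter how many rows it holds, and iteration reads row bytes in address order instead of chasing one pointer per row.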

A third concern is the current implementation of region allocation using columnation. The library is inherently unsafe to use, and we've repeatedly run into issues due to incorrect use, causing memory leaks and corruption.

Proposed solution

We're solving the problem from the ground up by introducing a new allocator abstraction and plumbing it through the layers of the dataflow system. Specifically, the work consists of (at least) the following tasks:

Tasks

  • [x] Implement abstract behavior for region-allocated data. Currently captured as a separate crate in https://github.com/antiguru/flatcontainer.
  • [x] Allow passing containers through Timely. Many operators have a Core variant that is generic over the container type. Suitable operators and builders expose a container builder parameter that lets the user supply their desired container builder.
  • [ ] Differential end-to-end container support.
  • [x] .. Arrangements. Can absorb non-vector containers, maintain them in a merge batcher and feed into batches.
  • [x] .. Joins. Deferring construction of output streams to a container builder.
  • [ ] .. Half joins
  • [ ] Materialize to support flat containers
  • [x] .. in logging. Reachability logging uses flat containers on dataflow edges.
  • [ ] .. region implementation for Row
  • [ ] .. rendering
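
The container builder idea from the Timely task above can be sketched as follows. This is a simplified illustration, not Timely's actual trait: `SketchBuilder`, `VecBuilder`, `CHUNK`, and `build_all` are hypothetical names, and the chunk size of 2 is arbitrary.

```rust
/// Hypothetical, simplified builder trait; not Timely's actual API.
trait SketchBuilder<T>: Default {
    type Container;
    /// Absorbs one item and returns a container once one is full.
    fn push(&mut self, item: T) -> Option<Self::Container>;
    /// Flushes a partially filled container, if any.
    fn finish(&mut self) -> Option<Self::Container>;
}

/// Made-up chunk size so the example output stays small.
const CHUNK: usize = 2;

/// Builder that produces plain vectors of at most `CHUNK` items.
struct VecBuilder<T> {
    pending: Vec<T>,
}

impl<T> Default for VecBuilder<T> {
    fn default() -> Self {
        VecBuilder { pending: Vec::new() }
    }
}

impl<T> SketchBuilder<T> for VecBuilder<T> {
    type Container = Vec<T>;

    fn push(&mut self, item: T) -> Option<Vec<T>> {
        self.pending.push(item);
        if self.pending.len() == CHUNK {
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }

    fn finish(&mut self) -> Option<Vec<T>> {
        if self.pending.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.pending))
        }
    }
}

/// Toy "operator" that is generic over the builder, and therefore over
/// the container type it emits.
fn build_all<B, T, I>(items: I) -> Vec<B::Container>
where
    B: SketchBuilder<T>,
    I: IntoIterator<Item = T>,
{
    let mut builder = B::default();
    let mut out = Vec::new();
    for item in items {
        if let Some(container) = builder.push(item) {
            out.push(container);
        }
    }
    if let Some(container) = builder.finish() {
        out.push(container);
    }
    out
}
```

The point of the design, as sketched here, is that `build_all` never names `Vec`: swapping in a builder whose `Container` is region-allocated would change the output representation without touching the operator logic.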

Testing

Bugs

Related

  • Blocked by #13036.
  • (soft) #13037

Subsumes

  • https://github.com/MaterializeInc/materialize/issues/25286
  • https://github.com/MaterializeInc/materialize/issues/23332

Questions

uce • Jun 10 '22 19:06