materialize icon indicating copy to clipboard operation
materialize copied to clipboard

Alter Persist's row and error representation to avoid unused memory

Open antiguru opened this issue 2 years ago • 1 comments

The persist API represents rows and errors in memory in a multiplexed format using Result. This causes to bloat the memory required for rows significantly due to the size mismatch of the row and error variants.

Rows are encoded as a 32-byte SmallVec (8 byte variant, 24 byte inline storage). The DataflowError type has a size of 96 bytes. Multiplexing both in an enum requires 104 bytes (8 byte tag, plus max(32, 96), which wastes 64 bytes per row.

It seems there are two solutions:

  • Heap-allocate the error variant through the use of boxed values. This will reduce the space requirements of the error variant to a pointer, i.e., 8 bytes. An additional cost is incurred at runtime to due pointer dereferences, however we assume errors to be rare and so it should be acceptable.
  • Do not multiplex row and error data. Elsewhere, we keep separate streams of row and error data for this purpose, and the persist API could chose the same approach.

antiguru avatar Aug 08 '22 10:08 antiguru

cc @danhhz

aljoscha avatar Aug 08 '22 11:08 aljoscha

@antiguru and @teskje we can now close this one, right?

aljoscha avatar Jan 20 '23 10:01 aljoscha

Solution 1 mentioned in the description is now implemented in https://github.com/MaterializeInc/materialize/pull/17222. We are still wasting 8 bytes when using a Result<Row, DataflowError> vs. passing them separately so the second proposed solution would be preferable.

It's not entirely clear to me how much of an effect that 8-byte overhead has in practice. It depends on whether we regularly pass huge buffers to persist or not. If we don't then spending effort to try and fix this is probably not worth it. On the other hand, the persist API does influence the implementation of its users. For example, persist_sink keeps a corrections: Vec<Result<Row, DataflowError>>, presumably because it is convenient to use with the persist API.

teskje avatar Jan 20 '23 10:01 teskje

@antiguru has a PR that splits them in persist_source, fwiw https://github.com/MaterializeInc/materialize/pull/17156

danhhz avatar Jan 20 '23 15:01 danhhz