
Parquet output

the80srobot opened this issue 7 months ago • 9 comments

I would like to add parquet output support to Santa, however there are some trade-offs that might not be acceptable to you. I'd like to have the discussion and the pros/cons in one place (this issue).

Goals

  • Decide if Santa would like a patch series adding parquet support
  • If yes, decide on the approach

Why parquet?

Parquet is the de facto standard columnar interchange format: most data platforms can ingest it natively, it supports fairly rich schemas, and it has decent performance.

Most Santa deployments are probably converting logs to a columnar format on the backend to get more efficient storage and compute. Those same advantages already apply on the host itself: a parquet file is going to be smaller than the equivalent fsspool folder, and will therefore spend less CPU and bandwidth on network IO. (This trade-off was already known to the designers of protocol buffers, hence their variable-length integer encoding.)

In summary: parquet support makes it easier to adopt Santa for people who already use the format on their backend, which is reason enough on its own. Additionally, it may turn out to save CPU time and bandwidth for existing users of protobuf + fsspool.

Why not parquet?

Briefly: code quality and dependency size. The main implementation of parquet is in the Apache Arrow library, which is complex and has a lot of dependencies, including thrift and boost. The codebase itself is large, with no obvious modularization or layering: even though different parts have different coding styles and build options (e.g. exceptions vs. no exceptions), they are heavily interdependent in all directions. The external dependencies are mostly fetched transparently from upstreams, or expected to be installed on the system, which largely breaks reproducible builds and adds a supply-chain problem. All of this makes it difficult to add Arrow as a dependency.

Building Santa with parquet support would likely add ~20 MiB to binary size and require at least the following extra dependencies:

  • Apache Arrow
  • Thrift
  • Boost
  • Snappy
  • Zlib
  • Xz
  • Zstd

Implementation sketch

We can add the dependencies to WORKSPACE as http_archive, and check a BUILD file into external_patches. Here's what that looks like for thrift:

http_archive(
    name = "thrift",
    build_file = "//external_patches/thrift:BUILD",
    sha256 = "5da60088e60984f4f0801deeea628d193c33cec621e78c8a43a5d8c4055f7ad9",
    strip_prefix = "thrift-0.13.0",
    urls = [
        "https://github.com/apache/thrift/archive/v0.13.0.tar.gz",
    ],
)

The BUILD file can shell out to make or cmake, or just use cc_library. It looks like the latter just works for most libraries.
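For illustration, a minimal cc_library sketch of an external_patches BUILD file for zstd might look like this (the glob patterns and include paths are assumptions, not a tested build):

cc_library(
    name = "zstd",
    srcs = glob([
        "lib/common/*.c",
        "lib/common/*.h",
        "lib/compress/*.c",
        "lib/compress/*.h",
        "lib/decompress/*.c",
        "lib/decompress/*.h",
    ]),
    hdrs = ["lib/zstd.h"],
    includes = ["lib"],
    visibility = ["//visibility:public"],
)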

A serializer would need to keep a reasonable number of recent messages in memory, already converted to column chunks. A single parquet file can contain one or more such chunks (row groups, in parquet terms). To begin with, we could tune this to target 1-10 chunks per file, depending on how busy the machine is.
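To make that concrete, here's a rough sketch of what a flush could look like against Arrow's C++ API. The Event struct and column names are made up for illustration; real code would build the columns from Santa's message types, and error handling here is minimal:

#include <algorithm>
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

// Hypothetical buffered event; stands in for Santa's real messages.
struct Event {
  int64_t timestamp;
  std::string path;
};

// Converts the buffered rows to columns and writes them out as a
// single parquet file split into a few row groups.
arrow::Status FlushEvents(const std::vector<Event>& events,
                          const std::string& filename) {
  arrow::Int64Builder ts_builder;
  arrow::StringBuilder path_builder;
  for (const Event& e : events) {
    ARROW_RETURN_NOT_OK(ts_builder.Append(e.timestamp));
    ARROW_RETURN_NOT_OK(path_builder.Append(e.path));
  }
  std::shared_ptr<arrow::Array> ts;
  std::shared_ptr<arrow::Array> path;
  ARROW_RETURN_NOT_OK(ts_builder.Finish(&ts));
  ARROW_RETURN_NOT_OK(path_builder.Finish(&path));

  auto schema = arrow::schema({arrow::field("timestamp", arrow::int64()),
                               arrow::field("path", arrow::utf8())});
  auto table = arrow::Table::Make(schema, {ts, path});

  ARROW_ASSIGN_OR_RAISE(auto outfile,
                        arrow::io::FileOutputStream::Open(filename));
  // chunk_size is the number of rows per row group; targeting ~4
  // chunks per file here, per the tuning discussion above.
  int64_t chunk_size =
      std::max<int64_t>(1, static_cast<int64_t>(events.size() / 4));
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                    outfile, chunk_size);
}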

File output can use the existing fsspool implementation, just swapping protobuf files for parquet files.

Because Bazel only builds what a target actually depends on, we can bundle all of these dependencies under an @org_apache_arrow target and depend on it only from the new serializer; builds that don't include parquet support never compile any of it.
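A sketch of that bundling target, checked in as the BUILD file for the arrow repository (the target names for the other dependencies are assumptions, and in practice the globs would need excludes for tests, benchmarks, etc.):

cc_library(
    name = "parquet",
    srcs = glob([
        "cpp/src/arrow/**/*.cc",
        "cpp/src/parquet/**/*.cc",
    ]),
    hdrs = glob([
        "cpp/src/arrow/**/*.h",
        "cpp/src/parquet/**/*.h",
    ]),
    includes = ["cpp/src"],
    visibility = ["//visibility:public"],
    deps = [
        "@boost",
        "@snappy",
        "@thrift",
        "@xz",
        "@zlib",
        "@zstd",
    ],
)

The new serializer then lists only @org_apache_arrow//:parquet in its deps, and nothing else in the Santa build needs to know about the transitive dependencies.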

Alternatives

Parquet is a standardized, stable format, and implementations other than the official one exist.

It may be worthwhile to implement a minimal version of parquet without all the dependencies. Such a project exists for Go, and it could serve as a blueprint. (In fact, we'd only need writer support, and we don't need the higher-level schema support, so we could end up with an even smaller codebase.)

I'm not sure who has time to do this, though.

the80srobot • Nov 08 '23 09:11