pinot icon indicating copy to clipboard operation
pinot copied to clipboard

[PEP] stream plugin that supports object-storage micro-batches with optional inline batch support

Open udaysagar2177 opened this issue 1 month ago • 8 comments

Summary:

Introduce a new stream plugin that can handle external files or inline batches for real-time ingestion.

Motivation:

  • Reduce Kafka cross-AZ replication, production, and consumption network costs.
  • Kafka delivers sub-second message latency but requires substantial infrastructure; many use cases can tolerate higher latency at lower cost.
  • High-volume traffic can exceed Kafka broker throughput or retention limits, necessitating complex operational management.

Core Mechanics

  • Partition-level consumers receive micro-batch descriptor records instead of individual events.
  • Each record triggers a controlled background fetch logic that downloads the referenced object via PinotFS.
  • A dedicated thread extracts events using the configured RecordReader and pushes them into a bounded in-memory queue.
  • The LLC consumer remains unchanged and reads from this in-memory queue as if it were a normal streaming source.

Offset Model

  • Offsets use a serialized JSON structure similar to the Kinesis stream plugin.
  • JSON tracks:
    • The Kafka record offset carrying the micro-batch descriptor record.
    • The intra-file (or intra-batch) event offset.
  • Supports replay correctness, restart recovery, and stable start/end offset behavior.

Other advantages:

  • Enables real-time ingestion of Avro or Parquet files without complicating the architecture with Spark or a separate batch ingestion job.
  • Enables improved compression efficiency using inline batches compared to the message-per-event model.

Micro-batch descriptor protocol

  • The micro-batch protocol defines deterministic sub-selection rules.
  • Consumers may extract:
    • The full batch, or
    • Only the assigned-partition subset,
  • Replay semantics remain fully stable.

Expected Outcome

A stream plugin that supports object-storage micro-batches with optional inline batch support, reduces the impact of Kafka, and simplifies ingestion pipelines for file-based formats.

If this proposal aligns with the project’s direction. I would be happy to move it forward and submit a pull request for review.

udaysagar2177 avatar Dec 08 '25 10:12 udaysagar2177