pipeline_kafka rewrite
We're going to keep the 0.8.2.2 branch alive for Kafka 0.8. The master branch is only going to be compatible with Kafka 0.9+.
The new pipeline_kafka will have a completely new API. Some notes about the implementation and features:
- Broker discovery via Zookeeper
- Offsets will be stored in Kafka, probably backed in PipelineDB as well for usability purposes
- The new Consumer API (consumer groups)
- N concurrent workers can be spawned, each consuming data for all configured topics
- Users can create any number of `(topic, relation)` consume actions, and the worker processes will read from all unique topics and broadcast each topic's messages to all the relations it is mapped to
- All workers will run under the same consumer group, so they'll automatically get partitions assigned by the group manager
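A minimal sketch of the fan-out described above, with a simulated dispatch loop (all names here are illustrative, not the extension's actual API): each worker subscribes to the set of unique topics, and a message on a topic is broadcast to every relation mapped to it.

```python
from collections import defaultdict

# Users create any number of (topic, relation) consume actions.
consume_actions = [
    ("clicks", "clicks_stream"),
    ("clicks", "clicks_audit"),
    ("orders", "orders_stream"),
]

# Workers read from the set of unique topics; build the topic -> relations map.
topic_to_relations = defaultdict(list)
for topic, relation in consume_actions:
    topic_to_relations[topic].append(relation)

def dispatch(topic, message):
    """Broadcast one message to every relation mapped to its topic."""
    return [(relation, message) for relation in topic_to_relations[topic]]

# A message on "clicks" reaches both mapped relations.
print(dispatch("clicks", b"payload"))
```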
Unanswered questions:
- What if users want to replay some data for a particular `(topic, relation)`? Should we just have ad hoc worker processes that do that using the simple API? Basically, they would take arguments that look like `(topic, stream, start_offset, end_offset)`.
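An ephemeral replay worker along these lines could be sketched as follows. The `(log, stream, start_offset, end_offset)` signature mirrors the arguments proposed above; the partition log itself is faked with a list, since this is an illustration of the offset-range semantics rather than a real consumer.

```python
def replay(log, stream, start_offset, end_offset):
    """Copy log entries in [start_offset, end_offset) into stream, then exit.
    Returns the number of messages replayed."""
    replayed = 0
    for offset in range(start_offset, min(end_offset, len(log))):
        stream.append(log[offset])
        replayed += 1
    return replayed

kafka_log = [b"m0", b"m1", b"m2", b"m3", b"m4"]
stream = []
replay(kafka_log, stream, 1, 4)
print(stream)  # [b'm1', b'm2', b'm3']
```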
@derekjn: Thoughts?
This all sounds good to me. Regarding replay, I think an ephemeral bgworker (or maybe a group of them) will work just fine for that.
The other important thing I think we need to address is packaging. The ZK client library is a huge pain to install on non-Ubuntu systems. There just happens to be an apt repo for zookeeper_mt. Building and installing librdkafka is easy and clean, but zookeeper_mt is not so I don't feel great about deferring that complexity to users.
Some options here:
- Include `pipeline_kafka` with core packages like we used to (ideal option)
- Build `pipeline_kafka` packages that can be downloaded and installed (adds no value for users versus the first option and gives us more things to manage)
The main thing is that adding the ZK dependency means it is no longer reasonable for us to expect users to build and install pipeline_kafka themselves.
We might have to bump this up, since the way our parallelism currently works will cause a latency of `parallelism * timeout`. This is because each partition is consumed independently and we wait for the timeout when polling each partition. Super low timeouts burn a lot of CPU, so that's not a fix.
For replay, we can probably start off by just running the replay code in the client process.
More suggestions:
- Kafka 0.10.1 now has the ability to map timestamps to offsets (see `offsetsForTimes` at https://kafka.apache.org/0101/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html). It maps nicely to a WHERE query on a `kafka_timestamp` column, and it would be cool if PipelineDB incorporated that (you'll notice that the API returns a map of offsets per partition).
- I like the fact that you're storing offsets in PipelineDB instead of Kafka/ZooKeeper, because it will allow you to achieve exactly-once semantics. If you plan on using ZooKeeper or Kafka for storing offsets, then you're going to get either at-least-once or at-most-once delivery, which may be an issue for a database
- Avro support (integration with the Kafka schema registry)
- Broker discovery shouldn't be done using ZooKeeper (that's the old way), but using Kafka (via bootstrap servers)
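The timestamp-to-offset suggestion above boils down to: for each partition, find the earliest offset whose timestamp is at or after the target. That semantics can be sketched per partition with a binary search; this is a pure-Python illustration of what `offsetsForTimes` returns, not the actual client call, and the partition data is made up.

```python
import bisect

# Per-partition (timestamp_ms, offset) pairs, sorted by timestamp,
# standing in for a real partition's message log.
partition_index = {
    0: [(1000, 0), (2000, 1), (3000, 2)],
    1: [(1500, 10), (2500, 11)],
}

def offsets_for_times(index, target_ts):
    """For each partition, return the earliest offset whose timestamp is
    >= target_ts, or None if no such message exists (mirroring the
    semantics of the Java client's offsetsForTimes)."""
    result = {}
    for partition, entries in index.items():
        timestamps = [ts for ts, _ in entries]
        i = bisect.bisect_left(timestamps, target_ts)
        result[partition] = entries[i][1] if i < len(entries) else None
    return result

print(offsets_for_times(partition_index, 2000))  # {0: 1, 1: 11}
```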
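On the exactly-once point: the reason database-stored offsets help is that the ingested rows and the consumer offset can be committed in the same transaction, so a crash persists both or neither, and a redelivered batch is detectable. A toy simulation of that invariant, with a dict standing in for the database state:

```python
db = {"rows": [], "offset": 0}

def apply_batch(db, messages, next_offset):
    """Atomically apply a batch and advance the stored offset.
    Batches at or below the committed offset are skipped, so Kafka
    redelivery does not produce duplicate rows."""
    if next_offset <= db["offset"]:
        return db  # already applied; redelivery is a no-op
    # "Commit": build the new state and swap it in one step.
    return {"rows": db["rows"] + messages, "offset": next_offset}

db = apply_batch(db, [b"a", b"b"], 2)
db = apply_batch(db, [b"a", b"b"], 2)  # redelivered batch is ignored
print(db["rows"], db["offset"])  # [b'a', b'b'] 2
```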
Thanks for the suggestions @simplesteph! Will incorporate them when writing the new client.
Hi, what about Avro support here?