pipeline_kafka rewrite
We're going to keep the 0.8.2.2 branch alive for Kafka 0.8. The master branch is only going to be compatible with Kafka 0.9+.
The new pipeline_kafka will have a completely new API. Some notes about the implementation and features:
- Broker discovery via Zookeeper
- Offsets will be stored in Kafka, probably backed in PipelineDB as well for usability purposes
- The new Consumer API (consumer groups)
- N concurrent workers can be spawned, each consuming data for all configured topics
- Users can create any number of `(topic, relation)` consume actions, and the worker processes will read from all unique topics and broadcast each topic's messages to all the relations it is mapped to
- All workers will run under the same consumer group, so they'll automatically get partitions assigned by the group manager
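A minimal sketch of the fan-out described above, with a simulated dispatch loop (all names here are illustrative, not the extension's actual API): each worker subscribes to the set of unique topics, and a message on a topic is broadcast to every relation mapped to it.

```python
from collections import defaultdict

# Users create any number of (topic, relation) consume actions.
consume_actions = [
    ("clicks", "clicks_stream"),
    ("clicks", "clicks_audit"),
    ("orders", "orders_stream"),
]

# Workers read from the set of unique topics; build the topic -> relations map.
topic_to_relations = defaultdict(list)
for topic, relation in consume_actions:
    topic_to_relations[topic].append(relation)

def dispatch(topic, message):
    """Broadcast one message to every relation mapped to its topic."""
    return [(relation, message) for relation in topic_to_relations[topic]]

# A message on "clicks" reaches both mapped relations.
print(dispatch("clicks", b"payload"))
```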
Unanswered questions:
- What if users want to replay some data for a particular `(topic, relation)`? Should we just have ad hoc worker processes that do that using the simple API? Basically, they would take arguments that look like `(topic, stream, start_offset, end_offset)`.
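An ephemeral replay worker along these lines could be sketched as follows. The `(log, stream, start_offset, end_offset)` signature mirrors the arguments proposed above; the partition log itself is faked with a list, since this is an illustration of the offset-range semantics rather than a real consumer.

```python
def replay(log, stream, start_offset, end_offset):
    """Copy log entries in [start_offset, end_offset) into stream, then exit.
    Returns the number of messages replayed."""
    replayed = 0
    for offset in range(start_offset, min(end_offset, len(log))):
        stream.append(log[offset])
        replayed += 1
    return replayed

kafka_log = [b"m0", b"m1", b"m2", b"m3", b"m4"]
stream = []
replay(kafka_log, stream, 1, 4)
print(stream)  # [b'm1', b'm2', b'm3']
```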
@derekjn: Thoughts?
This all sounds good to me. Regarding replay, I think an ephemeral bgworker (or maybe a group of them) will work just fine for that.
The other important thing I think we need to address is packaging. The ZK client library is a huge pain to install on non-Ubuntu systems. There just happens to be an apt repo for zookeeper_mt. Building and installing librdkafka is easy and clean, but zookeeper_mt is not so I don't feel great about deferring that complexity to users.
Some options here:
- Include `pipeline_kafka` with core packages like we used to (ideal option)
- Build `pipeline_kafka` packages that can be downloaded and installed (adds no value for users versus the first option and gives us more things to manage)
The main thing is that adding the ZK dependency means it is no longer reasonable for us to expect users to build and install pipeline_kafka themselves.
We might have to bump this up, since the way our parallelism currently works will cause a latency of `parallelism * timeout`. This is because each partition is consumed independently and we wait for the timeout when polling each partition. Super low timeouts burn a lot of CPU, so that's not a fix.
For replay, we can probably start off by just running the replay code in the client process.
More suggestions:
- Kafka 0.10.1 now has the ability to map timestamps to offsets (see `offsetsForTimes` at https://kafka.apache.org/0101/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html). It maps nicely to a WHERE query on a `kafka_timestamp` column, and it would be cool if PipelineDB incorporated that (you'll notice that the API returns a map of offsets per partition).
- I like the fact that you're storing offsets in PipelineDB instead of Kafka/ZooKeeper, because it will allow you to achieve exactly-once semantics. If you plan on using ZooKeeper or Kafka for storing offsets, then you're going to get either at-least-once or at-most-once delivery, which may be an issue for a database
- Avro support (integration with the Kafka schema registry)
- Broker discovery shouldn't be done using ZooKeeper (that's the old way), but using Kafka (via bootstrap servers)
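The timestamp-to-offset suggestion above boils down to: for each partition, find the earliest offset whose timestamp is at or after the target. That semantics can be sketched per partition with a binary search; this is a pure-Python illustration of what `offsetsForTimes` returns, not the actual client call, and the partition data is made up.

```python
import bisect

# Per-partition (timestamp_ms, offset) pairs, sorted by timestamp,
# standing in for a real partition's message log.
partition_index = {
    0: [(1000, 0), (2000, 1), (3000, 2)],
    1: [(1500, 10), (2500, 11)],
}

def offsets_for_times(index, target_ts):
    """For each partition, return the earliest offset whose timestamp is
    >= target_ts, or None if no such message exists (mirroring the
    semantics of the Java client's offsetsForTimes)."""
    result = {}
    for partition, entries in index.items():
        timestamps = [ts for ts, _ in entries]
        i = bisect.bisect_left(timestamps, target_ts)
        result[partition] = entries[i][1] if i < len(entries) else None
    return result

print(offsets_for_times(partition_index, 2000))  # {0: 1, 1: 11}
```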
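On the exactly-once point: the reason database-stored offsets help is that the ingested rows and the consumer offset can be committed in the same transaction, so a crash persists both or neither, and a redelivered batch is detectable. A toy simulation of that invariant, with a dict standing in for the database state:

```python
db = {"rows": [], "offset": 0}

def apply_batch(db, messages, next_offset):
    """Atomically apply a batch and advance the stored offset.
    Batches at or below the committed offset are skipped, so Kafka
    redelivery does not produce duplicate rows."""
    if next_offset <= db["offset"]:
        return db  # already applied; redelivery is a no-op
    # "Commit": build the new state and swap it in one step.
    return {"rows": db["rows"] + messages, "offset": next_offset}

db = apply_batch(db, [b"a", b"b"], 2)
db = apply_batch(db, [b"a", b"b"], 2)  # redelivered batch is ignored
print(db["rows"], db["offset"])  # [b'a', b'b'] 2
```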
Thanks for the suggestions @simplesteph! Will incorporate them when writing the new client.
Hi, what about Avro support here?