RADAR-Backend
Reliability vs high throughput
Kafka messages can be consumed according to two different strategies:
- at most once: messages can be lost due to consumer faults
- at least once: messages can be read twice due to consumer faults
This design choice drives the consumer implementation.
We may implement both approaches and select the most suitable one after testing (a sketch of the difference is given below).
EDIT @blootsvoets: reversed at most and at least
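To make the difference concrete, here is a rough sketch with the plain Kafka Java consumer (topic and group names are just placeholders, and it assumes a recent kafka-clients with `poll(Duration)`): committing offsets before processing gives at-most-once, committing after processing gives at-least-once.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DeliverySemanticsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "radar-consumers");          // illustrative group id
        props.put("enable.auto.commit", "false");          // commit manually to control semantics
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor-data")); // illustrative topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));

                // At-most-once: commit here, before processing; a crash during
                // processing loses the uncommitted records.
                // consumer.commitSync();

                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }

                // At-least-once: commit after processing; a crash before this
                // commit causes the records to be delivered again.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("key=%s value=%s offset=%d%n",
                record.key(), record.value(), record.offset());
    }
}
```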
Since we add timestamps to all messages, I'd opt for at-least-once processing to start; we can always establish uniqueness by the timestamp. It may be harder to retrieve lost data.
Yes, but this approach opens another issue.
We would use consumer groups to maximise throughput. Groups imply that Kafka periodically and automatically re-assigns partitions across the consumers inside a single group to balance the workload, which results in a new generation of the group. After a rebalance, consumer_X may manage partitions previously handled by consumer_Y. The last-seen timestamp for each key therefore cannot be stored locally (e.g. inside each consumer); this information has to be stored in a data structure shared among all group members. A NonBlockingHashMap could solve this issue.
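A rough sketch of that idea, assuming all consumers of the group run as threads in a single JVM so that an in-process map really is shared; `ConcurrentHashMap` stands in here for NonBlockingHashMap (high-scale-lib), which implements the same `Map` interface.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Shared across all consumer threads of the group to drop duplicates after a rebalance. */
public final class LastSeenRegistry {
    // Key -> highest message timestamp processed so far.
    // A NonBlockingHashMap could be dropped in here instead of ConcurrentHashMap.
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    /**
     * Returns true if the record is new and should be processed,
     * false if an equal or newer timestamp was already seen for this key.
     */
    public boolean markIfNew(String key, long timestamp) {
        boolean[] isNew = {false};
        lastSeen.compute(key, (k, previous) -> {
            if (previous == null || timestamp > previous) {
                isNew[0] = true;
                return timestamp;
            }
            return previous;
        });
        return isNew[0];
    }
}
```

Each consumer thread would call `markIfNew(key, timestamp)` before processing a record (with the timestamp taken from the payload, or from `ConsumerRecord.timestamp()`) and skip it when the method returns false.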
@fnobilia Presumably this is going to depend on the requirements of the Consumer / use case? I think At-Least-Once is generally recommended though.
https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIgetexactly-oncemessagingfromKafka? As far as I'm aware, exactly-once consumption can be done, but consumers need to checkpoint locally and transactionally. However, in practice this overhead also places constraints on throughput.
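Roughly what that local transactional checkpoint could look like (just a sketch, with `enable.auto.commit=false`; `OffsetStore` is a hypothetical local transactional store, not an existing Kafka API): the result and the next offset are persisted together, and on assignment the consumer seeks back to the checkpointed position.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOnceSketch {
    /** Hypothetical local store that commits result and offset in one transaction. */
    interface OffsetStore {
        long loadOffset(TopicPartition partition);                        // -1 if unknown
        void saveAtomically(TopicPartition partition, long nextOffset, String result);
    }

    public static void run(KafkaConsumer<String, String> consumer, OffsetStore store) {
        consumer.subscribe(Collections.singletonList("sensor-data"), new ConsumerRebalanceListener() {
            @Override public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
            @Override public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Resume from the locally checkpointed position instead of Kafka's committed offset.
                for (TopicPartition partition : partitions) {
                    long offset = store.loadOffset(partition);
                    if (offset >= 0) {
                        consumer.seek(partition, offset);
                    }
                }
            }
        });
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                String result = process(record);
                TopicPartition partition = new TopicPartition(record.topic(), record.partition());
                // Result and next offset are persisted in one transaction, so a crash cannot
                // produce output without advancing the checkpoint, or vice versa.
                store.saveAtomically(partition, record.offset() + 1, result);
            }
        }
    }

    private static String process(ConsumerRecord<String, String> record) {
        return record.value().toUpperCase();
    }
}
```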
Also take a look at Flink Checkpointing http://data-artisans.com/kafka-flink-a-practical-how-to/
Unfortunately, exactly-once semantics are still under active investigation (1 and 2).
Apache Flink may be useful in the near future to provide fault tolerance and scalability within the analysis layer.