Allow Retrieval of a Subset of Kafka Message from a Topic
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Use Cases
The tool kcat allows you retrieve a subset of message from a topic.
It would be useful to have this feature in vector for debugging instead of consuming and entire topic, currently 300GB+, or waiting for specific events to happen.
kcat allows for a variety of different scenarios with the -o flag.
Consumer options:
-o <offset> Offset to start consuming from:
beginning | end | stored |
<value> (absolute offset) |
-<value> (relative offset from end)
s@<value> (timestamp in ms to start at)
e@<value> (timestamp in ms to stop at (not included))
Read the last 2000 messages from 'syslog' topic, then exit
$ kcat -C -b mybroker -t syslog -p 0 -o -2000 -e
Consume messages between two timestamps
$ kcat -b mybroker -C -t mytopic -o s@1568276612443 -o e@1568276617901
Attempted Solutions
I tried looking for configuration option in the kafka source and librdkafka documentation but was unable to find the correct setting.
Proposal
Replicate the functionality from the -o flag in kcat by adding additional configuration options.
References
No response
Version
vector 0.44.0 (x86_64-unknown-linux-gnu 3cdc7c3 2025-01-13 21:26:04.735691656)
there already exists auto_offset_reset configuration item, so i think the offset range is not particularly necessary for test
If offsets for consumer group do not exist, set them using this strategy.
See the librdkafka documentation for the auto.offset.reset option for further clarification.
My Kafka topic has logs for about 200-300 different micro services and retains them for 48 hours. There is a lot of data. I need to debug my instance of vector and how it's consuming the logs from only one of those services. Using auto_offset_reset only lets me use earliest, latest etc which doesn't work. I need to pinpoint a specific problematic log using the offset id or narrow down the search to a specific timeframe.