Allow Retrieval of a Subset of Kafka Messages from a Topic

Open mlladb opened this issue 6 months ago • 2 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

The kcat tool lets you retrieve a subset of messages from a topic.

It would be useful to have this feature in Vector for debugging, instead of consuming an entire topic (currently 300 GB+) or waiting for specific events to happen.

kcat allows for a variety of different scenarios with the -o flag.

Consumer options:
  -o <offset>        Offset to start consuming from:
                     beginning | end | stored |
                     <value>  (absolute offset) |
                     -<value> (relative offset from end)
                     s@<value> (timestamp in ms to start at)
                     e@<value> (timestamp in ms to stop at (not included))
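
To make the forms in the help text above concrete, here is a minimal Python sketch that classifies a single -o value. It is only an illustration of the syntax as listed, not kcat's or Vector's actual parser; the names OffsetSpec and parse_offset and the kind labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OffsetSpec:
    """Hypothetical structured form of one -o value."""
    kind: str                    # beginning, end, stored, absolute, tail, start_ts, end_ts
    value: Optional[int] = None  # offset, message count, or timestamp in ms


def _number(digits: str, original: str) -> int:
    # Reject anything that is not a plain non-negative integer.
    if not digits.isdigit():
        raise ValueError(f"not a valid offset: {original!r}")
    return int(digits)


def parse_offset(text: str) -> OffsetSpec:
    """Classify one -o value according to the forms listed in the help text."""
    if text in ("beginning", "end", "stored"):
        return OffsetSpec(text)
    if text.startswith("s@"):   # timestamp in ms to start at
        return OffsetSpec("start_ts", _number(text[2:], text))
    if text.startswith("e@"):   # timestamp in ms to stop at (not included)
        return OffsetSpec("end_ts", _number(text[2:], text))
    if text.startswith("-"):    # relative offset from the end
        return OffsetSpec("tail", _number(text[1:], text))
    return OffsetSpec("absolute", _number(text, text))
```

For example, parse_offset("-2000") yields a "tail" spec with value 2000, and parse_offset("s@1568276612443") yields a "start_ts" spec.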

Read the last 2000 messages from 'syslog' topic, then exit

$ kcat -C -b mybroker -t syslog -p 0 -o -2000 -e
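
The arithmetic behind a relative offset such as -2000 can be sketched as follows. This assumes a partition whose retained offsets run from a low watermark up to, but not including, a high watermark. Clamping to the low watermark when fewer messages are retained is an assumption of this sketch, not confirmed kcat or librdkafka behavior, and the function names are hypothetical.

```python
def tail_start(low: int, high: int, count: int) -> int:
    """Return the first offset to read so that at most `count` offsets precede `high`.

    `low` is the oldest retained offset; `high` is the offset the next produced
    message would receive, so the partition currently holds offsets low..high-1.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    # Assumption: clamp to `low` when the partition holds fewer than `count` offsets.
    return max(low, high - count)


def tail_range(low: int, high: int, count: int) -> range:
    """Offsets a 'last N, then exit' read would cover under the assumptions above."""
    return range(tail_start(low, high, count), high)
```

With low=0 and high=10000, a count of 2000 starts at offset 8000. On compacted topics some offsets in that range may hold no message, so fewer than 2000 messages could be returned.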

Consume messages between two timestamps

$ kcat -b mybroker -C -t mytopic -o s@1568276612443 -o e@1568276617901
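
The s@ and e@ values are timestamps in milliseconds. A small Python sketch for converting between wall-clock times and those values, and for checking whether a message timestamp falls inside the window, might look like this. The helper names are hypothetical, and timestamps are assumed to be milliseconds since the Unix epoch in UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to integer milliseconds since the epoch."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def from_epoch_ms(ms: int) -> datetime:
    """Convert milliseconds since the epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


def in_window(ts_ms: int, start_ms: int, end_ms: int) -> bool:
    """Half-open window: the start is included, the end is not (as the help text says for e@)."""
    return start_ms <= ts_ms < end_ms
```

Under these assumptions, the two timestamps in the command above correspond to 2019-09-12 08:23:32.443 UTC and 2019-09-12 08:23:37.901 UTC, a window of about 5.5 seconds.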

Attempted Solutions

I looked for a configuration option in the kafka source and librdkafka documentation, but was unable to find the correct setting.

Proposal

Replicate the functionality of kcat's -o flag by adding configuration options to the kafka source.

References

No response

Version

vector 0.44.0 (x86_64-unknown-linux-gnu 3cdc7c3 2025-01-13 21:26:04.735691656)

mlladb, Jun 06 '25 03:06

The auto_offset_reset configuration option already exists, so I don't think an offset range is particularly necessary for testing.

auto_offset_reset

If offsets for consumer group do not exist, set them using this strategy.

See the librdkafka documentation for the auto.offset.reset option for further clarification.

pinylin, Jun 06 '25 06:06

My Kafka topic holds logs for about 200-300 different microservices and retains them for 48 hours. There is a lot of data. I need to debug my instance of Vector and how it consumes the logs from only one of those services. auto_offset_reset only lets me choose values such as earliest or latest, which doesn't work here. I need to pinpoint a specific problematic log by its offset, or narrow the search down to a specific timeframe.

mlladb, Jun 06 '25 08:06