brooklin
S3 connector
Is your feature request related to a problem? Please describe.
Similar to the file connector, ingesting data from S3 would be fantastic.
S3 can emit notifications of new files onto SQS, Kinesis, etc. so it may be beneficial to hook in there.
Essentially, it would be great if Brooklin could be notified of new S3 files and then ingest the actual files, so we can output them onto Kafka.
It may be necessary to differentiate between different types of files:
- Plain-text line-by-line
- Single-line JSON objects
- Pretty-printed JSON
Finally, using `java.util.zip`'s `GZIPInputStream` and `ZipInputStream`, files could be unarchived on-the-fly.
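For the plain-text and gzip cases, decompression can be wrapped around the object's input stream so lines are produced as the bytes arrive, without buffering the whole file. A minimal sketch using only `java.util.zip` (the class and helper names are illustrative, and an in-memory byte array stands in for the S3 object stream):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.zip.*;

public class GzipLineReader {

    // Decompress a gzip stream on the fly and read it line by line,
    // never materializing the whole decompressed file in memory.
    static List<String> readLines(InputStream compressed) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Demo helper: gzip a string in memory, standing in for an S3 object.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = gzip("first line\nsecond line\n");
        System.out.println(readLines(new ByteArrayInputStream(payload)));
    }
}
```

`.zip` archives would follow the same pattern with `ZipInputStream`, iterating entries via `getNextEntry()` and reading each entry's bytes the same way.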
Describe the solution you'd like
Provide the system with an S3 bucket and credentials.
New S3 files will be streamed into the data sink (the AWS REST API allows actual streaming of files). Depending on the type of file, different logic is applied to unarchive/read it (see above).
I'd like to have the file streamed into separate Kafka messages depending on the above logic.
For example:
- New file `foo.tar.gz` is written to S3
- Notification is emitted by AWS
- File is streamed into Brooklin
- File is automatically unarchived using `GZIPInputStream`
- File contains `1.json` and `2.json`, which have pretty-printed JSON objects inside
- Send each JSON object from each file as a separate message onto Kafka
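Pretty-printed JSON is the awkward case here, since one record spans many lines; the connector would need to cut the stream into complete top-level objects before producing one Kafka message per object. A minimal brace-depth splitter sketch (the class name is illustrative, and a production implementation would more likely use a streaming parser such as Jackson):

```java
import java.util.*;

public class JsonObjectSplitter {

    // Split concatenated (possibly pretty-printed) top-level JSON objects
    // into one string per object. Tracks brace depth, and ignores braces
    // that occur inside string literals (including escaped quotes).
    static List<String> split(String input) {
        List<String> objects = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (depth == 0 && Character.isWhitespace(c)) {
                continue; // skip whitespace between objects
            }
            current.append(c);
            if (inString) {
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    // A complete top-level object: emit it as one message.
                    objects.add(current.toString());
                    current.setLength(0);
                }
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        String pretty = "{\n  \"a\": 1\n}\n{\n  \"b\": \"x}y\"\n}";
        System.out.println(split(pretty).size()); // 2
    }
}
```

Each element of the returned list would then become the value of one Kafka message; single-line JSON and plain-text files reduce to the simpler line-by-line case.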
Describe alternatives you've considered
- Custom implementation of the above logic using an SQS client and Kafka Streams
- Kafka Connect has an S3 connector, but the officially supported one only allows Kafka -> S3, not S3 as a source
Additional context
This would be an extremely valuable connector when working with systems that can export their data feeds to S3.