brooklin
S3 connector
Is your feature request related to a problem? Please describe.
Similar to the file connector, ingesting data from S3 would be fantastic.
S3 can emit notifications of new files onto SQS, Kinesis, etc. so it may be beneficial to hook in there.
Essentially, it would be great if Brooklin could be notified of new S3 files and then ingest the actual files, so we can output them onto Kafka.
It may be necessary to differentiate between different types of files:
- Plain-text line-by-line
- Single-line JSON objects
- Pretty-printed JSON
Finally, using `java.util.zip`'s `GZIPInputStream` and `ZipInputStream`, files could be unarchived on-the-fly.
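For the plain-text and gzip cases, decompression can be wrapped around the object's input stream so lines are produced as the bytes arrive, without buffering the whole file. A minimal sketch using only `java.util.zip` (the class and helper names are illustrative, and an in-memory byte array stands in for the S3 object stream):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.zip.*;

public class GzipLineReader {

    // Decompress a gzip stream on the fly and read it line by line,
    // never materializing the whole decompressed file in memory.
    static List<String> readLines(InputStream compressed) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Demo helper: gzip a string in memory, standing in for an S3 object.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = gzip("first line\nsecond line\n");
        System.out.println(readLines(new ByteArrayInputStream(payload)));
    }
}
```

`.zip` archives would follow the same pattern with `ZipInputStream`, iterating entries via `getNextEntry()` and reading each entry's bytes the same way.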
Describe the solution you'd like
Provide the system with an S3 bucket and credentials.
New S3 files will be streamed into the data sink (the AWS REST API allows actual streaming of files). Depending on the type of file, different logic is applied to unarchive/read it (see above).
I'd like to have the file streamed into separate Kafka messages depending on the above logic.
For example:
- New file `foo.tar.gz` is written to S3
- Notification is emitted by AWS
- File is streamed into Brooklin
- File is automatically unarchived using `GZIPInputStream`
- File contains `1.json` and `2.json`, which have pretty-printed JSON objects inside
- Send each JSON object from each file as a separate message onto Kafka
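Pretty-printed JSON is the awkward case here, since one record spans many lines; the connector would need to cut the stream into complete top-level objects before producing one Kafka message per object. A minimal brace-depth splitter sketch (the class name is illustrative, and a production implementation would more likely use a streaming parser such as Jackson):

```java
import java.util.*;

public class JsonObjectSplitter {

    // Split concatenated (possibly pretty-printed) top-level JSON objects
    // into one string per object. Tracks brace depth, and ignores braces
    // that occur inside string literals (including escaped quotes).
    static List<String> split(String input) {
        List<String> objects = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (depth == 0 && Character.isWhitespace(c)) {
                continue; // skip whitespace between objects
            }
            current.append(c);
            if (inString) {
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    // A complete top-level object: emit it as one message.
                    objects.add(current.toString());
                    current.setLength(0);
                }
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        String pretty = "{\n  \"a\": 1\n}\n{\n  \"b\": \"x}y\"\n}";
        System.out.println(split(pretty).size()); // 2
    }
}
```

Each element of the returned list would then become the value of one Kafka message; single-line JSON and plain-text files reduce to the simpler line-by-line case.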
Describe alternatives you've considered
- Custom implementation of the above logic using an SQS client and Kafka Streams
- Kafka Connect has an S3 connector, but the officially supported one only allows Kafka -> S3, not S3 as a source
Additional context
This would be an extremely valuable connector when working with systems that can export their data feeds to S3.