eel-sdk
An EEL sink for Flume
The experimental Kite Dataset sink exists for Flume 1.6.0 and, on the face of it, has the capability to ingest directly into Hive tables.
- See the following for a comprehensive list of flume sinks: https://flume.apache.org/FlumeUserGuide.html#flume-sinks
- See the following for a comprehensive list of flume sources: https://flume.apache.org/FlumeUserGuide.html#flume-sources
- The canonical in-memory format of a Flume event is simply called Event (see https://flume.apache.org/releases/content/1.2.0/apidocs/org/apache/flume/Event.html) - it consists of a body and headers - note that for the Avro source this is marshalled across the wire as Avro binary.
Headers are usually used for content-based routing and multiplexing an event to different sinks.
- When writing the custom EEL sink one takes an event off the channel (queue), interprets the headers if necessary, and transforms the body (payload) into an EEL frame so that it can be passed directly into an EEL sink (see the sketch after this list).
See https://flume.apache.org/FlumeUserGuide.html#kite-dataset-sink
- Now my guess, without looking at the source code, is that the Kite Dataset Sink extracts the byte stream from the body and deserialises it to an Avro GenericRecord, which in turn can be passed directly into the Kite write API.
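To make that concrete, here is a minimal sketch (not the actual Kite code) of a custom sink that takes an event off the channel inside a transaction and deserialises the Avro-binary body into a GenericRecord. The assumption that the writer schema arrives via a `schema` property in the agent configuration is mine, and the hand-off into EEL is only indicated by a comment since the exact EEL write API isn't settled here:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.flume.{Context, EventDeliveryException}
import org.apache.flume.Sink.Status
import org.apache.flume.conf.Configurable
import org.apache.flume.sink.AbstractSink

// Sketch of a custom Flume sink: take an event off the channel, deserialise the
// Avro-binary body into a GenericRecord, then (eventually) hand it to EEL.
class EelFlumeSink extends AbstractSink with Configurable {

  private var reader: GenericDatumReader[GenericRecord] = _

  // Assumption: the writer schema is supplied in the agent configuration,
  // e.g. agent.sinks.eel.schema = {"type":"record", ...}
  override def configure(context: Context): Unit = {
    val schema = new Schema.Parser().parse(context.getString("schema"))
    reader = new GenericDatumReader[GenericRecord](schema)
  }

  override def process(): Status = {
    val channel = getChannel
    val txn = channel.getTransaction
    txn.begin()
    try {
      val event = channel.take()
      if (event == null) {
        // Nothing on the channel right now - back off and try again later.
        txn.commit()
        Status.BACKOFF
      } else {
        val decoder = DecoderFactory.get().binaryDecoder(event.getBody, null)
        val record: GenericRecord = reader.read(null, decoder)
        // TODO: convert `record` to an EEL row/frame and push it into the
        // configured EEL sink (see the conversion sketch further down).
        txn.commit()
        Status.READY
      }
    } catch {
      case t: Throwable =>
        txn.rollback()
        throw new EventDeliveryException("Failed to deliver event to EEL", t)
    } finally {
      txn.close()
    }
  }
}
```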
I think for EEL we should do something similar:
- Deserialise the payload to a GenericRecord
- Transform the GenericRecord to an EEL frame - note that from each GenericRecord you can ascertain the Avro schema, so it should be trivial to convert to a Frame schema (see the conversion sketch after this list).
- Pass the frame to the EEL sink.
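As a rough illustration of the last two steps, the conversion could look something like the following. The FrameSchema and Row case classes are simple stand-ins of my own rather than EEL's actual types; a real implementation would build the io.eels equivalents instead:

```scala
import scala.jdk.CollectionConverters._
import org.apache.avro.generic.GenericRecord

// Stand-ins for EEL's schema and row types; a real implementation would build
// the actual io.eels equivalents here.
case class FrameSchema(fieldNames: Seq[String])
case class Row(schema: FrameSchema, values: Seq[Any])

object GenericRecordConversions {

  // Every GenericRecord carries its Avro schema, so the frame schema can be
  // derived directly from the record itself.
  def toFrameSchema(record: GenericRecord): FrameSchema =
    FrameSchema(record.getSchema.getFields.asScala.map(_.name).toList)

  def toRow(record: GenericRecord): Row = {
    val schema = toFrameSchema(record)
    Row(schema, schema.fieldNames.map(name => record.get(name)))
  }

  // The resulting rows are then handed to whichever EEL sink has been
  // configured (Hive, Parquet, etc.).
}
```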
- Client side you can send events via the Avro or Thrift RPC client - see https://flume.apache.org/FlumeDeveloperGuide.html#rpc-client-interface (a small example follows below).
There are various options for batching up events and sending securely over SSL - you could even send via Kafka to a Flume Kafka source.
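For reference, client-side usage of the Avro RPC client is roughly as follows; the host, port, headers and payload are made-up examples:

```scala
import java.nio.charset.StandardCharsets
import scala.jdk.CollectionConverters._
import org.apache.flume.api.RpcClientFactory
import org.apache.flume.event.EventBuilder

object FlumeRpcExample extends App {

  // Connect to a Flume agent that exposes an Avro source (host/port are examples).
  val client = RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414)
  try {
    // Headers are free-form key/value pairs, typically used by the agent for
    // content-based routing / multiplexing to different sinks.
    val headers = Map("table" -> "person", "source" -> "jdbc").asJava
    val body = """{"name":"sam","age":100}""".getBytes(StandardCharsets.UTF_8)
    client.append(EventBuilder.withBody(body, headers))
    // Batching (client.appendBatch(...)) cuts down the number of RPC round trips.
  } finally {
    client.close()
  }
}
```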
You can assign this one to me.
Do we still want a Flume connector? I think Flume is more suited to streaming data, running on a continuous basis, and eel is very much batch-based.
I am fine with this... Yeah, I guess its main purpose is streaming, but I have used it for ingesting large batches of events (rows) into HDFS.
My initial thought is you could have a scenario like this...
JdbcSource -> FlumeSink
- The EEL FlumeSink would accept rows and convert each row into a FlumeEvent, then send it on to a FlumeAgent (see the sketch after this list)
- The FlumeSink wraps a FlumeClient - there are canned ones for AvroRpc, ThriftRpc, Kafka, JMS, etc... Note events can be batched to mitigate the number of RPCs
- The FlumeAgent itself can be configured to accept events on an AvroSource, which in turn routes them to one of its canned sinks, e.g. HdfsSink - you can even write your own custom EelSink
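A first cut of that FlumeSink could be a thin wrapper around the Flume RPC client, along these lines. The Row type and the comma-separated body are placeholders of mine; a real sink would accept EEL rows and serialise them as Avro binary (or whatever the agent's source expects):

```scala
import scala.jdk.CollectionConverters._
import org.apache.flume.Event
import org.apache.flume.api.{RpcClient, RpcClientFactory}
import org.apache.flume.event.EventBuilder

// Placeholder for an EEL row; the real sink would accept io.eels rows.
case class Row(values: Seq[Any])

// Sketch of an EEL FlumeSink: converts each row into a Flume event and ships
// batches of events to a remote Flume agent over the Avro RPC client.
class FlumeSink(host: String, port: Int, batchSize: Int = 100) {

  private val client: RpcClient = RpcClientFactory.getDefaultInstance(host, port, batchSize)

  // Placeholder serialisation: a real implementation would write the row as
  // Avro binary (or whatever format the agent's source is expecting).
  private def toEvent(row: Row): Event =
    EventBuilder.withBody(row.values.mkString(",").getBytes("UTF-8"))

  // Batch events to mitigate the number of RPC round trips.
  def write(rows: Seq[Row]): Unit =
    rows.grouped(batchSize).foreach { batch =>
      client.appendBatch(batch.map(toEvent).asJava)
    }

  def close(): Unit = client.close()
}
```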
With Flume there's no need for Hadoop to be installed on the client machine - there are other features I haven't touched upon.
Kite have written an interceptor (Morphlines) which is invoked before events hit the sink... they have a bunch of modules and a DSL for transforming events.
Ok, let's do a Flume sink.