
An EEL sink for Flume

hannesmiller opened this issue 9 years ago • 4 comments

An experimental Kite Dataset sink exists for Flume 1.6.0 which, on the face of it, has the capability of ingesting directly into Hive tables.

  • See the following for a comprehensive list of flume sinks: https://flume.apache.org/FlumeUserGuide.html#flume-sinks
  • See the following for a comprehensive list of flume sources: https://flume.apache.org/FlumeUserGuide.html#flume-sources
  • The canonical in-memory format of a Flume event is simply called Event (see https://flume.apache.org/releases/content/1.2.0/apidocs/org/apache/flume/Event.html) - it consists of a body (a byte array) and headers (a map of strings). Note that for the AVRO source this is marshalled across the wire as AVRO binary.

Headers are usually used for content-based routing and multiplexing an event to different sinks.
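For illustration, here is a minimal sketch (in Scala) of building an event with headers via Flume's EventBuilder - the header keys "table" and "env" are invented examples of values a channel selector could route on:

```scala
import java.nio.charset.StandardCharsets
import org.apache.flume.Event
import org.apache.flume.event.EventBuilder

// Headers are just a string map; a channel selector can multiplex on them.
val headers = new java.util.HashMap[String, String]()
headers.put("table", "orders")
headers.put("env", "prod")

val event: Event = EventBuilder.withBody(
  "some payload".getBytes(StandardCharsets.UTF_8), headers)
```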

  • When writing the custom EEL sink one takes an event off the channel (queue), interprets the headers if necessary, and transforms the body (payload) into an EEL frame so that it can be passed directly into an EEL sink.
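As a rough sketch of that shape, assuming Flume's AbstractSink (the EEL conversion itself is elided):

```scala
import org.apache.flume.{Channel, Context, Event, Sink, Transaction}
import org.apache.flume.conf.Configurable
import org.apache.flume.sink.AbstractSink

// Skeleton of a custom sink: take an event off the channel inside a
// transaction, then hand the converted payload to EEL (not shown).
class EelSink extends AbstractSink with Configurable {

  override def configure(context: Context): Unit = {
    // read sink settings (e.g. target table) from the agent config
  }

  override def process(): Sink.Status = {
    val channel: Channel = getChannel
    val tx: Transaction = channel.getTransaction
    tx.begin()
    try {
      val event: Event = channel.take()
      if (event == null) {
        tx.commit()
        Sink.Status.BACKOFF // nothing available; back off and retry
      } else {
        // event.getHeaders -> routing decisions
        // event.getBody    -> deserialise and convert to an EEL frame row
        tx.commit()
        Sink.Status.READY
      }
    } catch {
      case t: Throwable =>
        tx.rollback()
        throw t
    } finally {
      tx.close()
    }
  }
}
```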

See https://flume.apache.org/FlumeUserGuide.html#kite-dataset-sink

  • Now my guess, without looking at the source code, is that the Kite Dataset Sink extracts the byte stream from the body and deserialises it to an AVRO GenericRecord, which in turn can be passed directly into the Kite write API.

I think for EEL we should do something similar:

  1. Deserialise the payload to a GenericRecord
  2. Transform the GenericRecord to an EEL frame - note that from each GenericRecord you can ascertain the AVRO schema, so it should be trivial to convert to a Frame schema (a sketch of steps 1 and 2 follows below).
  3. Pass the frame to the EEL sink.
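A sketch of steps 1 and 2, assuming the writer schema is available to the sink (e.g. from its configuration - in practice the Kite sink resolves this from its dataset descriptor):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Step 1: decode the event body back into a GenericRecord.
def deserialise(body: Array[Byte], writerSchema: Schema): GenericRecord = {
  val reader = new GenericDatumReader[GenericRecord](writerSchema)
  val decoder = DecoderFactory.get().binaryDecoder(body, null)
  reader.read(null, decoder)
}

// Step 2: record.getSchema exposes the AVRO schema, so the frame schema
// (field names and types) can be derived from it, and record.get(name)
// yields each field value for the frame row.
```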
  • Client-side, you can send events via the AVRO or Thrift RPC client - see https://flume.apache.org/FlumeDeveloperGuide.html#rpc-client-interface

There are various options for batching up events and sending securely over SSL - you could even send via Kafka to a Flume Kafka Source.
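For example, a minimal client-side sketch using the default (Avro) RPC client - the hostname and port are placeholders:

```scala
import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._
import org.apache.flume.api.{RpcClient, RpcClientFactory}
import org.apache.flume.event.EventBuilder

// getDefaultInstance returns an Avro RPC client pointed at a Flume
// agent's AvroSource; "flume-host" and 4141 are placeholder details.
val client: RpcClient = RpcClientFactory.getDefaultInstance("flume-host", 4141)
try {
  // appendBatch sends many events in a single RPC.
  val batch = (1 to 100).map { i =>
    EventBuilder.withBody(s"row-$i".getBytes(StandardCharsets.UTF_8))
  }
  client.appendBatch(batch.asJava)
} finally {
  client.close()
}
```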

hannesmiller avatar Sep 20 '16 06:09 hannesmiller

You can assign this one to me.

hannesmiller avatar Feb 17 '17 16:02 hannesmiller

Do we still want a flume connector? I think flume is more suited to streaming data, running on a continuous basis, and eel is very much batch-based.

sksamuel avatar Jul 11 '17 13:07 sksamuel

I am fine with this... Yeah, I guess its main purpose is streaming, but I have used it for ingesting large batches of events (rows) into HDFS.

My initial thought is you could have a scenario like this...

JdbcSource -> FlumeSink

  1. The EEL FlumeSink would accept rows and convert each row into a FlumeEvent, then send it on to a FlumeAgent (sketched after this list)
  2. The FlumeSink wraps a FlumeClient - there are canned ones for AvroRpc, ThriftRpc, Kafka, JMS, etc. Note that events can be batched to reduce the number of RPCs
  3. The FlumeAgent itself can be configured to accept events on an AvroSource, which in turn routes them to one of its canned sinks, e.g. HdfsSink - you can even write your own custom EelSink
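A rough sketch of point 1. This assumes EEL's Sink exposes something like writer(schema): SinkWriter - the exact trait names may differ by version - and encodes rows as CSV purely as a stand-in; a real implementation would more likely serialise to AVRO binary:

```scala
import java.nio.charset.StandardCharsets
import io.eels.{Row, Sink, SinkWriter}
import io.eels.schema.StructType
import org.apache.flume.api.{RpcClient, RpcClientFactory}
import org.apache.flume.event.EventBuilder

// Hypothetical FlumeSink: wraps an RPC client and turns each row into a
// Flume event. Host/port and the CSV encoding are illustrative only.
class FlumeSink(host: String, port: Int) extends Sink {
  override def writer(schema: StructType): SinkWriter = new SinkWriter {
    private val client: RpcClient = RpcClientFactory.getDefaultInstance(host, port)

    override def write(row: Row): Unit = {
      val body = row.values.mkString(",").getBytes(StandardCharsets.UTF_8)
      client.append(EventBuilder.withBody(body))
    }

    override def close(): Unit = client.close()
  }
}
```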

With Flume there's no need for Hadoop to be installed on the client machine - there are other features I haven't touched upon.

Kite have written an interceptor (Morphlines) which is invoked before events hit the sink... they have a bunch of modules and a DSL for transforming events.

hannesmiller avatar Jul 13 '17 05:07 hannesmiller

Ok, let's do a Flume sink.

sksamuel avatar Jul 14 '17 19:07 sksamuel