Support Distributed writes with EEL

Open hannesmiller opened this issue 8 years ago • 3 comments

  • N writers via JdbcSource -> KafkaSink
  • N writers via HiveSink/KuduSink/HBaseSink
  • The HiveSink and the other sinks use a LinkedBlockingQueue to service multiple writer threads. What if they could do this in a distributed fashion by wrapping the LinkedBlockingQueue interface, e.g. with an implementation that wraps a Kafka topic? The default implementation would still be threads.
  • The gotcha is that when you are out-of-process you lose control over how to partition the data into reasonable sizes
  • However, for row-oriented storage systems like Kudu and HBase it's perfect, and the same usage pattern would even work for the JdbcSink
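
The queue-wrapping idea in the list above could be sketched roughly as follows. This is only an illustration under assumptions: `RowQueue`, `InProcessRowQueue` and `QueueDemo` are hypothetical names, not part of EEL, and rows are plain strings rather than EEL rows. The default implementation delegates to a `LinkedBlockingQueue`; a distributed variant would implement the same interface on top of a Kafka topic and is deliberately not shown here.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical abstraction (not EEL's API): the only operations a sink
// needs from its queue, so the backing transport can be swapped.
interface RowQueue<T> {
    void put(T row) throws InterruptedException;
    T take() throws InterruptedException;
}

// Default implementation: rows stay in-process and are served to writer threads.
final class InProcessRowQueue<T> implements RowQueue<T> {
    private final BlockingQueue<T> queue;

    InProcessRowQueue(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    @Override public void put(T row) throws InterruptedException { queue.put(row); }
    @Override public T take() throws InterruptedException { return queue.take(); }
}

final class QueueDemo {
    // Sentinel compared by identity, so no real row can be mistaken for it.
    private static final String POISON = new String("__END__");

    // Feeds `rows` to `writers` threads through a RowQueue and returns
    // how many rows the writers consumed.
    static int run(int writers, List<String> rows) {
        RowQueue<String> queue = new InProcessRowQueue<>(16);
        AtomicInteger written = new AtomicInteger();
        Thread[] threads = new Thread[writers];
        for (int i = 0; i < writers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    while (true) {
                        String row = queue.take();
                        if (row == POISON) return;  // one sentinel stops one writer
                        written.incrementAndGet();  // stand-in for the actual sink write
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        try {
            for (String row : rows) queue.put(row);
            for (int i = 0; i < writers; i++) queue.put(POISON);
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written.get();
    }
}
```

Calling `QueueDemo.run(3, List.of("a", "b", "c", "d"))` pushes four rows through three writer threads. Because the writers only see `RowQueue`, swapping in an out-of-process implementation would not change the writer loop, although the partitioning gotcha mentioned above would still apply.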

What do you think?

hannesmiller avatar Feb 22 '17 20:02 hannesmiller

Sorry, I don't understand. At the moment HiveSink already has multiple writers, through multiple threads. Do you mean something else?

sksamuel avatar Feb 23 '17 02:02 sksamuel

Probably not such a great idea, as there may be better approaches.

I understand the multiple writer threads, but what if you could flip a switch to share the writes across stateless worker processes, i.e. spin up Kafka/YARN workers?

  • Each YARN worker accepts rows over, say, Kafka and uses EEL to write

hannesmiller avatar Feb 23 '17 09:02 hannesmiller

You can, but then it's no longer a standalone in-process API and is turning into something more like Spark.

sksamuel avatar Feb 23 '17 10:02 sksamuel