eel-sdk
Support Distributed writes with EEL
- N writers via JdbcSource -> KafkaSink
- N writers via HiveSink/KuduSink/HBaseSink
- HiveSink and the other sinks use a LinkedBlockingQueue to service multiple writer threads. What if they could do this in a distributed fashion by wrapping the LinkedBlockingQueue interface, e.g. with an implementation backed by a Kafka topic? The default implementation would still be in-process threads.
- The gotcha is that once you are out-of-process you lose control over how the data is partitioned into reasonably sized chunks
- For row-oriented storage systems like Kudu and HBase, however, it's perfect, and the same usage pattern would even work for the JdbcSink
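To make the queue-wrapping idea concrete, here is a rough sketch in plain Java. None of these names exist in EEL: `RowQueue`, `InProcessRowQueue` and `QueueDemo` are hypothetical, and rows are plain strings for brevity. It only shows the default in-process case. A Kafka-backed variant would be another `RowQueue` implementation whose `put` produces to a topic and whose `take` consumes from it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical abstraction (not an EEL type): the two operations the writers need.
interface RowQueue<T> {
    void put(T row) throws InterruptedException;
    T take() throws InterruptedException;
}

// Default transport: threads in the same process sharing a bounded LinkedBlockingQueue.
final class InProcessRowQueue<T> implements RowQueue<T> {
    private final BlockingQueue<T> queue;

    InProcessRowQueue(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    @Override
    public void put(T row) throws InterruptedException {
        queue.put(row); // blocks while the queue is full
    }

    @Override
    public T take() throws InterruptedException {
        return queue.take(); // blocks while the queue is empty
    }
}

final class QueueDemo {
    // Sentinel telling a writer to stop; one is enqueued per writer.
    static final String POISON = "__END__";

    // Starts `writers` threads that drain the queue, feeds them `rows`,
    // and returns everything the writers "wrote" (in arbitrary order).
    static List<String> drain(RowQueue<String> q, int writers, List<String> rows)
            throws InterruptedException {
        List<String> written = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < writers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String row = q.take();
                        if (row.equals(POISON)) {
                            break;
                        }
                        written.add(row); // stand-in for the real sink write
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }
        for (String row : rows) {
            q.put(row);
        }
        for (int i = 0; i < writers; i++) {
            q.put(POISON);
        }
        for (Thread t : threads) {
            t.join();
        }
        return written;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> out = drain(new InProcessRowQueue<String>(4), 3,
                Arrays.asList("a", "b", "c", "d", "e"));
        System.out.println(out.size() + " rows written");
    }
}
```

The writer loop only ever sees `RowQueue`, so swapping the transport would not change it. The poison-pill shutdown is an assumption that suits the in-process case; an out-of-process transport would need its own end-of-stream signal.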
What do you think?
Sorry, I don't understand. At the moment HiveSink has multiple writers, running on multiple threads. Do you mean something else?
Probably not such a great idea, as there may be better approaches.
I understand the multi-threaded writers, but what if you could flip a switch to share the writes across stateless worker processes, i.e. spin up Kafka/YARN workers?
- Each YARN worker accepts rows over, say, Kafka and uses EEL to write them
You can, but then it's no longer a standalone in-process API and it's turning more into Spark.