Streamline - Flink Runner
Following up on https://github.com/hortonworks/streamline/issues/534, I am opening an issue here to discuss Flink support.
What API would Flink need to implement? How much work needs to be done?
Support for Apache Beam would probably be the most interesting.
@arunmahadevan, can you outline the work? It would be good to create sub-issues if possible.
At a high level, the Streamline topology is represented as a TopologyDAG that captures the sources, sinks and the operations users intend to perform.
During deployment, the topology DAG gets translated into the respective runtime. Each runtime provider needs to implement the TopologyActions interface. The framework invokes TopologyActions.deploy(), passing the DAG as an argument; this is where the runtime-specific translations take place. In the case of Storm, StormTopologyActionsImpl translates the DAG into Storm's flux representation and then submits it to the Storm cluster. A similar implementation will be needed for Flink (e.g. FlinkTopologyActionsImpl).
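To make the deploy path concrete, here is a minimal self-contained sketch. TopologyDag and TopologyActions below are simplified stand-ins for the real Streamline interfaces (the actual signatures differ), and the FlinkTopologyActionsImpl body is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Streamline's TopologyDag; the real DAG exposes typed components.
interface TopologyDag {
    List<String> components();
}

// Stand-in for Streamline's TopologyActions; the real interface has more methods.
interface TopologyActions {
    void deploy(TopologyDag dag);
}

// Hypothetical Flink counterpart of StormTopologyActionsImpl: walk the DAG,
// build the equivalent Flink job and submit it to the cluster.
class FlinkTopologyActionsImpl implements TopologyActions {
    final List<String> translated = new ArrayList<>();

    @Override
    public void deploy(TopologyDag dag) {
        for (String component : dag.components()) {
            // Real code would map each source/processor/sink to a DataStream
            // operator here; we just record the translation for illustration.
            translated.add("flink:" + component);
        }
    }
}

public class DeploySketch {
    public static void main(String[] args) {
        TopologyDag dag = () -> List.of("kafka-source", "rule", "kafka-sink");
        FlinkTopologyActionsImpl actions = new FlinkTopologyActionsImpl();
        actions.deploy(dag);
        System.out.println(actions.translated);
    }
}
```

The point is only to show where the runtime-specific translation hooks in; the actual implementation would depend on Flink's job submission APIs.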
The topology DAG has a traverse method that accepts a TopologyDagVisitor and calls back the appropriate visit methods as the different components in the topology DAG are traversed. In the case of Storm this is StormTopologyFluxGenerator. A similar implementation will have to be provided for Flink (e.g. FlinkDataStreamGenerator).
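The visitor-based traversal can be sketched as follows. The component classes and the traverse loop are simplified stand-ins (the real TopologyDagVisitor has more visit overloads), and FlinkDataStreamGenerator is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for the StreamlineSource/StreamlineProcessor/StreamlineSink
// design-time components.
class Source { final String name; Source(String n) { name = n; } }
class Processor { final String name; Processor(String n) { name = n; } }
class Sink { final String name; Sink(String n) { name = n; } }

// Stand-in for Streamline's TopologyDagVisitor callback interface.
abstract class TopologyDagVisitor {
    void visit(Source s) {}
    void visit(Processor p) {}
    void visit(Sink s) {}
}

// Hypothetical Flink analogue of StormTopologyFluxGenerator: each callback
// would attach a DataStream operator instead of emitting flux YAML.
class FlinkDataStreamGenerator extends TopologyDagVisitor {
    final List<String> pipeline = new ArrayList<>();
    @Override void visit(Source s)    { pipeline.add("env.addSource(" + s.name + ")"); }
    @Override void visit(Processor p) { pipeline.add("stream.flatMap(" + p.name + ")"); }
    @Override void visit(Sink s)      { pipeline.add("stream.addSink(" + s.name + ")"); }
}

public class TraverseSketch {
    // Simplified traverse(): the real DAG walks components in topological order.
    static void traverse(List<Object> components, TopologyDagVisitor visitor) {
        for (Object c : components) {
            if (c instanceof Source) visitor.visit((Source) c);
            else if (c instanceof Processor) visitor.visit((Processor) c);
            else if (c instanceof Sink) visitor.visit((Sink) c);
        }
    }

    public static void main(String[] args) {
        FlinkDataStreamGenerator gen = new FlinkDataStreamGenerator();
        traverse(List.of(new Source("kafka"), new Processor("rule"), new Sink("kafka")), gen);
        System.out.println(gen.pipeline);
    }
}
```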
At a high level, the topology DAG contains implementations of sources (StreamlineSource), sinks (StreamlineSink) and processors (StreamlineProcessor). These capture the operations the user intends to perform and the necessary design-time configuration (what the user configured via the UI). The runtime needs to provide the respective runtime translations for the different design-time components.
I will list the currently supported sources, sinks and processors that need to be translated in separate subtasks. To start with, we could support only a subset of sources and sinks, e.g. only a Kafka source and a Kafka sink.
The different processors will also have to be supported. Right now we have Rules, Join, PMML and Custom processors in the backend. The processors have runtime components that can be reused while building the DataStream pipeline, e.g. the rules are evaluated via RuleProcessorRuntime, which can be re-used (say, within a datastream.flatMap). Maybe we can start with the RuleProcessor; I will create separate tasks for the processors as well.
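The "reuse inside datastream.flatMap" idea could look like the sketch below. Collector here is a stand-in mimicking Flink's org.apache.flink.util.Collector, and this RuleProcessorRuntime is a toy reduction of the real Streamline class (assume only that it can evaluate an event against a rule):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Stand-in mimicking Flink's Collector, so the sketch runs without Flink deps.
interface Collector<T> { void collect(T record); }

// Toy stand-in for Streamline's RuleProcessorRuntime: evaluates a configured
// rule against an event. The real class is more involved but engine-agnostic.
class RuleProcessorRuntime {
    private final Predicate<Integer> rule;
    RuleProcessorRuntime(Predicate<Integer> rule) { this.rule = rule; }
    boolean evaluate(int event) { return rule.test(event); }
}

// Hypothetical wrapper showing how the rule runtime could sit inside a
// Flink flatMap: emit the event only when the rule matches.
class RuleFlatMap {
    private final RuleProcessorRuntime runtime;
    RuleFlatMap(RuleProcessorRuntime runtime) { this.runtime = runtime; }
    void flatMap(int event, Collector<Integer> out) {
        if (runtime.evaluate(event)) out.collect(event);
    }
}

public class RuleFlatMapSketch {
    public static void main(String[] args) {
        RuleFlatMap fn = new RuleFlatMap(new RuleProcessorRuntime(e -> e > 10));
        List<Integer> emitted = new ArrayList<>();
        for (int e : new int[] {5, 42, 7, 99}) fn.flatMap(e, emitted::add);
        System.out.println(emitted);  // prints [42, 99]
    }
}
```

In a real Flink job the wrapper would implement FlatMapFunction and the runtime would be initialized in open(), but the shape of the reuse is the same.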
This is at a high level; we may need to refactor or come up with new interfaces within Streamline itself to support Flink, which we can figure out as we proceed.
@harshach, created an initial set of sub-tasks to get started. We could start with a basic kafka-rule-kafka topology for Flink and add more stuff once that works.
Don't we want to support Apache Beam instead of Flink?
@harshach @arunmahadevan what are the instructions to register Flink as another stream engine in the service pool, so the web UI can recognize it? What configurations need to be added or changed? Thanks.
any updates?