seatunnel icon indicating copy to clipboard operation
seatunnel copied to clipboard

[ST-Engine][LogicalPlan] Generate LogicalPlan from JobConfig file

Open EricJoy2048 opened this issue 2 years ago • 3 comments

Search before asking

  • [X] I had searched in the feature and found no similar feature requirement.

Description

We define job pipeline by job config file, So the first thing the SeaTunnelClient need to do parse the job config file and generate the action list. action is similar to the operator in Flink, It is an encapsulation of the SeaTunnel API. An action contains an instance of SeaTunnelSource or SeaTunnelTransform or SeaTunnelSink. Every action need to know the upstream of it self.

public interface Action extends Serializable {
    @NonNull
    String name();

    void setName(@NonNull String name);

    @NonNull
    List<Action> upstream();

    void addUpstream(@NonNull Action action);
}

We support SourceActionSinkAction and TransformAction yet.

We have two profile to parse job config file.

If there is only one Source and one Sink and at most one Transform, We simply build actions pipeline in the following order.

image

If there are multiple sources or multiple transforms or multiple sink, We will rely on source_table_name and result_table_name to build actions pipeline. So in this case result_table_name is necessary for the Source Connector and all of result_table_name and source_table_name are necessary for Transform Connector. By the end, source_table_name is necessary for Sink Connector.

image

Parallelism

In SeaTunnel Engine, only Source and PartitionTransform Connector support set parallelism. The action which can not set parallelism will extends the upstream action parallelism. If an action have more than one upstream action, The parallelism of this action is the sum of the parallelism of its upstream action. Because if an action's parallelism equals to it's upstream's parallelism, the action can be chain into it's upstream action and run in a same subtask.

image

Usage Scenario

No response

Related issues

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

EricJoy2048 avatar Jul 25 '22 05:07 EricJoy2048

image hi ,please Why does the parallelism of the parallel transformation change from 4 to 2?

2013650523 avatar Jul 27 '22 13:07 2013650523

image hi ,please Why does the parallelism of the parallel transformation change from 4 to 2?

In our design, Only Source and PartitionTransform Connector can set parallelism. The PartitionTransform Connector is used to repartition data and parallelism is the target partition number.

EricJoy2048 avatar Jul 28 '22 01:07 EricJoy2048

image

I can't understand what PartitionTransform does in the diagram. If there is a mechanism for Transform4 to distribute data to PartitionTransform, then why don't we distribute to Sink directly? Wouldn't it be simpler to let the user specify the parallelism of the Sink in this case?

ashulin avatar Jul 28 '22 08:07 ashulin