seatunnel
seatunnel copied to clipboard
[ST-Engine][LogicalPlan] Generate LogicalPlan from JobConfig file
Search before asking
- [X] I had searched in the feature and found no similar feature requirement.
Description
We define job pipeline by job config file, So the first thing the SeaTunnelClient need to do parse the job config file and generate the action list. action
is similar to the operator
in Flink, It is an encapsulation of the SeaTunnel API
. An action contains an instance of SeaTunnelSource
or SeaTunnelTransform
or SeaTunnelSink
. Every action need to know the upstream of it self.
public interface Action extends Serializable {
@NonNull
String name();
void setName(@NonNull String name);
@NonNull
List<Action> upstream();
void addUpstream(@NonNull Action action);
}
We support SourceAction
、SinkAction
and TransformAction
yet.
We have two profile to parse job config file.
If there is only one Source and one Sink and at most one Transform, We simply build actions pipeline in the following order.
data:image/s3,"s3://crabby-images/0db8b/0db8bc41fc868cfb9ab4b39e7b621ca2d41d0c3b" alt="image"
If there are multiple sources or multiple transforms or multiple sink, We will rely on source_table_name
and result_table_name
to build actions pipeline. So in this case result_table_name
is necessary for the Source Connector
and all of result_table_name
and source_table_name
are necessary for Transform Connector
.
By the end, source_table_name
is necessary for Sink Connector.
data:image/s3,"s3://crabby-images/7ea28/7ea282fb1ec86d4a1aec7d265a41d85ff19ea4d8" alt="image"
Parallelism
In SeaTunnel Engine, only Source
and PartitionTransform
Connector support set parallelism
. The action which can not set parallelism
will extends the upstream action parallelism
. If an action have more than one upstream action, The parallelism
of this action is the sum of the parallelism of its upstream action. Because if an action's parallelism
equals to it's upstream's parallelism
, the action can be chain into it's upstream action and run in a same subtask.
data:image/s3,"s3://crabby-images/917d7/917d7c29e7e864005a147fea0dc7fadc7eb93d46" alt="image"
Usage Scenario
No response
Related issues
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
hi ,please Why does the parallelism of the parallel transformation change from 4 to 2?
hi ,please Why does the parallelism of the parallel transformation change from 4 to 2?
In our design, Only Source
and PartitionTransform
Connector can set parallelism. The PartitionTransform
Connector is used to repartition data and parallelism
is the target partition number.
data:image/s3,"s3://crabby-images/a09ac/a09acf5aa12708aed07a1f19b187993f411f1f42" alt="image"
I can't understand what PartitionTransform
does in the diagram.
If there is a mechanism for Transform4
to distribute data to PartitionTransform
, then why don't we distribute to Sink
directly?
Wouldn't it be simpler to let the user specify the parallelism of the Sink
in this case?