[Bug] [Zeta] savePointJob Doesn't work

Open Jetiaime opened this issue 10 months ago • 9 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

I am using the SeaTunnel client and wanted to try out the savepoint function. I launched a job with ./bin/seatunnel.sh -c /Users/liu/Data/10_Code/iWhalecloud/seatunnel-web/profile/13244296710144.conf and then ran ./bin/seatunnel.sh -s 834694399370723329 against it. I waited for a long time, but the submitted job was still running, even after all my data had been moved.

SeaTunnel Version

2.3.4

SeaTunnel Config

{
    "env" : {
        "job.mode" : "BATCH",
        "job.name" : "SeaTunnel_Job"
    },
    "source" : [
        {
            "password" : "wdp123",
            "driver" : "oracle.jdbc.driver.OracleDriver",
            "parallelism" : "32",
            "query" : "SELECT \"ID\", \"NAME\" FROM \"WHS\".\"TEST3\"",
            "connection_check_timeout_sec" : 30,
            "fetch_size" : "10000",
            "result_table_name" : "Table13244444398848",
            "plugin_name" : "Jdbc",
            "user" : "system",
            "url" : "jdbc:oracle:thin:@10.45.46.116:8085:XE"
        }
    ],
    "transform" : [],
    "sink" : [
        {
            "batch_size" : "10000",
            "max_retries" : "1",
            "source_table_name" : "Table13244444398848",
            "max_commit_attempts" : 3,
            "auto_commit" : "true",
            "plugin_name" : "Clickhouse",
            "url" : "jdbc:clickhouse://10.45.151.152:8123",
            "is_exactly_once" : "false",
            "database" : "AA",
            "password" : "Pass-123-whs",
            "transaction_timeout_sec" : -1,
            "driver" : "ru.yandex.clickhouse.ClickHouseDriver",
            "support_upsert_by_query_primary_key_exist" : "false",
            "Clickhouse" : "true",
            "host" : "10.45.151.152:8123",
            "connection_check_timeout_sec" : 30,
            "generate_sink_sql" : "true",
            "user" : "default",
            "table" : "tb_test3",
            "username" : "default"
        }
    ]
}

Running Command

1. ./bin/seatunnel.sh -c /Users/liu/Data/10_Code/iWhalecloud/seatunnel-web/profile/13244296710144.conf
2. ./bin/seatunnel.sh -s 834694399370723329

Error Exception

savePointJob doesn't work.

Zeta or Flink or Spark Version

No response

Java or Scala Version

No response

Screenshots

No response

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

Jetiaime avatar Apr 22 '24 07:04 Jetiaime

I found that there is lock contention in org.apache.seatunnel.engine.server.task.flow.SourceFlowLifeCycle#triggerBarrier. The savepoint barrier runs inside synchronized (collector.getCheckpointLock()), so it has to wait for the checkpoint lock until all records have been collected.

Jetiaime avatar May 09 '24 04:05 Jetiaime

Can you provide a detailed log from the Zeta engine?

happyboy1024 avatar May 09 '24 07:05 happyboy1024

> Can you provide a detailed log from the Zeta engine?

It seems the cause is not lock contention after all: org.apache.seatunnel.connectors.seatunnel.jdbc.internal.JdbcInputFormat#resultSet.next() keeps returning true, so the lock is not released until every record in the ResultSet has been collected.
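
The effect can be reproduced in miniature. This is a minimal sketch, not SeaTunnel code: the class CheckpointLockSketch, its run method and the event names are hypothetical, and a plain Object stands in for the lock returned by collector.getCheckpointLock(). The reader drains one whole "split" while holding the lock, so the barrier thread can only enter its synchronized block afterwards.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.stream.IntStream;

public class CheckpointLockSketch {

    /** Runs one reader and one barrier thread and returns the ordered event log. */
    public static List<String> run(int rowsInSplit) {
        // Stand-in for collector.getCheckpointLock() (assumption: a plain monitor).
        final Object checkpointLock = new Object();
        final List<String> events = Collections.synchronizedList(new ArrayList<>());
        final CountDownLatch readerHoldsLock = new CountDownLatch(1);
        final Iterator<Integer> rows = IntStream.range(0, rowsInSplit).iterator();

        Thread reader = new Thread(() -> {
            synchronized (checkpointLock) {
                readerHoldsLock.countDown();
                events.add("read-start");
                // Comparable to looping while resultSet.next() returns true.
                while (rows.hasNext()) {
                    rows.next();
                }
                events.add("read-end");
            }
        });

        Thread barrier = new Thread(() -> {
            try {
                // Only try for the lock once the reader already holds it.
                readerHoldsLock.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (checkpointLock) {
                events.add("barrier");
            }
        });

        reader.start();
        barrier.start();
        try {
            reader.join();
            barrier.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run(1_000_000));
    }
}
```

Whatever the row count, the "barrier" event is recorded only after "read-end", which matches the savepoint appearing to hang until the single split has been read.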

Jetiaime avatar May 09 '24 08:05 Jetiaime

> Can you provide a detailed log from the Zeta engine?

> It seems the cause is not lock contention after all: org.apache.seatunnel.connectors.seatunnel.jdbc.internal.JdbcInputFormat#resultSet.next() keeps returning true, so the lock is not released until every record in the ResultSet has been collected.

Yes. I'm not sure whether your source table has a primary key. Because you didn't set a partition column, there may be only one split, and the savepoint then waits for the currently executing split to finish. So we need to see some key information from the Zeta engine logs.
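
To illustrate why the partition column matters: with one, the key range can be cut into several splits, and a savepoint only has to wait for the split in progress rather than the whole table. The following is a rough sketch, assuming a numeric column with known lower and upper bounds. RangeSplitSketch, Split and split are hypothetical names, and this is not the connector's actual splitting algorithm.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSplitSketch {

    /** Inclusive [start, end] range of partition-column values. */
    public static final class Split {
        public final long start;
        public final long end;

        Split(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + "]";
        }
    }

    /**
     * Cuts [lower, upper] into at most numSplits contiguous ranges of
     * near-equal size. Overflow for extreme bounds is ignored in this sketch.
     */
    public static List<Split> split(long lower, long upper, int numSplits) {
        if (numSplits <= 0 || upper < lower) {
            throw new IllegalArgumentException("need numSplits > 0 and lower <= upper");
        }
        long total = upper - lower + 1;
        long base = total / numSplits;
        long remainder = total % numSplits;

        List<Split> splits = new ArrayList<>();
        long start = lower;
        for (int i = 0; i < numSplits && start <= upper; i++) {
            // The first `remainder` splits take one extra value each.
            long size = base + (i < remainder ? 1 : 0);
            long end = start + size - 1;
            splits.add(new Split(start, end));
            start = end + 1;
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(split(1, 10, 3));
    }
}
```

Each range could then back its own bounded query on the partition column, so the reader reaches a split boundary far more often than with one unbounded SELECT.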

happyboy1024 avatar May 09 '24 09:05 happyboy1024

> Can you provide a detailed log from the Zeta engine?

> It seems the cause is not lock contention after all: org.apache.seatunnel.connectors.seatunnel.jdbc.internal.JdbcInputFormat#resultSet.next() keeps returning true, so the lock is not released until every record in the ResultSet has been collected.

> Yes. I'm not sure whether your source table has a primary key. Because you didn't set a partition column, there may be only one split, and the savepoint then waits for the currently executing split to finish. So we need to see some key information from the Zeta engine logs.

Yes, it is just one giant split, because I didn't set any special configuration, and the table has only two fields, ID and NAME, with no keys at all.

Jetiaime avatar May 09 '24 09:05 Jetiaime

Should there be a limit to keep a giant split from holding the checkpoint lock for a long time? @hailin0 @Hisoka-X

Jetiaime avatar May 09 '24 09:05 Jetiaime

The minimum granularity of a savepoint is a split. If you perform a savepoint in the middle of reading a split, the state file you get may be wrong.
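
A simplified model of split-granularity state, as a sketch only: SplitStateSketch and its methods are hypothetical and do not mirror SeaTunnel's real state classes. The saved state is just the list of splits not yet started and carries no row offset, so it is only consistent when taken between splits.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class SplitStateSketch {

    private final Deque<String> pending;
    private final List<String> emitted = new ArrayList<>();

    public SplitStateSketch(List<String> splits) {
        this.pending = new ArrayDeque<>(splits);
    }

    /** Reads one whole split; rows are modelled as "<split>#<n>". */
    public boolean readNextSplit(int rowsPerSplit) {
        String split = pending.poll();
        if (split == null) {
            return false;
        }
        for (int i = 0; i < rowsPerSplit; i++) {
            emitted.add(split + "#" + i);
        }
        return true;
    }

    /** State taken between splits: only the splits not yet started. */
    public List<String> snapshotState() {
        return new ArrayList<>(pending);
    }

    public List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        SplitStateSketch first = new SplitStateSketch(List.of("a", "b", "c"));
        first.readNextSplit(2);
        List<String> state = first.snapshotState();

        // Restore: a new reader starts from the saved pending splits.
        SplitStateSketch restored = new SplitStateSketch(state);
        while (restored.readNextSplit(2)) {
            // keep reading until no split is left
        }
        System.out.println(first.emitted() + " then " + restored.emitted());
    }
}
```

In this model a snapshot taken halfway through split "a" could only record "a" as either pending or finished, so a restore would re-emit or skip rows. With a single giant split there is no boundary to snapshot at until the whole table has been read.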

Hisoka-X avatar May 09 '24 09:05 Hisoka-X

> The minimum granularity of a savepoint is a split. If you perform a savepoint in the middle of reading a split, the state file you get may be wrong.

So there is no way to take a savepoint or checkpoint for a table that has no keys but a huge number of records?

Jetiaime avatar May 09 '24 10:05 Jetiaime

> The minimum granularity of a savepoint is a split. If you perform a savepoint in the middle of reading a split, the state file you get may be wrong.

> So there is no way to take a savepoint or checkpoint for a table that has no keys but a huge number of records?

Just cancel it. Restore cannot do anything even if you get the right state file.

Hisoka-X avatar May 09 '24 10:05 Hisoka-X