clickhouse-sink-connector icon indicating copy to clipboard operation
clickhouse-sink-connector copied to clipboard

The sink-connector lightweight stops replicating

Open aadant opened this issue 1 year ago • 20 comments

last log before it stops :

154629410 2023-11-16 22:27:30.332 [pool-15502-thread-5] INFO com.altinity.clickhouse.sink.connector.db.DbWriter  - *************** EXECUTED BATCH Successfully Records: 12************** task(0) Thread ID: pool-15502-thread-5
Nov 16, 2023 10:29:15 PM com.github.shyiko.mysql.binlog.BinaryLogClient$5 run
INFO: Keepalive: Trying to restore lost connection to dbl1:3306

Restarting the container fixes the issue.

aadant avatar Nov 17 '23 16:11 aadant

Steps to reproduce: Stop MySQL server when sink connector is already running and replicating, The exception is not handled.

subkanthi avatar Nov 21 '23 18:11 subkanthi

The issue comes from here https://github.com/shyiko/mysql-binlog-connector-java/blob/dd710a5466381faa57442977b24fceff56a0820e/src/main/java/com/github/shyiko/mysql/binlog/BinaryLogClient.java#L830

aadant avatar Nov 27 '23 06:11 aadant

I suspect that there is a deadlock and that the connection waits forever to close. This can be confirmed by a java thread dump.

The solution found for such a situation is to restart debezium when no changes are detected for a while : https://github.com/airbytehq/airbyte/pull/24166/commits/e48492e477d71bea88d9315fb21013698d8891ff

aadant avatar Nov 27 '23 06:11 aadant

see also https://github.com/osheroff/mysql-binlog-connector-java/issues/86

aadant avatar Nov 27 '23 22:11 aadant

Store the timestamp of the last record and if its more than 5/10 mins then restart Debezium Engine.

subkanthi avatar Nov 27 '23 22:11 subkanthi

root cause traced to this issue https://github.com/osheroff/mysql-binlog-connector-java/issues/133 (reporter)

aadant avatar Nov 29 '23 02:11 aadant

looks like a OpenJDK 11 bug, @subkanthi let us upgrade !

aadant avatar Dec 01 '23 03:12 aadant

Reproduced with OpenJDK 17 see https://github.com/osheroff/mysql-binlog-connector-java/issues/133

aadant avatar Dec 08 '23 00:12 aadant

The problem continues after the fix ...

aadant avatar Dec 24 '23 00:12 aadant

From analyzing the thread dump provided, the cause seems to be because of high number of GC threads

 There are 179 Garbage Collection(GC) Threads. This is a very high number of GC Threads. It may affect the application's performance. You may consider reducing GC threads.

https://blog.fastthread.io/2015/09/02/thread-dump-analysis-pattern-several-scavengers/

As shown in the image there are only 15 threads that are created by sink connector image image

subkanthi avatar Jan 12 '24 19:01 subkanthi

There are also 19 threads that are blocked in TIMED_WAITING state

main
PRIORITY : 5

THREAD ID : 0X00007FFFF0024C40

NATIVE ID : 0X7

NATIVE ID (DECIMAL) : 7

STATE : TIMED_WAITING


stackTrace:
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for <0x00007fdfd03b7508> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:252)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/AbstractQueuedSynchronizer.java:1757)
at io.debezium.connector.base.ChangeEventQueue.poll(ChangeEventQueue.java:275)
at io.debezium.connector.mysql.MySqlConnectorTask.doPoll(MySqlConnectorTask.java:239)
at io.debezium.connector.common.BaseSourceTask.poll(BaseSourceTask.java:180)
at io.debezium.embedded.EmbeddedEngine.pollRecords(EmbeddedEngine.java:695)
at io.debezium.embedded.EmbeddedEngine.run(EmbeddedEngine.java:475)
at io.debezium.embedded.ConvertingEngineBuilder$2.run(ConvertingEngineBuilder.java:229)
at com.altinity.clickhouse.debezium.embedded.cdc.DebeziumChangeEventCapture.setupDebeziumEventCapture(DebeziumChangeEventCapture.java:483)
at com.altinity.clickhouse.debezium.embedded.cdc.DebeziumChangeEventCapture.setup(DebeziumChangeEventCapture.java:512)
at com.altinity.clickhouse.debezium.embedded.ClickHouseDebeziumEmbeddedApplication.start(ClickHouseDebeziumEmbeddedApplication.java:212)
at com.altinity.clickhouse.debezium.embedded.ClickHouseDebeziumEmbeddedApplication.main(ClickHouseDebeziumEmbeddedApplication.java:88)

subkanthi avatar Jan 12 '24 20:01 subkanthi

https://docs.oracle.com/en/java/javase/17/gctuning/parallel-collector1.html#GUID-5A7866BE-59DF-44AD-B51A-274DE3F2BF59

subkanthi avatar Jan 16 '24 20:01 subkanthi

Workaround Solution: These two configuration variables restart.event.loop and restart.event.loop.timeout.period.secs define the restart logic. The debezium event loop will be restarted when the last timestamp of the record received exceeds the restart.event.loop.timeout.period.secs

Also the debezium event loop is restarted when the timestamp of the record stored in altinity_sink_connector.replica_source_info exceeds the threshold.

subkanthi avatar Feb 07 '24 19:02 subkanthi

The issue was manually verified by the Altinity QA team and marked as qa-verified.

Build used for testing: altinityinfra/clickhouse-sink-connector:464-c979a8c2583ff701dd5b63efb8f36f7fff8111e7-lt

We've verified that setting restart.event.loop to true and specifying seconds values in restart.event.loop.timeout.period.secs configuration results in a CDC event loop restart with a corresponding log message being shown when the timeout is reached.

Values specified in restart.event.loop.timeout.period.secs configuration are respected. The usage of these configurations doesn't seem to break any other clickhouse-sink-connector-lightweight functionality.

Selfeer avatar Feb 26 '24 19:02 Selfeer

@subkanthi a workaround is in place but this still impacts the sink-connector latency (up to restart.event.loop.timeout.period.secs), IMHO the root cause should be chased down with Debezium folks and possibly the binary log library they use.

aadant avatar Apr 01 '24 18:04 aadant

Analyzing the latest thread dump, all the 8 threads in change-event-source-coordinator are in TIMED_WAITING state.

debezium-mysqlconnector-embeddedconnector-change-event-source-coordinator
PRIORITY : 5

THREAD ID : 0X00007FDDF07899B0

NATIVE ID : 0X10A

NATIVE ID (DECIMAL) : 266

STATE : TIMED_WAITING


stackTrace:
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for <0x00007fdfd04475f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:252)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:1672)
at java.util.concurrent.ThreadPoolExecutor.awaitTermination([email protected]/ThreadPoolExecutor.java:1464)
at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination([email protected]/Executors.java:743)
at com.github.shyiko.mysql.binlog.BinaryLogClient.awaitTerminationInterruptibly(BinaryLogClient.java:1337)
at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateKeepAliveThread(BinaryLogClient.java:1329)
at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnect(BinaryLogClient.java:1314)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:1001)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:74)
at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:272)
at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:197)
at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:137)
at io.debezium.pipeline.ChangeEventSourceCoordinator$$Lambda$727/0x00000008010f77b8.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:539)
at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
at java.lang.Thread.run([email protected]/Thread.java:833)

subkanthi avatar Apr 02 '24 16:04 subkanthi

Similar issue in debezium zulipchat https://debezium.zulipchat.com/#narrow/stream/348104-community-mysql/topic/2.2E5.2E1.2C.20stuck.20indefinitely.20until.20connector.20restart

Alexandru Cristu
3:46 AM
seems to happen after we see this: com.github.shyiko.mysql.binlog.BinaryLogClient$5 run
INFO: Keepalive: Trying to restore lost connection to ...

Task thread is waiting in doPoll:

"task-thread-source_65afd9cf169951285b70c4a1-0" #79816 prio=5 os_prio=0 cpu=2977135.32ms elapsed=79038.87s tid=0x00007f69b4205540 nid=0x13922 waiting on condition  [0x00007f6958504000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for  <0x00000005ad829900> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:252)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/AbstractQueuedSynchronizer.java:1757)
    at io.debezium.connector.base.ChangeEventQueue.poll(ChangeEventQueue.java:275)
    at io.debezium.connector.mysql.MySqlConnectorTask.doPoll(MySqlConnectorTask.java:252)
    at io.debezium.connector.common.BaseSourceTask.poll(BaseSourceTask.java:182)
    at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.poll(AbstractWorkerSourceTask.java:469)
    at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.execute(AbstractWorkerSourceTask.java:357)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:204)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:259)
    at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.run(AbstractWorkerSourceTask.java:77)
    at org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:181)
    at org.apache.kafka.connect.runtime.isolation.Plugins$$Lambda$953/0x00007f6a65b1cc50.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:539)
    at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
    at java.lang.Thread.run([email protected]/Thread.java:840)
The "change-event-source-coordinator" thread is stuck in BinaryLogClient.disconnect, which in turn is stuck waiting for the keepalive thread to finish:

"debezium-mysqlconnector-source_65afd9cf169951285b70c4a1-change-event-source-coordinator" #397 prio=5 os_prio=0 cpu=63726.27ms elapsed=482672.61s tid=0x00007f6a54458510 nid=0x1fd waiting on condition  [0x00007f6961e9c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x000000050c001a90> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:252)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:1672)
        at java.util.concurrent.ThreadPoolExecutor.awaitTermination([email protected]/ThreadPoolExecutor.java:1464)
        at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination([email protected]/Executors.java:743)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.awaitTerminationInterruptibly(BinaryLogClient.java:1337)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateKeepAliveThread(BinaryLogClient.java:1329)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnect(BinaryLogClient.java:1314)
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:1001)
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:74)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:272)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:197)
        at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:137)
        at io.debezium.pipeline.ChangeEventSourceCoordinator$$Lambda$1777/0x00007f6a660beac0.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:539)
        at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
        at java.lang.Thread.run([email protected]/Thread.java:840)
The keepalive thread is stuck waiting for SSLSocket to close:

"blc-keepalive-production-aurora-80-cluster.cluster-cfqdxwvdz9bt.us-east-1.rds.amazonaws.com:3306" #401 prio=5 os_prio=0 cpu=56.99ms elapsed=569119.58s tid=0x00007f69d8079dc0 nid=0x201 waiting on condition  [0x00007f69624a3000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for  <0x000000050c032238> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:715)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:938)
    at java.util.concurrent.locks.ReentrantLock$Sync.lock([email protected]/ReentrantLock.java:153)
    at java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:322)
    at sun.security.ssl.SSLSocketImpl.closeNotify([email protected]/SSLSocketImpl.java:735)
    at sun.security.ssl.TransportContext.closeNotify([email protected]/TransportContext.java:287)
    at sun.security.ssl.TransportContext.initiateOutboundClose([email protected]/TransportContext.java:598)
    at sun.security.ssl.TransportContext.closeOutbound([email protected]/TransportContext.java:582)
    at sun.security.ssl.SSLSocketImpl.shutdownOutput([email protected]/SSLSocketImpl.java:875)
    at com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close(PacketChannel.java:125)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel(BinaryLogClient.java:1361)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1345)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.access$800(BinaryLogClient.java:96)
    at com.github.shyiko.mysql.binlog.BinaryLogClient$5.run(BinaryLogClient.java:894)
    at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:539)
    at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
    at java.lang.Thread.run([email protected]/Thread.java:840)
Restarting the connector caused streaming to resume, the above threads remain stuck. Restart is currently needed every few hours, but it seems to become more frequent, at some point will need to restart the JVM to clean up these leaked threads.

Investigating potentially related JDK issue https://bugs.openjdk.org/browse/JDK-8241239?focusedId=14646711&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14646711

Any suggestions how to troubleshoot this further? Have you seen this before?

subkanthi avatar Apr 02 '24 16:04 subkanthi

Debezium issue - https://issues.redhat.com/browse/DBZ-7570

subkanthi avatar Apr 02 '24 16:04 subkanthi

said:

It's on my radar for 2.7.0.Alpha1 Thanks Chris, is there any workarounds for now, like mysql version change maybe as discussed in the other thread?

Chris Cranford (Naros): said:

Thanks Chris, is there any workarounds for now, like mysql version change maybe as discussed in the other thread?

Unfortunately nothing really at this time.

subkanthi avatar Apr 02 '24 17:04 subkanthi

Even though debezium team addressed the issue in 2.7.0 https://issues.redhat.com/browse/DBZ-7570, they have introduced another bug https://issues.redhat.com/browse/DBZ-7873

subkanthi avatar May 26 '24 21:05 subkanthi