clickhouse-sink-connector
The sink-connector lightweight stops replicating
Last log before it stops:
154629410 2023-11-16 22:27:30.332 [pool-15502-thread-5] INFO com.altinity.clickhouse.sink.connector.db.DbWriter - *************** EXECUTED BATCH Successfully Records: 12************** task(0) Thread ID: pool-15502-thread-5
Nov 16, 2023 10:29:15 PM com.github.shyiko.mysql.binlog.BinaryLogClient$5 run
INFO: Keepalive: Trying to restore lost connection to dbl1:3306
Restarting the container fixes the issue.
Steps to reproduce: stop the MySQL server while the sink connector is already running and replicating. The resulting exception is not handled.
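For example, if the source MySQL server runs in Docker (the container name below is only a placeholder for your setup), the outage can be simulated with:
docker stop mysql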
The issue comes from here https://github.com/shyiko/mysql-binlog-connector-java/blob/dd710a5466381faa57442977b24fceff56a0820e/src/main/java/com/github/shyiko/mysql/binlog/BinaryLogClient.java#L830
I suspect that there is a deadlock and that the connection waits forever to close. This can be confirmed with a Java thread dump.
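For reference, a thread dump of the running connector JVM can be captured with the standard JDK tools (the <pid> placeholder is the connector's Java process id, e.g. found via jps):
jcmd <pid> Thread.print > threads.txt
jstack -l <pid> > threads.txt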
The solution found for such a situation is to restart Debezium when no changes are detected for a while: https://github.com/airbytehq/airbyte/pull/24166/commits/e48492e477d71bea88d9315fb21013698d8891ff
see also https://github.com/osheroff/mysql-binlog-connector-java/issues/86
Store the timestamp of the last record and, if it is older than 5-10 minutes, restart the Debezium engine.
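A minimal sketch of that idea (illustrative only, not the connector's actual implementation; startEngine()/stopEngine() are placeholders for the real embedded-engine lifecycle calls):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Watchdog: remember when the last change record arrived and restart the
// embedded engine when nothing has been received for `timeoutSecs`.
public class EventLoopWatchdog {

    private final AtomicLong lastRecordMillis = new AtomicLong(System.currentTimeMillis());
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long timeoutSecs;

    public EventLoopWatchdog(long timeoutSecs) {
        this.timeoutSecs = timeoutSecs;
    }

    // Call from the change-event consumer for every record received.
    public void recordSeen() {
        lastRecordMillis.set(System.currentTimeMillis());
    }

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            long idleSecs = (System.currentTimeMillis() - lastRecordMillis.get()) / 1000;
            if (idleSecs > timeoutSecs) {
                System.out.println("No records for " + idleSecs + "s, restarting embedded engine");
                stopEngine();   // placeholder: close the stuck engine instance
                startEngine();  // placeholder: build and run a fresh engine
                lastRecordMillis.set(System.currentTimeMillis());
            }
        }, timeoutSecs, timeoutSecs, TimeUnit.SECONDS);
    }

    private void stopEngine()  { /* engine.close() in the real connector */ }
    private void startEngine() { /* rebuild and submit the engine runnable */ }
}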
Root cause traced to this issue: https://github.com/osheroff/mysql-binlog-connector-java/issues/133 (reporter)
Looks like an OpenJDK 11 bug, @subkanthi let's upgrade!
Reproduced with OpenJDK 17, see https://github.com/osheroff/mysql-binlog-connector-java/issues/133
The problem continues after the fix ...
From analyzing the thread dump provided, the cause seems to be a high number of GC threads.
There are 179 Garbage Collection (GC) threads, which is a very high number and may affect the application's performance. Consider reducing the number of GC threads.
https://blog.fastthread.io/2015/09/02/thread-dump-analysis-pattern-several-scavengers/
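For example, the number of parallel GC threads can be capped explicitly at JVM startup (the jar name below is only a placeholder for how the connector is launched in your deployment):
java -XX:ParallelGCThreads=4 -jar clickhouse-sink-connector-lightweight.jar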
As shown in the image, only 15 threads are created by the sink connector.
There are also 19 threads blocked in the TIMED_WAITING state.
main
PRIORITY : 5
THREAD ID : 0X00007FFFF0024C40
NATIVE ID : 0X7
NATIVE ID (DECIMAL) : 7
STATE : TIMED_WAITING
stackTrace:
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base/Native Method)
- parking to wait for <0x00007fdfd03b7508> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base/LockSupport.java:252)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base/AbstractQueuedSynchronizer.java:1757)
at io.debezium.connector.base.ChangeEventQueue.poll(ChangeEventQueue.java:275)
at io.debezium.connector.mysql.MySqlConnectorTask.doPoll(MySqlConnectorTask.java:239)
at io.debezium.connector.common.BaseSourceTask.poll(BaseSourceTask.java:180)
at io.debezium.embedded.EmbeddedEngine.pollRecords(EmbeddedEngine.java:695)
at io.debezium.embedded.EmbeddedEngine.run(EmbeddedEngine.java:475)
at io.debezium.embedded.ConvertingEngineBuilder$2.run(ConvertingEngineBuilder.java:229)
at com.altinity.clickhouse.debezium.embedded.cdc.DebeziumChangeEventCapture.setupDebeziumEventCapture(DebeziumChangeEventCapture.java:483)
at com.altinity.clickhouse.debezium.embedded.cdc.DebeziumChangeEventCapture.setup(DebeziumChangeEventCapture.java:512)
at com.altinity.clickhouse.debezium.embedded.ClickHouseDebeziumEmbeddedApplication.start(ClickHouseDebeziumEmbeddedApplication.java:212)
at com.altinity.clickhouse.debezium.embedded.ClickHouseDebeziumEmbeddedApplication.main(ClickHouseDebeziumEmbeddedApplication.java:88)
https://docs.oracle.com/en/java/javase/17/gctuning/parallel-collector1.html#GUID-5A7866BE-59DF-44AD-B51A-274DE3F2BF59
Workaround Solution:
Two configuration variables, restart.event.loop and restart.event.loop.timeout.period.secs, define the restart logic.
The Debezium event loop is restarted when the time since the last received record exceeds restart.event.loop.timeout.period.secs.
The Debezium event loop is also restarted when the age of the record timestamp stored in altinity_sink_connector.replica_source_info exceeds the threshold.
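A minimal sketch of how this might look in the connector's YAML configuration (the 300-second value is only an example; exact quoting/format may differ per deployment):
restart.event.loop: "true"
restart.event.loop.timeout.period.secs: "300"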
The issue was manually verified by the Altinity QA team and marked as qa-verified.
Build used for testing: altinityinfra/clickhouse-sink-connector:464-c979a8c2583ff701dd5b63efb8f36f7fff8111e7-lt
We've verified that setting restart.event.loop to true and specifying a value in seconds for restart.event.loop.timeout.period.secs results in a CDC event loop restart, with a corresponding log message, when the timeout is reached.
Values specified in restart.event.loop.timeout.period.secs are respected. Using these configurations doesn't seem to break any other clickhouse-sink-connector-lightweight functionality.
@subkanthi a workaround is in place, but it still impacts sink-connector latency (by up to restart.event.loop.timeout.period.secs). IMHO the root cause should be chased down with the Debezium folks and possibly the binary log library they use.
Analyzing the latest thread dump, all 8 change-event-source-coordinator threads are in the TIMED_WAITING state.
debezium-mysqlconnector-embeddedconnector-change-event-source-coordinator
PRIORITY : 5
THREAD ID : 0X00007FDDF07899B0
NATIVE ID : 0X10A
NATIVE ID (DECIMAL) : 266
STATE : TIMED_WAITING
stackTrace:
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base/Native Method)
- parking to wait for <0x00007fdfd04475f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base/LockSupport.java:252)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base/AbstractQueuedSynchronizer.java:1672)
at java.util.concurrent.ThreadPoolExecutor.awaitTermination(java.base/ThreadPoolExecutor.java:1464)
at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(java.base/Executors.java:743)
at com.github.shyiko.mysql.binlog.BinaryLogClient.awaitTerminationInterruptibly(BinaryLogClient.java:1337)
at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateKeepAliveThread(BinaryLogClient.java:1329)
at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnect(BinaryLogClient.java:1314)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:1001)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:74)
at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:272)
at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:197)
at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:137)
at io.debezium.pipeline.ChangeEventSourceCoordinator$$Lambda$727/0x00000008010f77b8.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base/Executors.java:539)
at java.util.concurrent.FutureTask.run(java.base/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base/Thread.java:833)
Similar issue in debezium zulipchat https://debezium.zulipchat.com/#narrow/stream/348104-community-mysql/topic/2.2E5.2E1.2C.20stuck.20indefinitely.20until.20connector.20restart
Alexandru Cristu
3:46 AM
seems to happen after we see this: com.github.shyiko.mysql.binlog.BinaryLogClient$5 run
INFO: Keepalive: Trying to restore lost connection to ...
Task thread is waiting in doPoll:
"task-thread-source_65afd9cf169951285b70c4a1-0" #79816 prio=5 os_prio=0 cpu=2977135.32ms elapsed=79038.87s tid=0x00007f69b4205540 nid=0x13922 waiting on condition [0x00007f6958504000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base/Native Method)
- parking to wait for <0x00000005ad829900> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base/LockSupport.java:252)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base/AbstractQueuedSynchronizer.java:1757)
at io.debezium.connector.base.ChangeEventQueue.poll(ChangeEventQueue.java:275)
at io.debezium.connector.mysql.MySqlConnectorTask.doPoll(MySqlConnectorTask.java:252)
at io.debezium.connector.common.BaseSourceTask.poll(BaseSourceTask.java:182)
at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.poll(AbstractWorkerSourceTask.java:469)
at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.execute(AbstractWorkerSourceTask.java:357)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:204)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:259)
at org.apache.kafka.connect.runtime.AbstractWorkerSourceTask.run(AbstractWorkerSourceTask.java:77)
at org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:181)
at org.apache.kafka.connect.runtime.isolation.Plugins$$Lambda$953/0x00007f6a65b1cc50.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base/Executors.java:539)
at java.util.concurrent.FutureTask.run(java.base/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base/Thread.java:840)
The "change-event-source-coordinator" thread is stuck in BinaryLogClient.disconnect, which in turn is stuck waiting for the keepalive thread to finish:
"debezium-mysqlconnector-source_65afd9cf169951285b70c4a1-change-event-source-coordinator" #397 prio=5 os_prio=0 cpu=63726.27ms elapsed=482672.61s tid=0x00007f6a54458510 nid=0x1fd waiting on condition [0x00007f6961e9c000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base/Native Method)
- parking to wait for <0x000000050c001a90> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base/LockSupport.java:252)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base/AbstractQueuedSynchronizer.java:1672)
at java.util.concurrent.ThreadPoolExecutor.awaitTermination(java.base/ThreadPoolExecutor.java:1464)
at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(java.base/Executors.java:743)
at com.github.shyiko.mysql.binlog.BinaryLogClient.awaitTerminationInterruptibly(BinaryLogClient.java:1337)
at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateKeepAliveThread(BinaryLogClient.java:1329)
at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnect(BinaryLogClient.java:1314)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:1001)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.execute(MySqlStreamingChangeEventSource.java:74)
at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:272)
at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:197)
at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:137)
at io.debezium.pipeline.ChangeEventSourceCoordinator$$Lambda$1777/0x00007f6a660beac0.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base/Executors.java:539)
at java.util.concurrent.FutureTask.run(java.base/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base/Thread.java:840)
The keepalive thread is stuck waiting for SSLSocket to close:
"blc-keepalive-production-aurora-80-cluster.cluster-cfqdxwvdz9bt.us-east-1.rds.amazonaws.com:3306" #401 prio=5 os_prio=0 cpu=56.99ms elapsed=569119.58s tid=0x00007f69d8079dc0 nid=0x201 waiting on condition [0x00007f69624a3000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base/Native Method)
- parking to wait for <0x000000050c032238> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(java.base/LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base/AbstractQueuedSynchronizer.java:715)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base/AbstractQueuedSynchronizer.java:938)
at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base/ReentrantLock.java:153)
at java.util.concurrent.locks.ReentrantLock.lock(java.base/ReentrantLock.java:322)
at sun.security.ssl.SSLSocketImpl.closeNotify(java.base/SSLSocketImpl.java:735)
at sun.security.ssl.TransportContext.closeNotify(java.base/TransportContext.java:287)
at sun.security.ssl.TransportContext.initiateOutboundClose(java.base/TransportContext.java:598)
at sun.security.ssl.TransportContext.closeOutbound(java.base/TransportContext.java:582)
at sun.security.ssl.SSLSocketImpl.shutdownOutput(java.base/SSLSocketImpl.java:875)
at com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close(PacketChannel.java:125)
at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel(BinaryLogClient.java:1361)
at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1345)
at com.github.shyiko.mysql.binlog.BinaryLogClient.access$800(BinaryLogClient.java:96)
at com.github.shyiko.mysql.binlog.BinaryLogClient$5.run(BinaryLogClient.java:894)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base/Executors.java:539)
at java.util.concurrent.FutureTask.run(java.base/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base/Thread.java:840)
Restarting the connector caused streaming to resume, but the above threads remain stuck. A restart is currently needed every few hours, and it seems to be getting more frequent; at some point the JVM will need to be restarted to clean up these leaked threads.
Investigating potentially related JDK issue https://bugs.openjdk.org/browse/JDK-8241239?focusedId=14646711&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14646711
Any suggestions how to troubleshoot this further? Have you seen this before?
Debezium issue - https://issues.redhat.com/browse/DBZ-7570
Chris Cranford (Naros) said:
It's on my radar for 2.7.0.Alpha1
Thanks Chris, are there any workarounds for now, like a MySQL version change maybe, as discussed in the other thread?
Chris Cranford (Naros) said:
Unfortunately nothing really at this time.
Even though the Debezium team addressed the issue in 2.7.0 (https://issues.redhat.com/browse/DBZ-7570), they introduced another bug:
https://issues.redhat.com/browse/DBZ-7873