mysql-binlog-connector-java icon indicating copy to clipboard operation
mysql-binlog-connector-java copied to clipboard

Deadlock when connection try to disconnect

Open aadant opened this issue 1 year ago • 6 comments

"blc-fpif1-txnrepodbl1:3306" #31833 prio=5 os_prio=0 cpu=3676437.42ms elapsed=146955.63s tid=0x00007fe9bc001800 nid=0x7e58 runnable  [0x00007feb830d3000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketOutputStream.socketWrite0([email protected]/Native Method)
        at java.net.SocketOutputStream.socketWrite([email protected]/SocketOutputStream.java:110)
        at java.net.SocketOutputStream.write([email protected]/SocketOutputStream.java:150)
        at sun.security.ssl.SSLSocketOutputRecord.flush([email protected]/SSLSocketOutputRecord.java:271)
        at sun.security.ssl.OutputRecord.changeWriteCiphers([email protected]/OutputRecord.java:224)
        at sun.security.ssl.KeyUpdate$KeyUpdateProducer.produce([email protected]/KeyUpdate.java:321)
        at sun.security.ssl.KeyUpdate$KeyUpdateKickstartProducer.produce([email protected]/KeyUpdate.java:171)
        at sun.security.ssl.SSLHandshake.kickstart([email protected]/SSLHandshake.java:538)
        at sun.security.ssl.PostHandshakeContext.kickstart([email protected]/PostHandshakeContext.java:68)
        at sun.security.ssl.TransportContext.kickstart([email protected]/TransportContext.java:233)
        at sun.security.ssl.SSLSocketImpl.startHandshake([email protected]/SSLSocketImpl.java:449)
        at sun.security.ssl.SSLSocketImpl.startHandshake([email protected]/SSLSocketImpl.java:427)
        at sun.security.ssl.SSLSocketImpl.tryKeyUpdate([email protected]/SSLSocketImpl.java:1549)
        at sun.security.ssl.SSLSocketImpl.decode([email protected]/SSLSocketImpl.java:1521)
        at sun.security.ssl.SSLSocketImpl.readApplicationRecord([email protected]/SSLSocketImpl.java:1476)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read([email protected]/SSLSocketImpl.java:1065)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read([email protected]/SSLSocketImpl.java:972)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:226)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:207)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readInteger(ByteArrayInputStream.java:52)
        at io.debezium.connector.mysql.RowDeserializers.deserializeDate(RowDeserializers.java:298)
        at io.debezium.connector.mysql.RowDeserializers$WriteRowsDeserializer.deserializeDate(RowDeserializers.java:215)
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeCell(AbstractRowsEventDataDeserializer.java:171)
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeRow(AbstractRowsEventDataDeserializer.java:143)
        at com.github.shyiko.mysql.binlog.event.deserialization.WriteRowsEventDataDeserializer.deserializeRows(WriteRowsEventDataDeserializer.java:64)
        at com.github.shyiko.mysql.binlog.event.deserialization.WriteRowsEventDataDeserializer.deserialize(WriteRowsEventDataDeserializer.java:56)
        at com.github.shyiko.mysql.binlog.event.deserialization.WriteRowsEventDataDeserializer.deserialize(WriteRowsEventDataDeserializer.java:32)
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.deserializeEventData(EventDeserializer.java:337)
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.nextEvent(EventDeserializer.java:246)
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$1.nextEvent(MySqlStreamingChangeEventSource.java:238)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1068)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:648)
        at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:949)
        at java.lang.Thread.run([email protected]/Thread.java:829)

"blc-keepalive-fpif1-txnrepodbl1:3306" #31835 prio=5 os_prio=0 cpu=53.87ms elapsed=146955.62s tid=0x00007fe9b8001000 nid=0x7e5a waiting on condition  [0x00007feb8190c000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00007ff27751cfa0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt([email protected]/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued([email protected]/AbstractQueuedSynchronizer.java:917)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:1240)
        at java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:267)
        at sun.security.ssl.SSLSocketImpl.closeNotify([email protected]/SSLSocketImpl.java:732)
        at sun.security.ssl.TransportContext.closeNotify([email protected]/TransportContext.java:262)
        at sun.security.ssl.TransportContext.initiateOutboundClose([email protected]/TransportContext.java:573)
        at sun.security.ssl.TransportContext.closeOutbound([email protected]/TransportContext.java:557)
        at sun.security.ssl.SSLSocketImpl.shutdownOutput([email protected]/SSLSocketImpl.java:871)
        at com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close(PacketChannel.java:125)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel(BinaryLogClient.java:1361)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1345)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.access$800(BinaryLogClient.java:96)
        at com.github.shyiko.mysql.binlog.BinaryLogClient$5.run(BinaryLogClient.java:894)
        at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:515)
        at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run([email protected]/Thread.java:829)

message in the log file

INFO: Keepalive: Trying to restore lost connection to fpif1-txnrepodbl1:3306

waiting forever

Given the code, it looks like this affects master ([0.28.3])

aadant avatar Nov 29 '23 02:11 aadant

Hi, so I'm wondering, based on looking at the JDK11 bug history, if this may not be related to some past issues with JDK11 and SSL connections hitting a deadlock scenario where you have this unique combination of the socket shutting down:

https://bugs.openjdk.org/browse/JDK-8207004 https://bugs.openjdk.org/browse/JDK-8207009

Can you please share what JDK vendor you're using? Would it be possible to upgrade to 11.0.20 and see whether this problem goes away, or better yet move to JDK17 as JDK11 is being EoL in October 2024?

Naros avatar Nov 30 '23 04:11 Naros

@Naros thanks, looks like it, indeed we use this

# java --version
openjdk 11.0.16 2022-07-19
OpenJDK Runtime Environment 18.9 (build 11.0.16+8)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.16+8, mixed mode, sharing)

aadant avatar Nov 30 '23 16:11 aadant

@Naros unfortunately the issue continues to happen with OpenJDK 17 ...

bash-4.4# java --version
openjdk 17.0.2 2022-01-18
OpenJDK Runtime Environment (build 17.0.2+8-86)
OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)

aadant avatar Dec 06 '23 04:12 aadant

stack trace involving shyiko :

"blc-fpif1-txnrepodbl1:3306" #77 prio=5 os_prio=0 cpu=3406817.79ms elapsed=71952.46s tid=0x00007feb4423f7a0 nid=0x11a runnable  [0x00007feb5affd000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcherImpl.write0([email protected]/Native Method)
        at sun.nio.ch.SocketDispatcher.write([email protected]/SocketDispatcher.java:62)
        at sun.nio.ch.NioSocketImpl.tryWrite([email protected]/NioSocketImpl.java:398)
        at sun.nio.ch.NioSocketImpl.implWrite([email protected]/NioSocketImpl.java:413)
        at sun.nio.ch.NioSocketImpl.write([email protected]/NioSocketImpl.java:440)
        at sun.nio.ch.NioSocketImpl$2.write([email protected]/NioSocketImpl.java:826)
        at java.net.Socket$SocketOutputStream.write([email protected]/Socket.java:1035)
        at sun.security.ssl.SSLSocketOutputRecord.flush([email protected]/SSLSocketOutputRecord.java:271)
        at sun.security.ssl.OutputRecord.changeWriteCiphers([email protected]/OutputRecord.java:224)
        at sun.security.ssl.KeyUpdate$KeyUpdateProducer.produce([email protected]/KeyUpdate.java:321)
        at sun.security.ssl.KeyUpdate$KeyUpdateKickstartProducer.produce([email protected]/KeyUpdate.java:171)
        at sun.security.ssl.SSLHandshake.kickstart([email protected]/SSLHandshake.java:542)
        at sun.security.ssl.PostHandshakeContext.kickstart([email protected]/PostHandshakeContext.java:69)
        at sun.security.ssl.TransportContext.kickstart([email protected]/TransportContext.java:251)
        at sun.security.ssl.SSLSocketImpl.startHandshake([email protected]/SSLSocketImpl.java:443)
        at sun.security.ssl.SSLSocketImpl.startHandshake([email protected]/SSLSocketImpl.java:421)
        at sun.security.ssl.SSLSocketImpl.tryKeyUpdate([email protected]/SSLSocketImpl.java:1543)
        at sun.security.ssl.SSLSocketImpl.decode([email protected]/SSLSocketImpl.java:1515)
        at sun.security.ssl.SSLSocketImpl.readApplicationRecord([email protected]/SSLSocketImpl.java:1474)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read([email protected]/SSLSocketImpl.java:1059)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:266)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:245)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.fill(ByteArrayInputStream.java:112)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:105)
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeNewDecimal(AbstractRowsEventDataDeserializer.java:258)
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeCell(AbstractRowsEventDataDeserializer.java:169)
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeRow(AbstractRowsEventDataDeserializer.java:143)
        at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserializeRows(UpdateRowsEventDataDeserializer.java:72)
        at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserialize(UpdateRowsEventDataDeserializer.java:58)
        at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserialize(UpdateRowsEventDataDeserializer.java:33)
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.deserializeEventData(EventDeserializer.java:337)
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.nextEvent(EventDeserializer.java:246)
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$1.nextEvent(MySqlStreamingChangeEventSource.java:238)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1068)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:648)
        at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:949)
        at java.lang.Thread.run([email protected]/Thread.java:833)

"blc-keepalive-fpif1-txnrepodbl1:3306" #79 prio=5 os_prio=0 cpu=30.29ms elapsed=71952.43s tid=0x00007feb484dd850 nid=0x11c waiting on condition  [0x00007feb5aefd000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x00007fefd423e348> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:715)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire([email protected]/AbstractQueuedSynchronizer.java:938)
        at java.util.concurrent.locks.ReentrantLock$Sync.lock([email protected]/ReentrantLock.java:153)
        at java.util.concurrent.locks.ReentrantLock.lock([email protected]/ReentrantLock.java:322)
        at sun.security.ssl.SSLSocketImpl.closeNotify([email protected]/SSLSocketImpl.java:726)
        at sun.security.ssl.TransportContext.closeNotify([email protected]/TransportContext.java:280)
        at sun.security.ssl.TransportContext.initiateOutboundClose([email protected]/TransportContext.java:591)
        at sun.security.ssl.TransportContext.closeOutbound([email protected]/TransportContext.java:575)
        at sun.security.ssl.SSLSocketImpl.shutdownOutput([email protected]/SSLSocketImpl.java:865)
        at com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close(PacketChannel.java:125)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel(BinaryLogClient.java:1361)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1345)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.access$800(BinaryLogClient.java:96)
        at com.github.shyiko.mysql.binlog.BinaryLogClient$5.run(BinaryLogClient.java:894)
        at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:539)
        at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
        at java.lang.Thread.run([email protected]/Thread.java:833)

aadant avatar Dec 06 '23 04:12 aadant

Hi @aadant, can you describe the steps on how to reproduce it reliably?

Naros avatar Dec 07 '23 05:12 Naros

It happens in the context of debezium embedded (Altinity sink connector). The problem occurs several times a week. There is an open issue there. It is a random issue though.

aadant avatar Dec 07 '23 13:12 aadant

Hi, bringing this topic up again because we have encountered it in our company.

Given: AWS Aurora 3, AWS MSK, Confluent Kafka Connect 7.6.1, Debezium 2.6.1. openjdk version "11.0.22" 2024-01-16 LTS OpenJDK Runtime Environment Zulu11.70+15-CA (build 11.0.22+7-LTS) OpenJDK 64-Bit Server VM Zulu11.70+15-CA (build 11.0.22+7-LTS, mixed mode)

Problems started after MySQL upgrade from version 5.7 to 8 i.e. Aurora 2 -> 3. In the logs we see the message Keepalive: Trying to restore lost connection to xxxx-xxxx-xxx.rds.amazonaws.com:3306 and silence. First of all I tried setting the use.nongraceful.disconnect flag, but that had no effect. Then I started digging deeper. I modified your connector code a bit, added logs to it after each step and found two things:

  1. For some reason, setting the use.nongraceful.disconnect flag in the Debezium connector does not cause the useNonGracefulDisconnect field to be set to the appropriate value. I verified this by adding the output of the field values to the log here. So I just hardcoded true wherever that flag is used.
  2. The hang occurs when attempting a half-close here, and the workaround implemented here added setting SO_LINGER after the half-close attempt but before closing. I'm not an expert in TCP, but I assumed that this option had something to do with the half-close, so I moved this block up to the beginning of the close() method and it worked, the connection immediately started to close and the connector re-established the connection.

About reliably reproducing the bug. Based on what I described above, namely that the hang happens on half-close, I'm not sure how it can be reproduced. In our case, Aurora DB sometimes has such extreme loads that it just dies for a few minutes, at which point Debezium stops getting heartbeats in response and tries to restart BinaryLogClient. In the case of a read socket close, an SSLException occurs:

javax.net.ssl.SSLException: closing inbound before receiving peer's close_notify
	at java.base/sun.security.ssl.SSLSocketImpl.shutdownInput(SSLSocketImpl.java:848)
	at java.base/sun.security.ssl.SSLSocketImpl.shutdownInput(SSLSocketImpl.java:826)
	at com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close(PacketChannel.java:141)
	at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel(BinaryLogClient.java:1386)
	at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1361)
	at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1356)
	at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnect(BinaryLogClient.java:1327)
	at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$ReaderThreadLifecycleListener.onCommunicationFailure(MySqlStreamingChangeEventSource.java:1132)
	at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1106)
	at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:655)
	at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:958)
	at java.base/java.lang.Thread.run(Thread.java:829)

it is ignored and proceeds to a write socket close, where the endless waiting takes place, because in our case the server is in a coma and not responding. How to reproduce this, I don't know. I hope my experiment will give someone useful insights.

P.S. Recommendation from me: please add more logs, at least with DEBUG level.

comrada avatar Jun 05 '24 12:06 comrada

@comrada Debezium 2.6.1 contains 0.29.1 of binlog client. The issue should be fixed in 0.29.2. Could you please try to use it in yor deployment if it helps with the problem?

jpechane avatar Jun 07 '24 11:06 jpechane

Hi @comrada , does the 0.29.2 work for you? I'm facing a similar issue after upgrading to Aurora v3.

phamlehieu avatar Jun 13 '24 04:06 phamlehieu

@jpechane no, 0.29.2 does not fix this problem, because as I wrote in my comment, the infinite wait in my case happens while trying to close the socket for writing, and the 0.29.2 fix sets the SO_LINGER property after this call. In my fix (based on 0.29.2), I moved the property setting to the beginning of the close() method and it works for me. image

comrada avatar Jun 24 '24 09:06 comrada

Hi @comrada We're also on Aurora v3 and running into the same deadlock issue, although I'm not sure if the disconnect happens because of aurora dying sometimes like you mentioned in your case. For this change you mention though, closing the socket without linger is generally discouraged from what I've read online. So I guess this should be directed towards the people that implemented this initially to address this issue.

But I wanted to ask if this can lead to any data loss? Has this fix been working fine for you ?

Vittalgc avatar Jul 10 '24 05:07 Vittalgc