clickhouse-sink-connector
MySQL Snapshot fails at 50m rows and keeps starting over and over again
Hello there,
The lightweight version is true magic: one container and you have near real-time replication from MySQL to ClickHouse, with all the columns mapped right for you. A piece of art!
For a while now I've been debugging this issue: a java.lang.InterruptedException: null occurs at ~50 million rows, the snapshot crashes, and it keeps starting over and over:
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1638) ~[?:?]
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435) ~[?:?]
at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:200) ~[?:?]
at io.debezium.relational.RelationalSnapshotChangeEventSource.createDataEvents(RelationalSnapshotChangeEventSource.java:515) ~[app.jar:?]
at io.debezium.relational.RelationalSnapshotChangeEventSource.doExecute(RelationalSnapshotChangeEventSource.java:171) ~[app.jar:?]
at io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:96) ~[app.jar:?]
at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:285) ~[app.jar:?]
at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:269) ~[app.jar:?]
at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:192) ~[app.jar:?]
at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:143) ~[app.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:831) [?:?]
What I have tried so far:
- Increased all the timeouts in MySQL to the max, checked with:
SELECT variable_name, variable_value
FROM performance_schema.global_variables
WHERE variable_name IN ('interactive_timeout', 'wait_timeout',
                        'connect_timeout', 'net_read_timeout',
                        'net_write_timeout', 'lock_wait_timeout',
                        'innodb_lock_wait_timeout', 'max_execution_time')
ORDER BY variable_name;
- Increased memory and watched heap usage using jcmd 1 GC.heap_info inside the container, so it's not a memory usage problem.
- Used different versions (branches), compiled them, and deployed them to k8s, but still the same error.
- Tried different big tables, and it does the same, but it breaks at a different row count (the other tables were smaller).
I'm using no locks: snapshot.locking.mode: "none".
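For context, this is roughly what the snapshot-related part of my lightweight connector config.yml looks like (an illustrative excerpt: snapshot.locking.mode is the setting quoted above, snapshot.mode and snapshot.max.threads are standard Debezium properties, the values are just example choices, and connection settings are omitted):
# config.yml excerpt (illustrative)
snapshot.mode: "initial"           # take a full initial snapshot before streaming
snapshot.locking.mode: "none"      # no table locks while snapshotting
snapshot.max.threads: "4"          # Debezium parallel snapshot threads (example value)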
Related issue https://github.com/Altinity/clickhouse-sink-connector/issues/722
What I'm trying now is attaching a local IntelliJ remote debugger, but I'm having a hard time stopping on the right exception.
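In case it helps, this is roughly how the JVM can be exposed for IntelliJ's Remote JVM Debug configuration; the container name is a placeholder for my k8s deployment, while JAVA_TOOL_OPTIONS and the jdwp agent string are standard JVM mechanisms:
# Kubernetes container spec excerpt (illustrative)
containers:
  - name: clickhouse-sink-connector     # placeholder name
    env:
      - name: JAVA_TOOL_OPTIONS         # picked up automatically by the JVM at startup
        value: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005"
    ports:
      - containerPort: 5005             # attach the IntelliJ debugger to this port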
@poweroftrue for MySQL, you can take a snapshot via MySQL Shell (see the Python utilities in this project). It takes a consistent snapshot; then you can restore it to ClickHouse and just start replicating from the snapshot position, or even before.
You can dump TBs of data this way.
Thanks @aadant!
Yes, I checked the tool, and it's amazing; it dumps TBs of data very fast. I will use it after trying for a couple more weeks on this issue, because the Debezium snapshot fits my use case better.
Having the Debezium snapshot working opens the door to simpler automation: running this as a sidecar container for a large number of MySQL services.
Hello @aadant, I finally found what causes this issue!
I was checking whether certain rows were causing the interruption, but that wasn't it. I enabled all logs (DEBUG) in log4j.xml and found that Debezium was being asked to stop.
It turned out to be restart.event.loop kicking in and restarting Debezium because no events were coming through (which is normal during this kind of snapshot):
2024-09-12 03:03:11.146 INFO - Stopping the embedded engine
2024-09-12 03:03:11.146 INFO - Waiting for PT5M for connector to stop
2024-09-12 03:03:11.162 INFO - Stopping the task and engine
2024-09-12 03:03:11.162 INFO - Stopping down connector
2024-09-12 03:04:41.162 WARN - Coordinator didn't stop in the expected time, shutting down executor now
I set it to false and now it's snapshotting rows beyond 100 million.
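For anyone hitting the same thing, this is the change in the lightweight connector's config.yml (a minimal sketch; restart.event.loop is the property named above, the comment is mine):
# config.yml excerpt
restart.event.loop: "false"   # don't restart Debezium just because no events arrive during a long snapshot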
Should I add to the docs that this needs to be disabled during the snapshot?
Good catch! I think it is a bug. We should fix this as it is very misleading. The connector should only restart if it is idle (this is a workaround for a Debezium bug that should be fixed in the latest releases).
See https://github.com/Altinity/clickhouse-sink-connector/issues/380. It is recommended to replicate with a heartbeat table to know if the connector is lagging. If there are no updates, the connector will not update the replica_source_info table.
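A heartbeat setup would look roughly like this (a sketch: the schema and table names are placeholders, heartbeat.interval.ms and heartbeat.action.query are standard Debezium properties, and the table referenced by the UPDATE must already exist on the source):
# config.yml excerpt (illustrative)
heartbeat.interval.ms: "10000"                                                 # emit a heartbeat every 10 seconds
heartbeat.action.query: "UPDATE heartbeat.debezium_heartbeat SET ts = NOW()"   # writes a binlog event even when the source is otherwise idle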
@subkanthi let us fix this even if there is a workaround.