mongo-migration-stream
[Bug] Read timeout when trying to transfer index
Describe the bug
I'm trying to transfer a single collection to get started with the tool. Dump & restore of the data worked so far; however, when it comes to replicating the indexes, the following error occurs:
Sep 28 14:27:20 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:20.888 [performer-1] INFO p.a.t.m.c.t.MongoToolsTransfer - Finished transfer of collection [foo.bar from [mongodb://foo1:27017,foo2:27017,foo3:27017/foo?ssl=true&replicaSet=atlas-shard-0] to [mongodb://172.17.4.100,172.17.5.100,172.17.6.100/foo?tls=false&ssl=false&retryWrites=true&w=majority]
Sep 28 14:27:20 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:20.889 [performer-1] INFO p.a.t.m.core.index.IndexClonerImpl - Cloning all indexes for collection: [bar]
Sep 28 14:27:20 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:20.890 [performer-1] INFO p.a.t.m.i.h.LoggingStateEventHandler - IndexRebuildStartEvent(sourceToDestination=source: foo.bar, destination: foo.bar)
Sep 28 14:27:20 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:20.892 [performer-1] INFO p.a.t.m.c.s.LocalToDestinationSynchronizer - Starting LocalToDestination synchronization for collection: [source: foo.bar, destination: foo.bar]
Sep 28 14:27:20 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:20.893 [performer-1] INFO p.a.t.m.i.h.LoggingStateEventHandler - LocalToDestinationStartEvent(sourceToDestination=source: foo.bar, destination: foo.bar)
Sep 28 14:27:20 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:20.897 [performer-1] INFO p.a.t.m.i.d.SynchronizationDetectorFactory - Starting detecting synchronization...
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:30.972 [synchronizationDetector-4] WARN p.a.t.m.c.d.HashSynchronizationDetector - Unable to perform detection with hash. Cause: [com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message]
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: Exception in thread "index-cloner-foo.bar" java.util.concurrent.CompletionException: com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1770)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at java.base/java.lang.Thread.run(Thread.java:833)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: Caused by: com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:699)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.InternalStreamConnection.receiveMessageWithAdditionalTimeout(InternalStreamConnection.java:577)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:413)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:340)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:116)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:643)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:71)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:206)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:119)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:85)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:75)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.DefaultServer$OperationCountTrackingConnection.command(DefaultServer.java:293)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.operation.CommandOperationHelper.createReadCommandAndExecute(CommandOperationHelper.java:233)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.operation.CommandOperationHelper.lambda$executeRetryableRead$4(CommandOperationHelper.java:215)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.operation.OperationHelper.lambda$withSourceAndConnection$0(OperationHelper.java:356)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.operation.OperationHelper.withSuppliedResource(OperationHelper.java:381)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.operation.OperationHelper.lambda$withSourceAndConnection$1(OperationHelper.java:355)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.operation.OperationHelper.withSuppliedResource(OperationHelper.java:381)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.operation.OperationHelper.withSourceAndConnection(OperationHelper.java:354)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.operation.CommandOperationHelper.lambda$executeRetryableRead$5(CommandOperationHelper.java:213)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.async.function.RetryingSyncSupplier.get(RetryingSyncSupplier.java:67)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.operation.CommandOperationHelper.executeRetryableRead(CommandOperationHelper.java:218)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.operation.CommandOperationHelper.executeRetryableRead(CommandOperationHelper.java:199)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.operation.CommandReadOperation.execute(CommandReadOperation.java:50)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:191)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.client.internal.MongoDatabaseImpl.executeCommand(MongoDatabaseImpl.java:202)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:171)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:166)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:156)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at pl.allegro.tech.mongomigrationstream.core.index.IndexClonerImpl.createIndexOnDestinationCollection(IndexClonerImpl.kt:47)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at pl.allegro.tech.mongomigrationstream.core.index.IndexClonerImpl.cloneIndexes$lambda$3$lambda$1$lambda$0(IndexClonerImpl.kt:32)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: ... 3 more
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: Caused by: java.net.SocketTimeoutException: Read timed out
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:288)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:314)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.SocketStream.read(SocketStream.java:113)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.SocketStream.read(SocketStream.java:138)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:716)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: at com.mongodb.internal.connection.InternalStreamConnection.receiveMessageWithAdditionalTimeout(InternalStreamConnection.java:574)
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: ... 33 more
Sep 28 14:27:30 ip-172-31-28-185.eu-central-1.compute.internal java[148206]: 14:27:30.976 [synchronizationDetector-1] INFO p.a.t.m.i.h.LoggingDetectionResultHandler - Queue size detector - collection: [source: foo.bar, destination: foo.bar] queue size: [0]
To Reproduce
This is the config:
perform.transfer=true
perform.synchronization=true
perform.synchronization.handlers=LoggingSynchronizationHandler
perform.synchronization.detectors=QueueSize,CollectionCount,DbHash
perform.synchronization.validators=DbAvailability,DestinationCollectionMissing,SourceCollectionAvailable
custom.rootPath=/home/ec2-user/mongomigrationstream/
custom.queue.factory=BiqQueueFactory
custom.dumpReadPreference=primary
custom.batchSize=1000
collections.source=bar
collections.destination=bar
source.db.uri=mongodb://foo1:27017,foo2:27017,foo3:27017/foo?ssl=true&replicaSet=atlas-shard-0
source.db.name=foo
source.db.authentication.enabled=true
source.db.authentication.username=user
source.db.authentication.password=pwd
source.db.authentication.authDbName=admin
destination.db.uri=mongodb://172.17.4.100,172.17.5.100,172.17.6.100/foo?tls=false&ssl=false&retryWrites=true&w=majority
destination.db.name=foo
destination.db.authentication.enabled=true
destination.db.authentication.username=user
destination.db.authentication.password=pwd
destination.db.authentication.authDbName=admin
Expected behavior
The initial transfer should work.
Environment (please complete the following information):
- OS: Amazon Linux 2 as a migration server
- Version: mongo-migration-stream-0.7.0
Additional context
If I get it to work, this tool will be really, really useful to me! Thanks for open-sourcing it!
Hi again,
it looks like the actual problem is the dbHash command issued by HashSynchronizationDetector. For the source database, which in my case contains ~30 million documents, this is a showstopper unfortunately, as the command seems to completely drag down the database.
I found that this detector can be disabled via config, so I will try running without it. What will be the implications if that detector is missing?
EDIT: without HashSynchronizationDetector the source DB doesn't complain anymore, but the index migration still fails. Any idea what might go wrong?
@pitagoras3 I added a PR to make timeouts and read preference configurable. In my case, increasing the timeouts and reading from the secondary made the migration work smoothly so far.
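For anyone hitting the same timeout in the meantime: since the tool hands these URIs to the MongoDB driver, the standard connection-string options may already help. A sketch based on the source URI from the config above (whether mongo-migration-stream forwards every URI option to the driver is an assumption on my part, and 120000 ms is just an illustrative value):

```properties
# socketTimeoutMS raises the driver's socket read timeout;
# readPreference=secondary moves reads off the primary.
source.db.uri=mongodb://foo1:27017,foo2:27017,foo3:27017/foo?ssl=true&replicaSet=atlas-shard-0&socketTimeoutMS=120000&readPreference=secondary
```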
One thing that's still left to discuss would be the HashSynchronizationDetector - what is the intention behind this? Did it work for your migrations? I read in the blog post that you had large datasets as well...
Hi @alxgrk,
Firstly, thanks for trying out mongo-migration-stream :) I will answer your questions one by one:
it looks like the actual problem is the dbHash command issued by HashSynchronizationDetector. For the source database, which in my case contains ~30 million documents, this is a showstopper unfortunately, as the command seems to completely drag down the database. I found that this detector can be disabled via config, so I will try running without it. What will be the implications if that detector is missing?
The implication is that the migrator won't calculate hashes for the source and destination collections, and so won't compare those hashes (an extra check that the migration went well). From what I can tell you, in production at Allegro we have also disabled HashSynchronizationDetector, and we rely on the QueueSize and CollectionCount detectors.
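For reference, dropping DbHash from the detectors list in the configuration above should be enough to disable it:

```properties
perform.synchronization.detectors=QueueSize,CollectionCount
```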
@pitagoras3 I added a PR to make timeouts and read preference configurable. In my case, increasing the timeouts and reading from the secondary made the migration work smoothly so far.
Thank you! I will take a look into it.
One thing that's still left to discuss would be the HashSynchronizationDetector - what is the intention behind this? Did it work for your migrations? I read in the blog post that you had large datasets as well...
So, the intention is to answer the question "Are all documents migrated?". To answer it, I've implemented 3 SynchronizationDetectors, which are evaluated every 10 seconds after all collections reach the synchronization state:
- CollectionCountSynchronizationDetector - it gets estimatedDocumentCount for the source collection and the destination collection "at the same time" and prints the result (and their diff). The ideal situation is when the difference is equal to 0 (both collections have the same size).
- QueueSizeSynchronizationDetector is simpler - it checks whether there are any events on the migrator queue left to send to the destination collection. The ideal situation is when the size of the queue is 0 (there are no events to send).
- HashSynchronizationDetector - it performs the dbHash command on both the source collection and the destination collection and verifies whether the hashes are equal. The ideal situation is when both hashes are equal (meaning that the documents in both collections are exactly the same).
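The "ideal situation" each detector looks for boils down to a simple comparison. A minimal sketch, with hypothetical values standing in for the real MongoDB driver and queue calls:

```java
// Sketch of the condition each SynchronizationDetector evaluates.
// The values passed in would come from the driver / internal queue in reality.
public class DetectorSketch {

    // CollectionCountSynchronizationDetector: diff of estimated counts should be 0
    static boolean collectionCountInSync(long sourceCount, long destinationCount) {
        return sourceCount - destinationCount == 0;
    }

    // QueueSizeSynchronizationDetector: no pending events left to send
    static boolean queueDrained(int queueSize) {
        return queueSize == 0;
    }

    // HashSynchronizationDetector: dbHash results for both collections must match
    static boolean hashesMatch(String sourceHash, String destinationHash) {
        return sourceHash.equals(destinationHash);
    }

    public static void main(String[] args) {
        System.out.println(collectionCountInSync(30_000_000L, 30_000_000L)); // true
        System.out.println(queueDrained(0));                                 // true
        System.out.println(hashesMatch("abc123", "abc123"));                 // true
    }
}
```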
As you discovered, dbHash command can take "some" time to evaluate (depending on collection size). I will definitely add this info in the documentation :)
Have I answered all your questions, or can I clarify something more?
Your answers are perfectly fine, thanks @pitagoras3 .