Snapshotting a large table
Hi,
I'm trying to snapshot a large table (~100 million rows) to Kafka in order to bootstrap a replica of a MySQL table on HDFS. I'm using the --no-transaction
flag because I don't have FLUSH permissions on the database. First, I had to extend the timeout in the handleEvent
method. Now I'm running into the following garbage collection error:
```
Exception in thread "metrics-meter-tick-thread-1" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "metrics-meter-tick-thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "metrics-meter-tick-thread-4" Exception in thread "shutdownHook1" java.lang.OutOfMemoryError: GC overhead limit exceeded
```
From what I can tell, the entire table snapshot is contained within a single SelectEvent. The error occurs a few minutes into the SelectConsumer.handleEvents() loop. Do you have any recommendations for getting around the garbage collection issue? Thanks for all your work on this project!
@mbittmann thanks for the feedback.
The current implementation is very naive in terms of handling large tables. I've been looking at similar projects to see how they handle this, and I like the way Sqoop can split a table into multiple parts based on a split-by column. I'm going to implement similar functionality for mypipe soon unless someone else gets to it first (=
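For anyone following along, here's a rough sketch of what split-by chunking could look like over plain JDBC. The table and column names, chunk size, and connection details are all hypothetical, and this isn't mypipe's actual code:

```scala
// Sketch of split-by chunking over JDBC. big_table, id, the chunk size,
// and the connection URL are all hypothetical placeholders.
import java.sql.DriverManager

object SplitBySnapshot {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/db", "user", "pass")
    try {
      val chunkSize = 10000L

      // 1. Find the bounds of the split-by column (assumed numeric, indexed).
      val bounds = conn.createStatement()
        .executeQuery("SELECT MIN(id), MAX(id) FROM big_table")
      bounds.next()
      val (min, max) = (bounds.getLong(1), bounds.getLong(2))

      // 2. Walk the key range in fixed-size slices so no single result set
      //    (or event) ever has to hold the whole table.
      var lo = min
      while (lo <= max) {
        val hi = math.min(lo + chunkSize - 1, max)
        val stmt = conn.prepareStatement(
          "SELECT * FROM big_table WHERE id BETWEEN ? AND ?")
        stmt.setLong(1, lo)
        stmt.setLong(2, hi)
        val rs = stmt.executeQuery()
        while (rs.next()) {
          // emit one row at a time (e.g. publish to Kafka) instead of buffering
        }
        rs.close(); stmt.close()
        lo = hi + 1
      }
    } finally conn.close()
  }
}
```

The key point is that each slice is a bounded query, so memory stays flat regardless of table size.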
In the meantime, you can try giving the JVM more memory (e.g. via -Xmx) and see if that helps, although that's a temporary workaround at best.
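For reference, if you're launching through sbt, one way to pass a bigger heap to a forked JVM is sketched below; the 8g figure is arbitrary, and -XX:-UseGCOverheadLimit only suppresses this particular error rather than fixing the underlying allocation problem:

```scala
// build.sbt — hedged example: run in a forked JVM with a larger heap.
fork := true
javaOptions ++= Seq(
  "-Xmx8g",                   // arbitrary size; tune to the host
  "-XX:-UseGCOverheadLimit"   // masks the symptom, not a fix
)
```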
Thanks for the reply! That makes sense. I ended up going with Sqoop to bootstrap the tables, which also has the advantage of bypassing Kafka. There were a few serialization issues to tackle with SQL column types being mapped to different Avro types, such as timestamps and certain flavors of TINYINT.
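In case it helps others, the fixes looked roughly like the sketch below. The type mappings shown are illustrative assumptions about the conversions involved, not Sqoop's exact behavior:

```scala
import java.sql.Timestamp

object ColumnNormalizer {
  /** Sketch: coerce MySQL JDBC values into shapes an Avro schema expects.
    * These mappings are hypothetical examples, not Sqoop's actual rules. */
  def normalize(columnType: String, value: Any): Any = (columnType, value) match {
    // MySQL TINYINT(1) surfaces as a small integer over JDBC; a downstream
    // Avro boolean field needs an explicit conversion.
    case ("TINYINT", n: Number) => n.intValue() != 0
    // Classic Avro (before logical types) has no timestamp type; epoch
    // millis stored in a long is one common convention.
    case ("TIMESTAMP", ts: Timestamp) => ts.getTime
    case (_, v) => v
  }
}
```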
Starting to make progress here, @mbittmann. See commit ~~63d1f43e0f1d025d511052e27a7e5b03e165a3bc~~ 6aff568244026bea87e438f526dd3969a9a81536.