Replication (snapshot and live) from one Elasticache Redis to another is failing
Hello,
I'm running the following command to replicate from one ElastiCache cluster to another:
JAVA_OPTS="-Xmx64g" ./riot-redis -i --stacktrace -h source.cache.amazonaws.com -p 6379 replicate -h target.cache.amazonaws.com -p 6379 --batch 1
Source cluster INFO output:
# Server
redis_version:3.2.4
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:0
redis_mode:standalone
os:Amazon ElastiCache
arch_bits:64
multiplexing_api:epoll
gcc_version:0.0.0
process_id:1
run_id:8cd8a940be86564d937725bc7dc967c9e51a94d7
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
# Replication
tcp_port:6379
uptime_in_seconds:39663067
uptime_in_days:459
hz:10
lru_clock:4366755
total_commands_processed:2756363612
instantaneous_ops_per_sec:3
total_net_input_bytes:195785379317
total_net_output_bytes:668172690832
instantaneous_input_kbps:0.12
instantaneous_output_kbps:4.46
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:904962796
keyspace_misses:8286
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
# Replication
role:slave
master_host:source.cache.amazonaws.com
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
# Stats
total_connections_received:1463228
total_commands_processed:2756363612
instantaneous_ops_per_sec:3
total_net_input_bytes:195785379317
total_net_output_bytes:668172690832
instantaneous_input_kbps:0.12
instantaneous_output_kbps:4.46
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:904962796
keyspace_misses:8286
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
# Replication
role:slave
master_host:source.cache.amazonaws.com
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:623807091890
repl_sync_enabled:1
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:17899.54
used_cpu_user:49092.53
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
# Cluster
cluster_enabled:0
# Keyspace
db0:keys=63463817,expires=33322835,avg_ttl=0
db5:keys=45502,expires=1812,avg_ttl=0
Target cluster INFO output:
# Server
redis_version:6.2.6
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:0
redis_mode:standalone
os:Amazon ElastiCache
arch_bits:64
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:0.0.0
process_id:1
run_id:1317126a7e9e3af4e759006a6aaafc43fa0913f5
tcp_port:6379
server_time_usec:1682088327009544
uptime_in_seconds:217224
uptime_in_days:2
hz:10
configured_hz:10
lru_clock:4366726
executable:-
config_file:-
# Clients
connected_clients:7
cluster_connections:0
maxclients:65000
client_recent_max_input_buffer:32
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
# Memory
used_memory:5273747832
used_memory_human:4.91G
used_memory_rss:5364830208
used_memory_rss_human:5.00G
used_memory_peak:5387511200
used_memory_peak_human:5.02G
used_memory_peak_perc:97.89%
used_memory_overhead:1478635424
used_memory_startup:9306856
used_memory_dataset:3795112408
used_memory_dataset_perc:72.09%
allocator_allocated:5278263360
allocator_active:5279645696
allocator_resident:5363998720
used_memory_lua:33792
used_memory_lua_human:33.00K
master_repl_offset:240700495402
second_repl_offset:157087304785
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:240699446827
repl_backlog_histlen:1048576
# CPU
used_cpu_sys:18637.736549
used_cpu_user:27639.798232
used_cpu_sys_children:2.001466
used_cpu_user_children:13.251697
used_cpu_sys_main_thread:252.775360
used_cpu_user_main_thread:14353.886320
# Modules
# Errorstats
errorstat_ERR:count=2387
# Cluster
cluster_enabled:0
# Keyspace
db0:keys=20264055,expires=10620831,avg_ttl=1481086241362
The process gets stuck at 31% and never completes. Typical messages look like this:
14:35:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264349/63458641 (11:16:18 /
14:36:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264350/63458641 (11:17:18 /
14:37:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264351/63458641 (11:18:18 /
14:38:09.367 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264352/63458641 (11:19:18 /
14:39:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264353/63458641 (11:20:18 /
14:40:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264354/63458641 (11:21:18 /
14:41:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264355/63458641 (11:22:18 /
14:42:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264356/63458641 (11:23:18 /
14:43:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264357/63458641 (11:24:18 /
14:44:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264358/63458641 (11:25:18 / 24:20:45) 492.8/s
14:45:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264359/63458641 (11:26:18 / 24:22:53) 492.1/s
14:46:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264360/63458641 (11:27:18 / 24:25:01) 491.4/s
14:47:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264361/63458641 (11:28:18 / 24:27:09) 490.7/s
14:48:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264362/63458641 (11:29:18 / 24:29:17) 490.0/s
14:49:09.367 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264363/63458641 (11:30:18 / 24:31:25) 489.3/s
14:50:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264364/63458641 (11:31:18 / 24:33:33) 488.6/s
14:51:09.367 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning 31% 20264365/63458641 (11:32:18 / 24:35:40) 487.9/s
14:52:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog : Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
14:52:09.373 INFO io.lettuce.core.protocol.ReconnectionHandler : Reconnected to target.cache.amazonaws.com:6379
Sometimes I see the following error with a stack trace:
15:29:04.475 SEVERE org.springframework.batch.core.step.AbstractStep : Encountered an error executing step scan-reader in job scan-reader
org.springframework.retry.ExhaustedRetryException: Retry exhausted after last attempt in recovery path, but exception is not skippable.; nested exception is java.util.concurrent.TimeoutException
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$5.recover(FaultTolerantChunkProcessor.java:429)
at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:539)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:387)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:255)
at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:217)
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.write(FaultTolerantChunkProcessor.java:444)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:375)
at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:145)
at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:258)
at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:208)
at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:413)
at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:136)
at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
at com.redis.spring.batch.reader.KeyDumpValueReader.read(KeyDumpValueReader.java:42)
at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:77)
at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:27)
at com.redis.spring.batch.RedisItemReader$Enqueuer.write(RedisItemReader.java:151)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$3.doWithRetry(FaultTolerantChunkProcessor.java:348)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
... 21 more
15:29:04.475 INFO org.springframework.batch.core.step.AbstractStep : Step: [scan-reader] executed in 1h34m57s282ms
During live replication (--mode live), I'm getting the following error:
17:47:45.573 WARNING com.redis.spring.batch.reader.KeyspaceNotificationItemReader : Could not add key because queue is full
17:47:46.145 SEVERE org.springframework.batch.core.step.AbstractStep : Encountered an error executing step live-reader in job live-reader
org.springframework.retry.ExhaustedRetryException: Retry exhausted after last attempt in recovery path, but exception is not skippable.; nested exception is java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$5.recover(FaultTolerantChunkProcessor.java:429)
at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:539)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:387)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:255)
at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:217)
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.write(FaultTolerantChunkProcessor.java:444)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:375)
at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:145)
at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:258)
at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:208)
at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:413)
at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:136)
at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022)
at com.redis.spring.batch.reader.KeyDumpValueReader.read(KeyDumpValueReader.java:42)
at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:77)
at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:27)
at com.redis.spring.batch.RedisItemReader$Enqueuer.write(RedisItemReader.java:151)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$3.doWithRetry(FaultTolerantChunkProcessor.java:348)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
... 21 more
Caused by: java.lang.OutOfMemoryError: Java heap space
Source & Target cluster both have 2 nodes in the replication group - each one with 16 CPUs & 64GB RAM. Replication instance has 32 CPUs & 128GB RAM.
I have tried many options (different --batch sizes, adding --no-verify and --skip-policy=ALWAYS, running with multiple threads), but nothing works. I'm wondering if you can take a look and help me.
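For reference, a representative variation of the kind of command I tried looks like the one below; the batch size and thread count are illustrative values rather than the exact settings from any particular run, and the options are the ones listed above:
JAVA_OPTS="-Xmx64g" ./riot-redis --stacktrace -h source.cache.amazonaws.com -p 6379 replicate -h target.cache.amazonaws.com -p 6379 --batch 50 --threads 4 --no-verify --skip-policy=ALWAYS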
Thank you, Nodir
@nfayzullaev did you manage to solve the issue?
We ran into a similar problem, but with TimeoutExceptions. The tool works on a smaller DB (4 GB in staging), but fails at around 6 GB on our 15 GB prod DB. We tried playing around with multiple parameters, increasing and decreasing timeouts etc., but no luck. In AWS we are not exceeding any network bandwidth allowance.
./riot --info --stacktrace -h source.cache.amazonaws.com -p 6379 replicate --mode=SNAPSHOT -h target.cache.amazonaws.com -p 6379 --read-batch=10 --read-pool=8 --metrics --threads 4 --type=DUMP --timeout=30 --read-threads=2 --wait-timeout=4000 --write-pool=4
16:13:12.427 SEVERE org.springframework.batch.core.step.AbstractStep : Encountered an error executing step scan-reader in job scan-reader
org.springframework.retry.ExhaustedRetryException: Retry exhausted after last attempt in recovery path, but exception is not skippable.; nested exception is java.util.concurrent.TimeoutException
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$5.recover(FaultTolerantChunkProcessor.java:429)
at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:539)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:387)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:255)
at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:217)
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.write(FaultTolerantChunkProcessor.java:444)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:375)
at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:145)
at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:258)
at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:208)
at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:413)
at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:136)
at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
at com.redis.spring.batch.reader.KeyDumpValueReader.read(KeyDumpValueReader.java:42)
at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:77)
at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:27)
at com.redis.spring.batch.RedisItemReader$Enqueuer.write(RedisItemReader.java:151)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$3.doWithRetry(FaultTolerantChunkProcessor.java:348)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
... 21 more
16:13:12.429 INFO org.springframework.batch.core.step.AbstractStep : Step: [scan-reader] executed in 46m34s414ms
16:13:12.445 INFO org.springframework.batch.core.launch.support.SimpleJobLauncher : Job: [SimpleJob: [name=scan-reader]] complete
Scanning 34% 4792300/13695788 (0:46:35 / 1:26:32) 1714.6/s
16:13:12.964 INFO org.springframework.batch.core.step.AbstractStep : Step: [snapshot-replication] executed in 46m36s439ms
16:13:13.068 INFO org.springframework.batch.core.job.SimpleStepHandler : Executing step: [verification]
In my environment I encountered the same problem. Through testing, I found that the cause is hot keys and large keys in the source DB, which riot cannot process normally; this is what leads to the OOM.
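If you want to confirm this on your side, redis-cli's built-in key sampler is a simple check. This assumes you can reach the source endpoint from a client host; the hostname below is the placeholder used in this thread, and note that the source keyspace spans db0 and db5:
# Sample the keyspace via SCAN (non-blocking) and report the largest key of each type
redis-cli -h source.cache.amazonaws.com -p 6379 -n 0 --bigkeys
redis-cli -h source.cache.amazonaws.com -p 6379 -n 5 --bigkeys
# Optionally sleep between SCAN batches to reduce load on the source
redis-cli -h source.cache.amazonaws.com -p 6379 -n 0 --bigkeys -i 0.1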