Replication (snapshot and live) from one ElastiCache Redis to another is failing

Open · nfayzullaev opened this issue 1 year ago · 2 comments

Hello,

I'm running the following command to replicate from one ElastiCache cluster to another: JAVA_OPTS="-Xmx64g" ./riot-redis -i --stacktrace -h source.cache.amazonaws.com -p 6379 replicate -h target.cache.amazonaws.com -p 6379 --batch 1

Source cluster INFO output:

# Server
redis_version:3.2.4
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:0
redis_mode:standalone
os:Amazon ElastiCache
arch_bits:64
multiplexing_api:epoll
gcc_version:0.0.0
process_id:1
run_id:8cd8a940be86564d937725bc7dc967c9e51a94d7
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication
tcp_port:6379
uptime_in_seconds:39663067
uptime_in_days:459
hz:10
lru_clock:4366755
total_commands_processed:2756363612
instantaneous_ops_per_sec:3
total_net_input_bytes:195785379317
total_net_output_bytes:668172690832
instantaneous_input_kbps:0.12
instantaneous_output_kbps:4.46
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:904962796
keyspace_misses:8286
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication
role:slave
master_host:source.cache.amazonaws.com
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:1463228
total_commands_processed:2756363612
instantaneous_ops_per_sec:3
total_net_input_bytes:195785379317
total_net_output_bytes:668172690832
instantaneous_input_kbps:0.12
instantaneous_output_kbps:4.46
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:904962796
keyspace_misses:8286
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication
role:slave
master_host:source.cache.amazonaws.com
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:623807091890
repl_sync_enabled:1
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:17899.54
used_cpu_user:49092.53
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=63463817,expires=33322835,avg_ttl=0
db5:keys=45502,expires=1812,avg_ttl=0

Target cluster INFO output:

# Server
redis_version:6.2.6
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:0
redis_mode:standalone
os:Amazon ElastiCache
arch_bits:64
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:0.0.0
process_id:1
run_id:1317126a7e9e3af4e759006a6aaafc43fa0913f5
tcp_port:6379
server_time_usec:1682088327009544
uptime_in_seconds:217224
uptime_in_days:2
hz:10
configured_hz:10
lru_clock:4366726
executable:-
config_file:-

# Clients
connected_clients:7
cluster_connections:0
maxclients:65000
client_recent_max_input_buffer:32
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0

# Memory
used_memory:5273747832
used_memory_human:4.91G
used_memory_rss:5364830208
used_memory_rss_human:5.00G
used_memory_peak:5387511200
used_memory_peak_human:5.02G
used_memory_peak_perc:97.89%
used_memory_overhead:1478635424
used_memory_startup:9306856
used_memory_dataset:3795112408
used_memory_dataset_perc:72.09%
allocator_allocated:5278263360
allocator_active:5279645696
allocator_resident:5363998720
used_memory_lua:33792
used_memory_lua_human:33.00K
master_repl_offset:240700495402
second_repl_offset:157087304785
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:240699446827
repl_backlog_histlen:1048576

# CPU
used_cpu_sys:18637.736549
used_cpu_user:27639.798232
used_cpu_sys_children:2.001466
used_cpu_user_children:13.251697
used_cpu_sys_main_thread:252.775360
used_cpu_user_main_thread:14353.886320

# Modules

# Errorstats
errorstat_ERR:count=2387

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=20264055,expires=10620831,avg_ttl=1481086241362

The process gets stuck at 31% and never completes. A typical sequence of messages looks like this:

14:35:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264349/63458641 (11:16:18 / 14:36:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264350/63458641 (11:17:18 / 14:37:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264351/63458641 (11:18:18 / 14:38:09.367 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264352/63458641 (11:19:18 / 14:39:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264353/63458641 (11:20:18 / 14:40:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264354/63458641 (11:21:18 / 14:41:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264355/63458641 (11:22:18 / 14:42:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264356/63458641 (11:23:18 / 14:43:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264357/63458641 (11:24:18 / 14:44:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264358/63458641 (11:25:18 / 24:20:45) 492.8/s
14:45:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264359/63458641 (11:26:18 / 24:22:53) 492.1/s
14:46:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264360/63458641 (11:27:18 / 24:25:01) 491.4/s
14:47:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264361/63458641 (11:28:18 / 24:27:09) 490.7/s
14:48:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264362/63458641 (11:29:18 / 24:29:17) 490.0/s
14:49:09.367 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264363/63458641 (11:30:18 / 24:31:25) 489.3/s
14:50:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264364/63458641 (11:31:18 / 24:33:33) 488.6/s
14:51:09.367 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
Scanning  31% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20264365/63458641 (11:32:18 / 24:35:40) 487.9/s
14:52:09.368 INFO io.lettuce.core.protocol.ConnectionWatchdog	: Reconnecting, last destination was target.cache.amazonaws.com/10.134.95.66:6379
14:52:09.373 INFO io.lettuce.core.protocol.ReconnectionHandler	: Reconnected to target.cache.amazonaws.com:6379

Sometimes I see the following stack trace:

15:29:04.475 SEVERE org.springframework.batch.core.step.AbstractStep    : Encountered an error executing step scan-reader in job scan-reader
org.springframework.retry.ExhaustedRetryException: Retry exhausted after last attempt in recovery path, but exception is not skippable.; nested exception is java.util.concurrent.TimeoutException
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$5.recover(FaultTolerantChunkProcessor.java:429)
        at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:539)
        at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:387)
        at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:255)
        at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:217)
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.write(FaultTolerantChunkProcessor.java:444)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
        at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
        at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
        at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
        at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
        at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
        at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:375)
        at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
        at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:145)
        at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:258)
        at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:208)
        at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
        at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:413)
        at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:136)
        at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
        at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException
        at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
        at com.redis.spring.batch.reader.KeyDumpValueReader.read(KeyDumpValueReader.java:42)
        at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:77)
        at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:27)
        at com.redis.spring.batch.RedisItemReader$Enqueuer.write(RedisItemReader.java:151)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$3.doWithRetry(FaultTolerantChunkProcessor.java:348)
        at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
        ... 21 more

15:29:04.475 INFO org.springframework.batch.core.step.AbstractStep      : Step: [scan-reader] executed in 1h34m57s282ms

During live replication (--mode live), I'm getting the following error:

17:47:45.573 WARNING com.redis.spring.batch.reader.KeyspaceNotificationItemReader       : Could not add key because queue is full
17:47:46.145 SEVERE org.springframework.batch.core.step.AbstractStep    : Encountered an error executing step live-reader in job live-reader
org.springframework.retry.ExhaustedRetryException: Retry exhausted after last attempt in recovery path, but exception is not skippable.; nested exception is java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$5.recover(FaultTolerantChunkProcessor.java:429)
        at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:539)
        at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:387)
        at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:255)
        at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:217)
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.write(FaultTolerantChunkProcessor.java:444)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
        at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
        at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
        at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
        at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
        at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
        at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
        at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:375)
        at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
        at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:145)
        at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:258)
        at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:208)
        at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
        at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:413)
        at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:136)
        at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
        at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022)
        at com.redis.spring.batch.reader.KeyDumpValueReader.read(KeyDumpValueReader.java:42)
        at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:77)
        at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:27)
        at com.redis.spring.batch.RedisItemReader$Enqueuer.write(RedisItemReader.java:151)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
        at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
        at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$3.doWithRetry(FaultTolerantChunkProcessor.java:348)
        at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
        ... 21 more
Caused by: java.lang.OutOfMemoryError: Java heap space

The source and target clusters each have 2 nodes in their replication group, each node with 16 CPUs and 64 GB RAM. The instance running the replication has 32 CPUs and 128 GB RAM.

I have tried many options (different --batch sizes, --no-verify, --skip-policy=ALWAYS, multiple threads), but nothing works. Could you take a look and help me?
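
For reference, one such combination looks roughly like this (a sketch only; the flags are the ones listed above, while the batch size and thread count shown here are placeholders rather than the exact values tried):

JAVA_OPTS="-Xmx64g" ./riot-redis --stacktrace -h source.cache.amazonaws.com -p 6379 replicate -h target.cache.amazonaws.com -p 6379 --batch 50 --threads 4 --no-verify --skip-policy=ALWAYS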

Thank you, Nodir

nfayzullaev · Apr 21 '23 15:04

@nfayzullaev did you manage to solve the issue?

The tool works on a smaller DB (4 GB in staging) but fails at around 6 GB (the prod DB is 15 GB). We ran into a similar problem, but with TimeoutExceptions. We tried playing with multiple parameters, increasing and decreasing timeouts, etc., but no luck. In AWS we are not exceeding any network bandwidth allowance.

./riot --info --stacktrace -h source.cache.amazonaws.com -p 6379 replicate --mode=SNAPSHOT -h target.cache.amazonaws.com -p 6379 --read-batch=10 --read-pool=8 --metrics --threads 4 --type=DUMP --timeout=30 --read-threads=2 --wait-timeout=4000 --write-pool=4

16:13:12.427 SEVERE org.springframework.batch.core.step.AbstractStep	: Encountered an error executing step scan-reader in job scan-reader
org.springframework.retry.ExhaustedRetryException: Retry exhausted after last attempt in recovery path, but exception is not skippable.; nested exception is java.util.concurrent.TimeoutException
	at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$5.recover(FaultTolerantChunkProcessor.java:429)
	at org.springframework.retry.support.RetryTemplate.handleRetryExhausted(RetryTemplate.java:539)
	at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:387)
	at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:255)
	at org.springframework.batch.core.step.item.BatchRetryTemplate.execute(BatchRetryTemplate.java:217)
	at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor.write(FaultTolerantChunkProcessor.java:444)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
	at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
	at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
	at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
	at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:375)
	at org.springframework.batch.repeat.support.RepeatTemplate.executeInternal(RepeatTemplate.java:215)
	at org.springframework.batch.repeat.support.RepeatTemplate.iterate(RepeatTemplate.java:145)
	at org.springframework.batch.core.step.tasklet.TaskletStep.doExecute(TaskletStep.java:258)
	at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:208)
	at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
	at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:413)
	at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:136)
	at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
	at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
	at com.redis.spring.batch.reader.KeyDumpValueReader.read(KeyDumpValueReader.java:42)
	at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:77)
	at com.redis.spring.batch.reader.AbstractValueReader.process(AbstractValueReader.java:27)
	at com.redis.spring.batch.RedisItemReader$Enqueuer.write(RedisItemReader.java:151)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
	at org.springframework.batch.core.step.item.FaultTolerantChunkProcessor$3.doWithRetry(FaultTolerantChunkProcessor.java:348)
	at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
	... 21 more

16:13:12.429 INFO org.springframework.batch.core.step.AbstractStep	: Step: [scan-reader] executed in 46m34s414ms
16:13:12.445 INFO org.springframework.batch.core.launch.support.SimpleJobLauncher	: Job: [SimpleJob: [name=scan-reader]] completeScanning  34% ━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  4792300/13695788 (0:46:35 / 1:26:32) 1714.6/s
16:13:12.964 INFO org.springframework.batch.core.step.AbstractStep	: Step: [snapshot-replication] executed in 46m36s439ms
16:13:13.068 INFO org.springframework.batch.core.job.SimpleStepHandler	: Executing step: [verification]

CrazyFunker · Jun 19 '23 15:06

In my environment I encountered the same problem as you. Through testing, I found the cause to be hot keys and large keys in the source DB, which prevent riot from keeping up with its listener; that is why it runs out of memory.
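
A quick way to check for such large keys, assuming plain redis-cli access to the source endpoint, is the built-in key sampler (it walks the keyspace with SCAN, so it is safe to run against a replica):

redis-cli -h source.cache.amazonaws.com -p 6379 --bigkeys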

hunkjun · Sep 27 '23 07:09