ardb Snapshot dump failed with LIST data type

Hi, @yinqiwen we discussed this issue briefly in the other snapshot issue thread. Just wanted to move this here so that it doesn't get lost since its not related to the other issue.

Encountered this error while using BGSAVE to take a snapshot.

[6] 10-28 21:38:15,031 ERROR Previous key:message:1:priority:high with type:5 is not complete dump, 3000 missing in 33482586 elements
[6] 10-28 21:38:15,031 ERROR Failed to write dump file for reason code:-2
[6] 10-28 21:38:15,031 WARN Failed to save snapshot file with type:redis
[6] 10-28 21:38:15,031 WARN Remove snapshot file:/var/lib/ardb-backup/save-redis-snapshot.1509226497281.tmp

These are the commands that are used to remove items from the list which are done in a pipeline.

1509241723.698964 "lrange" "message:1:priority:high" "0" "2999"
1509241723.725610 "ltrim" "message:1:priority:high" "3000" "-1"

Let me know if there is any other info I can provide. I'll investigate to see if there are any other clues.

Oct 30 '17 00:10 bernardko

which write commands are u used to modify the lists?

Oct 30 '17 08:10 yinqiwen

any possible that saving snapshot & ltrim executed at the same time?

Oct 30 '17 08:10 yinqiwen

@yinqiwen I'm using redis-py python client and executing lrange and ltrim in a pipeline command. That should send both commands to the server to be executed together.

The LIST dataset I have are massive so it takes a very long time to save the snapshot. HIghly likely, ltrim was executed while it was saving the snapshot

Oct 30 '17 14:10 bernardko

I noticed that if I push more items into those lists, that error goes away and I can save a snapshot again. I was able to successfully dump a snapshot of those big lists and import without issues after I pushed more items onto both lists which I guess fixed the problem with the element count.

Thanks for the commit, I will test it out.

Oct 30 '17 14:10 bernardko

I just built the newest commit and tried to save a snapshot but it completes immediately and the snapshot file is only 20 bytes.

[7] 10-30 15:02:08,137 INFO Close write latch for snapshot save preparing start.
[7] 10-30 15:02:08,137 INFO Open write latch for snapshot save preparing complete.
[7] 10-30 15:02:08,137 INFO Cost 0s to save snapshot file with type:redis.

Oct 30 '17 15:10 bernardko

Thanks, will give it a test run.

Oct 31 '17 14:10 bernardko

Just tested the latest commit and unfortunately the problem still persists.

To trigger this, I executed BGSAVE and then while it is saving the snapshot, I ran the lrange and ltrim commands on the LIST.

Please find the logs below:

[1] 11-01 08:10:59,834 INFO Close write latch for snapshot save preparing start.
[1] 11-01 08:10:59,834 INFO Open write latch for snapshot save preparing complete.
[1] 11-01 08:20:00,827 INFO ========================Period Statistics Dump Begin===========================
[1] 11-01 08:20:00,827 INFO coststat_bgsave_all:calls=1,costs=7644,cost_per_call=7644,percents=100.0000%
[1] 11-01 08:20:00,827 INFO coststat_bgsave_range[5000-10000]:calls=1,costs=7644,cost_per_call=7644,percents=100.0000%
[1] 11-01 08:20:00,827 INFO slave_sync_total_commands_processed:0
[1] 11-01 08:20:00,827 INFO slave_sync_instantaneous_ops_per_sec:0
[1] 11-01 08:20:00,827 INFO total_commands_processed:1
[1] 11-01 08:20:00,827 INFO instantaneous_ops_per_sec:0
[1] 11-01 08:20:00,827 INFO total_connections_received:1
[1] 11-01 08:20:00,827 INFO rejected_connections:0
[1] 11-01 08:20:00,827 INFO ========================Period Statistics Dump End===========================
[1] 11-01 08:28:58,064 ERROR Previous key:message:1:priority:high with type:5 is not complete dump, 18000 missing in 33492512 elements
[1] 11-01 08:28:58,064 ERROR Failed to write dump file for reason code:-2
[1] 11-01 08:30:00,212 INFO ========================Period Statistics Dump Begin===========================
[1] 11-01 08:30:00,212 INFO coststat_multi_all:calls=5,costs=0,cost_per_call=0,percents=100.0000%
[1] 11-01 08:30:00,212 INFO coststat_multi_range[0-1000]:calls=5,costs=0,cost_per_call=0,percents=100.0000%
[1] 11-01 08:30:00,212 INFO coststat_exec_all:calls=5,costs=472385,cost_per_call=94477,percents=100.0000%
[1] 11-01 08:30:00,212 INFO coststat_exec_range[0-1000]:calls=1,costs=48,cost_per_call=48,percents=20.0000%
[1] 11-01 08:30:00,212 INFO coststat_exec_range[20000-50000]:calls=1,costs=34384,cost_per_call=34384,percents=20.0000%
[1] 11-01 08:30:00,212 INFO coststat_exec_range[50000-100000]:calls=1,costs=51430,cost_per_call=51430,percents=20.0000%
[1] 11-01 08:30:00,212 INFO coststat_exec_range[100000-200000]:calls=1,costs=113537,cost_per_call=113537,percents=20.0000%
[1] 11-01 08:30:00,212 INFO coststat_exec_range[200000-500000]:calls=1,costs=272986,cost_per_call=272986,percents=20.0000%
[1] 11-01 08:30:00,212 INFO coststat_select_all:calls=2,costs=1,cost_per_call=0,percents=100.0000%
[1] 11-01 08:30:00,212 INFO coststat_select_range[0-1000]:calls=2,costs=1,cost_per_call=0,percents=100.0000%
[1] 11-01 08:30:00,212 INFO coststat_lrange_all:calls=5,costs=401352,cost_per_call=80270,percents=100.0000%
[1] 11-01 08:30:00,212 INFO coststat_lrange_range[0-1000]:calls=1,costs=29,cost_per_call=29,percents=20.0000%
[1] 11-01 08:30:00,212 INFO coststat_lrange_range[20000-50000]:calls=2,costs=68853,cost_per_call=34426,percents=40.0000%
[1] 11-01 08:30:00,212 INFO coststat_lrange_range[100000-200000]:calls=1,costs=105008,cost_per_call=105008,percents=20.0000%
[1] 11-01 08:30:00,212 INFO coststat_lrange_range[200000-500000]:calls=1,costs=227462,cost_per_call=227462,percents=20.0000%
[1] 11-01 08:30:00,212 INFO coststat_ltrim_all:calls=5,costs=67478,cost_per_call=13495,percents=100.0000%
[1] 11-01 08:30:00,212 INFO coststat_ltrim_range[0-1000]:calls=1,costs=10,cost_per_call=10,percents=20.0000%
[1] 11-01 08:30:00,212 INFO coststat_ltrim_range[5000-10000]:calls=3,costs=23057,cost_per_call=7685,percents=60.0000%
[1] 11-01 08:30:00,212 INFO coststat_ltrim_range[20000-50000]:calls=1,costs=44411,cost_per_call=44411,percents=20.0000%
[1] 11-01 08:30:00,212 INFO slave_sync_total_commands_processed:0
[1] 11-01 08:30:00,212 INFO slave_sync_instantaneous_ops_per_sec:0
[1] 11-01 08:30:00,212 INFO total_commands_processed:23
[1] 11-01 08:30:00,212 INFO instantaneous_ops_per_sec:0
[1] 11-01 08:30:00,212 INFO total_connections_received:3
[1] 11-01 08:30:00,212 INFO rejected_connections:0
[1] 11-01 08:30:00,212 INFO ========================Period Statistics Dump End===========================
[1] 11-01 08:31:29,147 WARN Failed to save snapshot file with type:redis
[1] 11-01 08:31:29,153 WARN Remove snapshot file:/app/backup/crawl/save-redis-snapshot.1509523859826.tmp

Nov 01 '17 08:11 bernardko

@bernardko there is no around transaction for LTRIM's write operations which seems the root cause of this issue, fixed in latest commit.

Nov 01 '17 12:11 yinqiwen

@yinqiwen Just tested this again using fbcf4c26600930ed3e3ad67bf334afba14cc6dca and it still seems to be happening.

Let me know if there is any other info I can provide. Thanks!

[1] 11-01 17:14:00,985 INFO Close write latch for snapshot save preparing start.
[1] 11-01 17:14:00,985 INFO Open write latch for snapshot save preparing complete.
[1] 11-01 17:20:00,376 INFO ========================Period Statistics Dump Begin===========================
[1] 11-01 17:20:00,376 INFO coststat_multi_all:calls=4,costs=2,cost_per_call=0,percents=100.0000%
[1] 11-01 17:20:00,376 INFO coststat_multi_range[0-1000]:calls=4,costs=2,cost_per_call=0,percents=100.0000%
[1] 11-01 17:20:00,376 INFO coststat_exec_all:calls=4,costs=178859,cost_per_call=44714,percents=100.0000%
[1] 11-01 17:20:00,376 INFO coststat_exec_range[5000-10000]:calls=1,costs=6729,cost_per_call=6729,percents=25.0000%
[1] 11-01 17:20:00,376 INFO coststat_exec_range[20000-50000]:calls=1,costs=30738,cost_per_call=30738,percents=25.0000%
[1] 11-01 17:20:00,376 INFO coststat_exec_range[50000-100000]:calls=2,costs=141392,cost_per_call=70696,percents=50.0000%
[1] 11-01 17:20:00,376 INFO coststat_bgsave_all:calls=1,costs=30695,cost_per_call=30695,percents=100.0000%
[1] 11-01 17:20:00,376 INFO coststat_bgsave_range[20000-50000]:calls=1,costs=30695,cost_per_call=30695,percents=100.0000%
[1] 11-01 17:20:00,376 INFO coststat_select_all:calls=6,costs=2,cost_per_call=0,percents=100.0000%
[1] 11-01 17:20:00,376 INFO coststat_select_range[0-1000]:calls=6,costs=2,cost_per_call=0,percents=100.0000%
[1] 11-01 17:20:00,376 INFO coststat_lrange_all:calls=4,costs=165385,cost_per_call=41346,percents=100.0000%
[1] 11-01 17:20:00,376 INFO coststat_lrange_range[1000-5000]:calls=1,costs=3615,cost_per_call=3615,percents=25.0000%
[1] 11-01 17:20:00,376 INFO coststat_lrange_range[20000-50000]:calls=1,costs=27255,cost_per_call=27255,percents=25.0000%
[1] 11-01 17:20:00,376 INFO coststat_lrange_range[50000-100000]:calls=2,costs=134515,cost_per_call=67257,percents=50.0000%
[1] 11-01 17:20:00,376 INFO coststat_ltrim_all:calls=4,costs=10295,cost_per_call=2573,percents=100.0000%
[1] 11-01 17:20:00,376 INFO coststat_ltrim_range[1000-5000]:calls=4,costs=10295,cost_per_call=2573,percents=100.0000%
[1] 11-01 17:20:00,376 INFO coststat_keys_all:calls=2,costs=87322,cost_per_call=43661,percents=100.0000%
[1] 11-01 17:20:00,376 INFO coststat_keys_range[0-1000]:calls=1,costs=184,cost_per_call=184,percents=50.0000%
[1] 11-01 17:20:00,376 INFO coststat_keys_range[50000-100000]:calls=1,costs=87138,cost_per_call=87138,percents=50.0000%
[1] 11-01 17:20:00,376 INFO slave_sync_total_commands_processed:0
[1] 11-01 17:20:00,376 INFO slave_sync_instantaneous_ops_per_sec:0
[1] 11-01 17:20:00,376 INFO total_commands_processed:27
[1] 11-01 17:20:00,376 INFO instantaneous_ops_per_sec:0
[1] 11-01 17:20:00,376 INFO total_connections_received:7
[1] 11-01 17:20:00,376 INFO rejected_connections:0
[1] 11-01 17:20:00,376 INFO ========================Period Statistics Dump End===========================
[1] 11-01 17:27:01,984 ERROR Previous key:message:1:priority:high with type:5 is not complete dump, 12000 missing in 34065360 elements
[1] 11-01 17:27:01,984 ERROR Failed to write dump file for reason code:-2
[1] 11-01 17:29:14,364 WARN Failed to save snapshot file with type:redis
[1] 11-01 17:29:14,370 WARN Remove snapshot file:/app/backup/crawl/save-redis-snapshot.1509556440954.tmp

Nov 01 '17 17:11 bernardko

@yinqiwen Actually, let me retry this test. I think the way I tested just reflects the state of my database. I will try to do a full snapshot save first without running the commands, make sure that works and then do one while running the ltrim commands.

After looking at the missing numbers, I realized that I need to make sure nothing is missing in the LISTs first before testing. Will report back.

Nov 01 '17 19:11 bernardko

Just noticed I was able to call BGSAVE multiple times and I see that multiple snapshots are being saved, is this normal behavior?

[1] 11-02 04:21:35,500 INFO Close write latch for snapshot save preparing start.
[1] 11-02 04:21:35,500 INFO Open write latch for snapshot save preparing complete.
[1] 11-02 04:22:26,753 INFO Close write latch for snapshot save preparing start.
[1] 11-02 04:22:26,753 INFO Open write latch for snapshot save preparing complete.
[1] 11-02 04:22:53,110 INFO Close write latch for snapshot save preparing start.
[1] 11-02 04:22:53,110 INFO Open write latch for snapshot save preparing complete.
[1] 11-02 04:23:30,290 INFO Close write latch for snapshot save preparing start.
[1] 11-02 04:23:30,290 INFO Open write latch for snapshot save preparing complete.

Nov 02 '17 04:11 bernardko

yes, there is no limit for starting multiple bgsave tasks.

Nov 03 '17 01:11 yinqiwen

Right, thanks I remembered incorrectly. I'm still working on getting this tested properly, will get back to you shortly.

Nov 03 '17 07:11 bernardko

@yinqiwen After a successful snapshot while there were no writes, I tried to running the processes again and taking a snapshot. However, the error is still coming up.

[1] 11-06 08:42:50,956 ERROR Previous key:message:1:priority:high with type:5 is not complete dump, 293 missing in 34850895 elements
[1] 11-06 08:42:50,956 ERROR Failed to write dump file for reason code:-2
[1] 11-06 08:44:13,905 WARN Failed to save snapshot file with type:redis
[1] 11-06 08:44:13,909 WARN Remove snapshot file:/app/backup/crawl/save-redis-snapshot.1509956553952.tmp

Nov 06 '17 08:11 bernardko

@bernardko fixed a bug that the key deletion is not batched in LTRIM

Nov 06 '17 09:11 yinqiwen

Thanks, I'll test it out.

Nov 06 '17 10:11 bernardko

Hi @yinqiwen I finally had a chance to come back to testing this issue.

Do I need to stop all read and writes to ardb before attempting a snapshot? Would this work better if I replicated to a slave db and then snapshot from there instead?

These are some errors I got when trying to snapshot using BGSAVE 3 different running ardb databases:

[1] 02-03 19:33:43,385 ERROR Previous key:request:status:200 with type:7 is not complete dump, 134 missing in 1228288 elements
[1] 02-03 19:33:43,385 ERROR Failed to write dump file for reason code:-2
[1] 02-03 19:33:49,236 WARN Failed to save snapshot file with type:redis

[1] 02-03 19:13:31,808 ERROR Previous key:queue:1:transfer:high with type:5 is not complete dump, 2529 missing in 1433560 elements
[1] 02-03 19:13:31,808 ERROR Failed to write dump file for reason code:-2
[1] 02-03 19:14:01,186 WARN Failed to save snapshot file with type:redis

[1] 02-03 18:46:47,062 INFO Close write latch for snapshot save preparing start.
[1] 02-03 18:46:47,063 INFO Open write latch for snapshot save preparing complete.
[1] 02-03 18:46:58,555 ERROR Previous key:Category:name:idx with type:9 is not complete dump, 1 missing in 93004 elements
[1] 02-03 18:46:58,555 ERROR Failed to write dump file for reason code:-2
[1] 02-03 18:47:45,886 WARN Failed to save snapshot file with type:redis

Thanks for your help!

Feb 03 '18 20:02 bernardko

ardb ardb copied to clipboard

Snapshot dump failed with LIST data type

ardb
ardb copied to clipboard