Tendis
Binlog files in the slave's dump directory are incomplete, so restore to a specified point in time is impossible
Description
According to the documentation at http://tendis.cn/#/Tendisplus/%E8%BF%90%E7%BB%B4/backup, in a master-slave setup a snapshot combined with the binlog allows restoring to any point in time. However, the binlog on the slave appears to be incomplete; part of it seems to linger in memory. For example, with 10 write operations only 3 appear in the binlog, and once I reach 20 writes roughly 5 entries show up. Is this behavior normal?
Current Behavior
Slave info:
> info replication
# Replication
role:slave
master_host:10.2.49.172
master_port:51003
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:26
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:26
rocksdb0_master:ip=10.2.49.172,port=51003,src_store_id=0,state=online,binlog_pos=1,lag=1
rocksdb1_master:ip=10.2.49.172,port=51003,src_store_id=1,state=online,binlog_pos=2,lag=0
rocksdb2_master:ip=10.2.49.172,port=51003,src_store_id=2,state=online,binlog_pos=3,lag=0
rocksdb3_master:ip=10.2.49.172,port=51003,src_store_id=3,state=online,binlog_pos=4,lag=0
rocksdb4_master:ip=10.2.49.172,port=51003,src_store_id=4,state=online,binlog_pos=2,lag=0
rocksdb5_master:ip=10.2.49.172,port=51003,src_store_id=5,state=online,binlog_pos=2,lag=0
rocksdb6_master:ip=10.2.49.172,port=51003,src_store_id=6,state=online,binlog_pos=2,lag=0
rocksdb7_master:ip=10.2.49.172,port=51003,src_store_id=7,state=online,binlog_pos=4,lag=0
rocksdb8_master:ip=10.2.49.172,port=51003,src_store_id=8,state=online,binlog_pos=4,lag=0
rocksdb9_master:ip=10.2.49.172,port=51003,src_store_id=9,state=online,binlog_pos=2,lag=0
The binlog query output always lags slightly behind the binlog_pos reported by info:
# a @ a-pc in ~/WorkSpace/tendisplus/tendisplus-2.1.2-rocksdb-v5.13.4/scripts/slave/dump
$ ../../../bin/binlog_tool --logfile=9/binlog-9-0000001-20210121115133.log
storeid:9 binlogid:1 txnid:34333 chunkid:15759 ts:1611231479844 cmdstr:set
op:1 fkey:i skey: opvalue:9
# a @ a-pc in ~/WorkSpace/tendisplus/tendisplus-2.1.2-rocksdb-v5.13.4/scripts/slave/dump
$ ../../../bin/binlog_tool --logfile=8/binlog-8-0000002-20210121113821.log
storeid:8 binlogid:2 txnid:14092 chunkid:3168 ts:1611212979523 cmdstr:set
op:1 fkey:f skey: opvalue:6
storeid:8 binlogid:3 txnid:35952 chunkid:11958 ts:1611232954249 cmdstr:set
op:1 fkey:q skey: opvalue:17
Steps to Reproduce (for bugs)
Set up a fresh master-slave Tendis deployment, SET the values of 20 keys on the master, then inspect the slave's binlog files.
Your Environment
- Ubuntu 16.04
- Latest downloaded release package
- Master and slave on the same physical machine
Here only one binlog file in each of the 8/ and 9/ directories contains entries. Do the 0/ through 7/ directories really have no content?
Also, the binlog has a cache, so entries may not have been flushed to disk yet. You can run binlogflush to force a flush and confirm whether they were sitting in the buffer.
@TendisDev The 1/ through 7/ directories do have content, but compared with my test data some entries are always missing. Running binlogflush did not change the binlog files at all. However, when I then set 4 new keys, the binlog gained two of the newly set values plus two values set long ago. It behaves as if the cache always retains part of the binlog without flushing it, and only new writes push the older entries to disk. Could it be that my data volume is simply too small?
You can add kvstorecount 1 to the config so there is only a single rocksdb instance, then try again. That way all binlogs end up in one directory, which makes analysis easier. Then post the commands you ran, the config, and the binlog_tool output so we can look at them together.
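For reference, the suggested change might look like the fragment below (a sketch; the config file name and the rest of its contents depend on your deployment):

```
# tendisplus.conf fragment (other required settings omitted)
# Use a single rocksdb instance so every binlog entry lands in dump/0/
kvstorecount 1
```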
@TendisDev After setting kvstorecount 1, the last binlog entry is still missing.
Master side:
10.2.49.172:51003> info replication
# Replication
role:master
connected_slaves:1
slave0:ip=10.2.49.172,port=51004,state=online,offset=0,lag=0
master_repl_offset:0
rocksdb0_slave0:ip=10.2.49.172,port=51004,dest_store_id=0,state=online,binlog_pos=0,lag=0,binlog_lag=0
10.2.49.172:51003> set a 1
OK
10.2.49.172:51003> set a 2
OK
10.2.49.172:51003> set a 3
OK
10.2.49.172:51003> set a 1
OK
10.2.49.172:51003> set b 2
OK
10.2.49.172:51003> set c 3
OK
10.2.49.172:51003> set d 4
OK
10.2.49.172:51003> info replication
# Replication
role:master
connected_slaves:1
slave0:ip=10.2.49.172,port=51004,state=online,offset=7,lag=0
master_repl_offset:7
rocksdb0_slave0:ip=10.2.49.172,port=51004,dest_store_id=0,state=online,binlog_pos=7,lag=0,binlog_lag=0
Slave side:
10.2.49.172:51004> slaveof 10.2.49.172 51003
OK
10.2.49.172:51004> info replication
# Replication
role:slave
master_host:10.2.49.172
master_port:51003
master_link_status:up
master_last_io_seconds_ago:339
master_sync_in_progress:0
slave_repl_offset:0
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
rocksdb0_master:ip=10.2.49.172,port=51003,src_store_id=0,state=online,binlog_pos=0,lag=1611647851
10.2.49.172:51004> info replication
# Replication
role:slave
master_host:10.2.49.172
master_port:51003
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:7
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:7
rocksdb0_master:ip=10.2.49.172,port=51003,src_store_id=0,state=online,binlog_pos=7,lag=0
10.2.49.172:51004> get a
"1"
10.2.49.172:51004> get d
"4"
10.2.49.172:51004> binlogflush all
OK
(1.03s)
Querying the binlog yields the following:
# a @ a-pc in ~/WorkSpace/tendisplus/tendisplus-2.1.2-rocksdb-v5.13.4/scripts/slave/dump/0 [15:59:53]
$ ../../../../bin/binlog_tool --logfile=binlog-0-0000001-20210126155152.log
storeid:0 binlogid:1 txnid:75 chunkid:15495 ts:1611647885630 cmdstr:set
op:1 fkey:a skey: opvalue:1
storeid:0 binlogid:2 txnid:81 chunkid:15495 ts:1611647888862 cmdstr:set
op:1 fkey:a skey: opvalue:2
storeid:0 binlogid:3 txnid:89 chunkid:15495 ts:1611647891919 cmdstr:set
op:1 fkey:a skey: opvalue:3
storeid:0 binlogid:4 txnid:104 chunkid:15495 ts:1611647898046 cmdstr:set
op:1 fkey:a skey: opvalue:1
storeid:0 binlogid:5 txnid:121 chunkid:3300 ts:1611647905140 cmdstr:set
op:1 fkey:b skey: opvalue:2
storeid:0 binlogid:6 txnid:129 chunkid:7365 ts:1611647907905 cmdstr:set
op:1 fkey:c skey: opvalue:3
Even after running binlogflush all, the binlog entry for the set d 4 operation is missing, and the larger the value of kvstorecount, the more pronounced this gap becomes.
This is because the current Tendis implementation always keeps one binlog entry inside rocksdb, controlled by the slaveBinlogKeepNum parameter, so the last entry is not exported.
If you need it exported, keep a continuous stream of operations going, or write an extra monitoring key to generate one more binlog entry.
We plan to add a heartbeat binlog so that instances with no operations for a long time can still export their binlog promptly.
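The retention behavior described above can be sketched as a small simulation (exported_binlogs is a hypothetical helper for illustration, not part of Tendisplus):

```python
def exported_binlogs(per_store_binlogs, keep_num=1):
    """Simulate which binlog entries reach the dump files.

    Tendisplus retains the newest `keep_num` entries (slaveBinlogKeepNum,
    minimum 1) of each rocksdb store inside RocksDB, so with N stores up
    to N entries can be missing from the dump directory at any moment.
    """
    return {store: entries[:-keep_num] if keep_num > 0 else entries[:]
            for store, entries in per_store_binlogs.items()}

# kvstorecount=1: seven SETs produce binlog IDs 1..7 on the slave,
# but ID 7 ("set d 4") stays in RocksDB and never reaches the dump file.
print(exported_binlogs({0: list(range(1, 8))}))
# -> {0: [1, 2, 3, 4, 5, 6]}
```

This also matches the earlier observation that new writes "push out" old entries: each new binlog ID shifts the retained window forward, so previously missing entries appear in the dump only after fresh writes arrive.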
I see in the documentation that this parameter defaults to 1, meaning each rocksdb keeps one entry. What would happen if I set it to 0? In production, writing an extra monitoring key cannot reliably target every rocksdb instance, although on a system with continuous writes this should not be a problem.
Currently the minimum value is 1; we will optimize this.
thanks O(∩_∩)O