cheniujh

China hi, have a good day

Results 12 issues of cheniujh

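Stops responding without warning; a page refresh, or possibly a new conversation, is needed before it responds again

It stops responding without any warning (no reply ever arrives). I have to refresh the page, or possibly start a new conversation, before it responds again. Sometimes it hangs as soon as there is any context. At other times it works completely normally. Deployment: Los Angeles server, CentOS 8. Client: a browser in mainland China accessing the server directly, without a proxy. Does anyone know what is going on?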

feat: Add support for dynamically reconfiguring rsync-timeout-ms and throttle-bytes-per-second

1. Add config item rsync-timeout-ms
2. Add support for dynamically reconfiguring rsync-timeout-ms and throttle-bytes-per-second
3. Add a CI test for it

fix #2263

✏️ Feature

fix: Reconstruct slave sync thread model

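**What this PR does**:

**1 Reconstruction of the slave-side master-slave sync thread model** (fix #2637):
- **1.1** The slave's WriteBinlogWorker and WriteDBWorker are now stored in two separate vectors, so users can configure the number of write_binlog_worker and the number of write_db_worker independently. Specifically, the config item "sync-thread-num" directly controls the number of workers used for WriteDB when the slave consumes binlog (the number of write_db_worker), and "sync-binlog-thread-num" determines the number of write_binlog_worker.
- **1.2** Ideally every db gets its own write_binlog_worker, but users may configure fewer write_binlog_worker than dbs; in that case each db picks its binlogWorker by modulo. If the user configures more write_binlog_worker than DBs, Pika uses db_num as the final write_binlog_worker count.
- **1.3** All DBs share a single WriteDBWorker pool for WriteDB (the worker is still selected by hashing the key).

**2 Fixes the Sync Win corruption caused by the slave sending two consecutive TrySync Req in the master-slave timeout-reconnect scenario (fix #2655)**
- **2.1 The final fix chosen for #2655: the slave's handling of the TrySync Resp is changed from asynchronous to synchronous** (it is handled by the BinlogWorker of the corresponding DB, which ensures that in a timeout-reconnect scenario the slave processes the TrySync Resp only after all stale Binlog tasks have been discarded. This prevents the slave from consuming a stale Binlog task whose SessionID does not match, which would trigger a second TrySync Req and lead to Sync...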
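The worker selection described in 1.1 to 1.3 can be sketched roughly as follows. This is a minimal illustration, not Pika's code: the function names are hypothetical, and `zlib.crc32` merely stands in for whatever hash Pika actually applies to the key.

```python
import zlib


def effective_binlog_workers(configured: int, db_num: int) -> int:
    """Cap the write_binlog_worker count at db_num (item 1.2)."""
    return min(configured, db_num)


def pick_binlog_worker(db_index: int, binlog_worker_num: int) -> int:
    """A DB always maps to the same binlog worker (by modulo),
    so the binlog of one DB is consumed in order."""
    return db_index % binlog_worker_num


def pick_db_worker(key: bytes, db_worker_num: int) -> int:
    """All DBs share one WriteDB pool; a given key always lands on
    the same worker. crc32 is an assumed stand-in hash."""
    return zlib.crc32(key) % db_worker_num


# Hypothetical settings: sync-binlog-thread-num=16, sync-thread-num=6, 8 DBs.
binlog_workers = effective_binlog_workers(16, db_num=8)
db_to_worker = {db: pick_binlog_worker(db, binlog_workers) for db in range(8)}
```

With as many binlog workers as DBs, the modulo mapping gives each DB its own worker, which avoids the skew a name-based hash can produce.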

☢️ Bug

4.0.0

3.5.5

The slave sync thread model is not reasonable

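### Is this a regression?

No

### Description

Currently, the thread model for the Binlog-consuming part of a Pika slave is:

**1** Read the sync-thread-num value from the conf file and create sync-thread-num * 2 worker threads. The first half of these workers is used to Apply Binlog and the second half to Apply DB.

**2** To preserve consumption order, the Binlog of each DB is always handled by the same worker. The worker is chosen by hashing db_name to get a worker index, which selects one fixed worker from the first half of the worker vector.

**3** After a worker finishes Apply Binlog, it hashes the key to get an index, picks a worker from the second half of the worker array, and submits an asynchronous WriteDB task.

The problems are:

**Regarding 1**: users are not aware that Pika internally multiplies sync-thread-num by 2, which seems inappropriate, and it means users cannot precisely control the actual number of threads. For example, to keep the number of WriteDB threads from being too small, this config item defaults to 6, so Pika has 12 workers internally: the first 6 write Binlog and the last 6 write DB. With a single DB, 5 of the first 6 workers sit idle and will never be used.

**Regarding 2**: using a hash of db_name to obtain the index is prone to skew. In actual testing with 8 DBs and sync-thread-num set to 8, the hash mapping was: DBs 1, 4, and 7 are all bound to worker 3;...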
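To make the idle-worker arithmetic in "Regarding 1" concrete, here is a small sketch of the old layout. It is an illustration under assumptions: the function names and the DB name `db0` are hypothetical, and `zlib.crc32` stands in for Pika's real hash, so it does not reproduce the specific skew reported in the test.

```python
import zlib


def old_model_layout(sync_thread_num: int):
    """Old model: sync-thread-num * 2 workers; the first half applies
    binlog, the second half writes to the DB."""
    total = sync_thread_num * 2
    binlog_workers = list(range(0, sync_thread_num))
    db_workers = list(range(sync_thread_num, total))
    return binlog_workers, db_workers


def idle_binlog_workers(sync_thread_num: int, db_names) -> int:
    """Count binlog workers that no DB ever maps to.
    crc32 is an assumed stand-in for the db_name hash."""
    binlog_workers, _ = old_model_layout(sync_thread_num)
    used = {zlib.crc32(name.encode()) % sync_thread_num for name in db_names}
    return len(binlog_workers) - len(used)


# Default sync-thread-num of 6 with a single DB: 12 threads, 5 never used.
binlog_workers, db_workers = old_model_layout(6)
idle = idle_binlog_workers(6, ["db0"])
```

Whatever the hash function, a single DB can occupy only one binlog worker, so the remaining five are wasted.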

☢️ Bug

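If a binlog task blocks the slave for a long period, the master may resume incremental replication from an incorrect start position after a timeout reconnection

### Is this a regression?

No

### Description

**During incremental sync, if a BinlogTask blocks the slave for a long time (longer than the 20s timeout), then when master and slave try to re-establish the connection, the slave may give the master a wrong Binlog Start Point from which to resume.**

**Detailed description of this case**:

**Scenario**: Suppose the slave times out during the incremental sync phase and switches to the TryConnect state. When it sends the TrySync request, the corresponding BinlogWorker happens to still be executing a Binlog task from the previous connection (call this task A). Because of task A, could the slave send an incorrect resume start Offset to the master? In other words: although task A will eventually be persisted on the slave, could the slave tell the master to send Binlog starting from the position of task A (the Binlog it corresponds to), so that the master sends task A's Binlog again and the slave consumes it twice? To answer up front: yes, this problem can occur. The details are laid out here:

**Background and definitions**: The offset carried by the TrySync request (defined as offset A) mainly lets the master decide whether incremental sync is possible. What actually determines where the master initializes its send window (the start position for resuming Binlog) is a special BinlogACK (is_first_send=true) that the slave sends after receiving the TrySync Resp. In this ACK the offset start equals end (defined as Offset B), and Offset B is the slave's latest persisted point at the moment it sends this special Ack. In most cases Offset B should be the same as Offset A, but in Edge Case 1, because task A is executing when TrySync is sent and will eventually finish, Offset...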

☢️ Bug

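Sync window corruption may occur when the slave tries to reconnect to the master after a timeout-caused repl disconnection between slave and master

### Is this a regression?

No

### Description

During master-slave sync, timeout-reconnect scenarios are prone to Binlog Window Corruption, which prevents master and slave from resuming correctly. Symptoms:

1 The Binlog Ack returned by the slave carries an unusually wide Binlog Ack Range. If a normal BinlogACK spans 10 Binlog entries, this ACK may span 500, which is clearly unreasonable.

2 The first 80-90% of this BinlogAck's Ack Range is no longer in the window; possibly only the Binlog entries for the last 10% of the Ack Range are still in the SyncWindow.

In master-slave synchronization, scenarios involving timeout reconnections are...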

☢️ Bug

fix: incr sync shouldn't be established after full sync corrupted

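This PR fixes the second problem in Issue #2742:

**Problem description**: After a full sync failed and the RsyncClient exited abnormally, the slave attempted an incremental connection and it actually succeeded, which is wrong.

**Cause**: On the one hand, when RsyncClient exits abnormally, the upper-layer master-slave state machine receives no other signal, so it follows the path that goes on to attempt an incremental connection. On the other hand, if the full sync fails, opening a new RocksDB instance from the full-sync files should in principle fail too (after RocksDB applies the manifest file, it checks whether the fileset in the in-memory current version matches what is on disk). In this case, however, the full sync was interrupted after fetching only some of the SST files, and neither RocksDB's CURRENT nor MANIFEST file had been fetched. So in the Replace DB stage, when RocksDB opened the new instance, it could not find the CURRENT file and simply started an empty instance, which is why no error was reported.

**Solution**:
- 1 Add an error_stop_ flag inside RsyncClient. If RsyncClient exits abnormally (that is, the full sync exits abnormally and the files were not fully fetched), directly delete the directory that holds the snapshot (./dbsync/dbx).
- 2 Deleting the directory as in 1 passes the error state to the upper-layer slave state machine in the form of a missing directory, without increasing the coupling between RsyncClient and the upper layer. If the slave state machine finds that the snapshot directory does not exist, it switches the SlaveDB state to TryConnect and retries the connection.

This PR fixes the second issue in Issue #2742: **Problem Description**:...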
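The "missing directory as error signal" idea in the solution can be sketched like this. It is only an illustration of the technique: the function names, the `ReplaceDB` state label, and the file name `000001.sst` are hypothetical, and the real implementation lives in Pika's C++ RsyncClient and slave state machine.

```python
import os
import shutil
import tempfile


def finish_full_sync(snapshot_dir: str, error_stop: bool) -> None:
    """If the rsync client stopped on an error, remove the partially
    fetched snapshot directory so nothing can open it later."""
    if error_stop:
        shutil.rmtree(snapshot_dir, ignore_errors=True)


def next_slave_state(snapshot_dir: str) -> str:
    """The state machine only looks at the filesystem: a missing
    snapshot directory means the full sync failed, so retry."""
    return "ReplaceDB" if os.path.isdir(snapshot_dir) else "TryConnect"


# Simulate an interrupted full sync that fetched only one SST file.
snapshot = tempfile.mkdtemp()
open(os.path.join(snapshot, "000001.sst"), "w").close()
finish_full_sync(snapshot, error_stop=True)
state = next_slave_state(snapshot)
```

The two components share no flag or callback; the directory's presence is the only contract between them.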

☢️ Bug

3.5.5

4.0.1

fix: add metric is_eligible_for_master_election to support reelection decision in codis-sentinel

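This PR fixes #2436.

1 Specifically, it exposes the metric 'is_eligible_for_master_election' (through the info replication/info commands) to tell whether this instance is eligible to be a candidate for new master during failover.
- The metric behaves as follows: if the current instance is performing a full sync (as a slave), or the last full sync in its run history did not complete (either it or the master went down), is_eligible_for_master_election is false; otherwise it is true.

2 Logic that uses this metric is added to codis-sentinel:
- 2.1 When the master goes down and a new master must be elected, nodes whose is_eligible_for_master_election is false may not be candidates for new master.
- 2.2 After a cluster has elected a new master and a slave switches to it, if is_eligible_for_master_election is found to be false, the slaveof command is executed with the force parameter to force a full sync.

This PR fixes issue #2436: 1. Specifically, it introduces the...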
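A rough sketch of how a sentinel might use the metric, assuming a simplified node model. The `Node` class, its field names, and the exact position of `force` in the command are assumptions for illustration; only the metric name and the two rules (2.1 and 2.2) come from the description.

```python
from dataclasses import dataclass


@dataclass
class Node:
    addr: str
    full_sync_in_progress: bool = False      # hypothetical field
    last_full_sync_unfinished: bool = False  # hypothetical field

    @property
    def is_eligible_for_master_election(self) -> bool:
        # False while a full sync is running or if the last one never finished.
        return not (self.full_sync_in_progress or self.last_full_sync_unfinished)


def election_candidates(slaves):
    """Rule 2.1: ineligible nodes may not become master candidates."""
    return [n for n in slaves if n.is_eligible_for_master_election]


def slaveof_command(slave: Node, master_host: str, master_port: int):
    """Rule 2.2: an ineligible slave re-attaches with 'force' so that it
    performs a full sync (argument position is an assumption)."""
    cmd = ["slaveof", master_host, str(master_port)]
    if not slave.is_eligible_for_master_election:
        cmd.append("force")
    return cmd


slaves = [Node("10.0.0.2:9221"), Node("10.0.0.3:9221", full_sync_in_progress=True)]
candidates = election_candidates(slaves)
```

Filtering before the election keeps a node with an incomplete data set from being promoted.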

✏️ Feature

☢️ Bug

3.5.5

4.0.1

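A slave executing the flushdb command extracted from binlog may cause slave-master inconsistency

### Is this a regression?

Yes

### Description

After a Pika slave applies the binlog, it dispatches the corresponding WriteDB tasks to multiple threads, choosing the thread by hashing the key. flushdb is a command that writes binlog but has no key, so it is always assigned to one fixed thread for execution/ApplyDB. Consider this case: the order persisted on the master (the Binlog order) is: set key1 a1; set key1 a2; flushdb; set key1 a3. The final state of the master is: key1 exists in the master DB with value a3. But the slave may: thread1 executes: set key a1; set key1...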
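The divergence can be reproduced with a toy model. This is a sketch under an assumption: since the slave-side schedule in the description is cut off, the interleaving used here (the keyless flushdb running after all of key1's writes) is just one possible ordering, chosen to illustrate the hazard.

```python
def apply(commands):
    """Apply a sequence of commands to an empty in-memory DB."""
    db = {}
    for cmd in commands:
        if cmd[0] == "set":
            db[cmd[1]] = cmd[2]
        elif cmd[0] == "flushdb":
            db.clear()
    return db


# Binlog order on the master.
binlog = [
    ("set", "key1", "a1"),
    ("set", "key1", "a2"),
    ("flushdb",),
    ("set", "key1", "a3"),
]
master = apply(binlog)

# On the slave, writes to key1 hash to one worker and stay ordered among
# themselves, but flushdb has no key and runs on another worker.
# Assumed interleaving: that worker runs flushdb last.
key1_worker = [c for c in binlog if c[0] == "set"]
flush_worker = [c for c in binlog if c[0] == "flushdb"]
slave = apply(key1_worker + flush_worker)
```

Per-key hashing orders writes to the same key, but gives no ordering guarantee between a keyless command and the keyed commands around it.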

☢️ Bug

The slave occasionally hangs and becomes unresponsive in GitHub Actions' master-slave testing

### Is this a regression?

Yes

### Description

This can be reproduced by running those CI tests on a local machine. It looks related to the flushdb operation; still tracing....

### Please...

☢️ Bug