braft issues

Snapshot logic risks unrecoverable data in extreme scenarios

1

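The current snapshot logic misbehaves when the disk is full: it cannot generate a new snapshot, yet it still clears the local braft log. As a result, failure recovery on restart keeps failing validation and the process exits. The detailed description and fix are tracked in #461.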
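The ordering hazard described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not braft's implementation and not the fix in #461: `ToyLog`, `truncate_prefix` and `snapshot_then_compact` are hypothetical names, and the "disk full" condition is reduced to a boolean. It only shows the invariant that log entries are discarded after a snapshot has been saved successfully, never before.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical in-memory log: terms[i] is the term of entry (first_index + i).
struct ToyLog {
    int64_t first_index = 1;
    std::deque<int64_t> terms;

    int64_t last_index() const {
        return first_index + static_cast<int64_t>(terms.size()) - 1;
    }

    // Drop every entry with index < keep_from.
    void truncate_prefix(int64_t keep_from) {
        while (first_index < keep_from && !terms.empty()) {
            terms.pop_front();
            ++first_index;
        }
    }
};

// Compacts the log only if the snapshot was durably saved.
// `snapshot_saved_ok` stands in for the result of writing the snapshot,
// which would be false when the disk is full.
bool snapshot_then_compact(ToyLog& log, int64_t applied_index,
                           bool snapshot_saved_ok) {
    if (!snapshot_saved_ok) {
        return false;  // keep the whole log so that restart can still replay it
    }
    log.truncate_prefix(applied_index + 1);
    return true;
}
```

With this ordering, a failed snapshot leaves the log untouched, so a restart can still recover from the previous snapshot plus the retained entries.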

NodeImpl refactoring

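The NodeImpl class inherits from butil::RefCountedThreadSafe and makes its destructor private, so that the object's lifetime is managed automatically by a thread-safe reference count. RefCountedThreadSafe works much like a smart pointer, so why not refactor NodeImpl to use smart pointers instead?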
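To make the comparison concrete, here is a minimal sketch of the intrusive pattern, written without butil so that it is self-contained. `IntrusiveRefCounted`, `ToyNode`, `schedule_callback` and `callback_done` are hypothetical stand-ins, not braft or brpc code, and the real butil::RefCountedThreadSafe differs in detail.

```cpp
#include <atomic>
#include <cassert>
#include <utility>

// Hypothetical intrusive, thread-safe ref-counted base (NOT butil's class).
template <typename T>
class IntrusiveRefCounted {
public:
    void AddRef() const { _refs.fetch_add(1, std::memory_order_relaxed); }
    void Release() const {
        // acq_rel: writes made before the final Release are visible to the destructor.
        if (_refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete static_cast<const T*>(this);
        }
    }
protected:
    IntrusiveRefCounted() : _refs(0) {}
    ~IntrusiveRefCounted() = default;
private:
    mutable std::atomic<int> _refs;
};

class ToyNode : public IntrusiveRefCounted<ToyNode> {
public:
    explicit ToyNode(int* destroyed) : _destroyed(destroyed) {}
    // A member function can take a reference straight from `this`.
    void schedule_callback() { AddRef(); }
    void callback_done() { Release(); }
private:
    friend class IntrusiveRefCounted<ToyNode>;
    // Private destructor: stack allocation and a plain `delete` are rejected.
    ~ToyNode() { ++*_destroyed; }
    int* _destroyed;
};

// Returns {destroy count after the owner released, destroy count at the end}.
std::pair<int, int> demo_lifetime() {
    int destroyed = 0;
    ToyNode* node = new ToyNode(&destroyed);
    node->AddRef();             // the owner's reference
    node->schedule_callback();  // a pending callback holds its own reference
    node->Release();            // owner lets go; the node must stay alive
    const int after_owner_release = destroyed;
    node->callback_done();      // last reference dropped; the node is deleted
    return std::make_pair(after_owner_release, destroyed);
}
```

One practical difference the sketch highlights: with an intrusive count, code that only has `this` can extend the object's lifetime directly, whereas std::shared_ptr needs std::enable_shared_from_this for the same effect. Whether that is the reason for braft's design is an assumption, not something stated in the issue.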

saz97

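Will the snapshot be resent on every startup if no new data is written?

_last_snapshot_id in LogManager is not persisted, so it is 0 on every startup. Consider the following scenario:

1. A, B and C form a three-replica group.
2. A and B start, and data is written.
3. Writes to A and B stop.
4. C starts.
5. The leader sends a snapshot to C.
6. A, B and C all restart, and A or B is elected leader.
7. When the leader replicates to C the config entry written on becoming leader, C has no log entries at all, so get_term returns 0 for any index, which triggers another snapshot.

I am not sure whether my understanding of this scenario is correct. If it is, does it mean that, as long as no new data is written, C has to receive the snapshot again every time it restarts?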

cangfengzhs

NodeImpl::_mutex deadlock

2

#### braft version

commit id: 3cae30fb67cb9e988650500522c6d64ae609f2aa

#### Symptom

All braft internal threads are blocked trying to lock the same NodeImpl::_mutex (address 0x38a4b48). Inspecting the lock owner with gdb shows a deadlock.

![KxXRgJ5KjP](https://github.com/baidu/braft/assets/8337902/78c0cacf-d7a5-4ab0-9f65-a9fa7016b014)

Additional information:

1. Application threads are blocked in calls to Node::is_leader_lease_valid, Node::apply and Node::get_status, each of which is also waiting for the lock internally.
2. Main stacks: Thread 54 (Thread 0x7f5145ffb640 (LWP 67) "worker-0"): #0 0x00007f5204401560 in __lll_lock_wait...

amoxic

fix peerid

PeerId gained a role field, but its operator< and operator== do not take role into account. PeerId's operator< is used by std::map (including in Configuration and node_manager), so creating a new node with the same addr and idx but a different role fails.
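The effect on std::map can be reproduced with a small self-contained sketch. `ToyPeerId`, `ToyRole` and the two comparators are hypothetical stand-ins, not braft's actual PeerId; the point is only that std::map treats two keys as the same when neither orders before the other.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical simplified peer identity (not braft::PeerId).
enum class ToyRole { REPLICA = 0, WITNESS = 1 };

struct ToyPeerId {
    std::string addr;
    int idx;
    ToyRole role;
};

// Ordering that ignores role: peers differing only in role compare equivalent.
struct LessIgnoringRole {
    bool operator()(const ToyPeerId& a, const ToyPeerId& b) const {
        return std::tie(a.addr, a.idx) < std::tie(b.addr, b.idx);
    }
};

// Ordering that includes role: such peers become distinct keys.
struct LessWithRole {
    bool operator()(const ToyPeerId& a, const ToyPeerId& b) const {
        return std::tie(a.addr, a.idx, a.role) < std::tie(b.addr, b.idx, b.role);
    }
};

// Inserts two peers with the same addr and idx but different roles and
// reports whether the second insertion was accepted by the map.
template <typename Less>
bool second_insert_succeeds() {
    std::map<ToyPeerId, int, Less> nodes;
    nodes.emplace(ToyPeerId{"127.0.0.1:8000", 0, ToyRole::REPLICA}, 1);
    return nodes.emplace(ToyPeerId{"127.0.0.1:8000", 0, ToyRole::WITNESS}, 2).second;
}
```

With `LessIgnoringRole` the second insertion is rejected as a duplicate key; with `LessWithRole` it is accepted.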

CkTD

support disable election in unit-test

Fixes https://github.com/baidu/braft/issues/445. In unit tests, we should temporarily disable followers' elections before stopping the leader, so that the raft group deterministically stays leaderless for a period of time.

JayiceZ

fix use after free in handle_timeout_now_request

The closure may be invoked before `elect_self`, which frees the request and causes a use-after-free (UAF).
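The general hazard can be sketched as follows. This is an illustration of the pattern only, not the actual patch: `ToyRequest` and `handle_timeout_now_sketch` are hypothetical, and `done` stands in for a closure whose execution may free the request.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical request object; in an RPC handler its lifetime typically
// ends once the completion closure has run.
struct ToyRequest {
    long term;
    std::string server_id;
};

// Copies what is needed from the request BEFORE running the closure,
// so nothing dereferences `request` after it may have been freed.
long handle_timeout_now_sketch(const ToyRequest* request,
                               const std::function<void()>& done) {
    const long saved_term = request->term;  // copy while the request is alive
    done();                                 // may free *request
    // From here on, use only saved_term; reading request->term would be a UAF.
    return saved_term;
}
```

An alternative with the same effect is to finish every step that reads the request first and run the closure last.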

BusyJay

fix the bug that causes the crash due to the wrong format of the PeerID

When handling cli_service requests, fix a node crash caused by a PeerID transmitted in the wrong format.

LIBA-S

Data synchronization fails

After all data on a node was deleted, the node could not sync the latest data. The error log is as follows:

I20240730 19:40:50.470150 54291 snapshot.cpp:519] Deleting /mnt/fast_pool/evdb/44/snapshot/temp
I20240730 19:40:50.595078 54291 node.cpp:2661] node 44:10.210.146.22:31650:0:0 received InstallSnapshotRequest last_included_log_index=64943885 last_include_log_term=517 from 10.210.146.19:31650:0:0 when last_log_id=(index=0,term=0)
I20240730 19:40:54.846653 54300 node.cpp:1616] node 44:10.210.146.22:31650:0:0 term 517 start...

helloworld0xff

leader can still accept read requests when transferring leader

Because the new leader has to disrupt the old leader before it can receive votes from other candidates, the old leader will always have the latest data before it votes for...

BusyJay
