raft
raft copied to clipboard
A compound entry slice type
raft
keeps entries in memory as a single slice: https://github.com/etcd-io/raft/blob/3e6cb625f59ca9da34d03df484217d874340e794/log_unstable.go#L37
In the happy path (when the leader is stable), it just keeps appending to this slice: https://github.com/etcd-io/raft/blob/3e6cb625f59ca9da34d03df484217d874340e794/log_unstable.go#L201
In the unhappy path (when leader changes), it rewrites a suffix of the log by copying a prefix and appending the suffix: https://github.com/etcd-io/raft/blob/3e6cb625f59ca9da34d03df484217d874340e794/log_unstable.go#L212-L213
As a result, we again have a contiguous slice of entries. We do this copying because we need entries[i]
to be immutable (because it may have been sent up to Storage
through the Ready
struct, and is still in use / in flight).
Same reallocation/copying happens in MemoryStorage
: https://github.com/etcd-io/raft/blob/3e6cb625f59ca9da34d03df484217d874340e794/storage.go#L300-L302
Another place where we have to reallocate/copy the slice of entries is the slice
method where we combine the entries returned from Storage
with the entries returned from the unstable
structure: https://github.com/etcd-io/raft/blob/3e6cb625f59ca9da34d03df484217d874340e794/log.go#L494-L497
In all these cases, it is not strictly necessary to reallocate and copy []pb.Entry
slices. Instead, it would be acceptable to operate on [][]pb.Entry
, or a struct that has [][]pb.Entry
inside, but provides an interface of a contiguous slice.
Most usages of []pb.Entry
need to be able to iterate it, so it's fine if it's multiple slices instead of one. Some usages may need random access by entry.Index
, so it's fine if there are a few slices, and it's good if access by Index
is safely encapsulated.
Prototype:
// Entries represents a span of the log. Functionally it is
// equivalent to []pb.Entry.
//
// Internally, it consists of a few []pb.Entry slices, ordered
// by its 0-th entry Index. By properties of the Raft log, it is
// also ordered by the 0-th entry Term. Example:
//
// Entries: (-----------------------]
// . . . . .
// t9: │ . . . (-------]
// t7: │ . . (---]-------]
// t4: │ . (---]
// t3: │ (-----------------]
// └───────────────────────────────────► index
// 10 20 30 40 50 60 70
//
// Most of the time, there will be one (when leader is stable) or
// two (when leader changes) slices in it. The struct is optimized
// for this case, and incurs zero allocations. Occasionally it may
// accumulate more slices (e.g. if the leader is unstable).
type Entries struct {
buf [3][]pb.Entry // used to back `ent` if len(ent) <= 3
ent [][]pb.Entry // Invariant: len(ent[i]) != 0
}
// Bounds returns the [begin, end) range of log indices that this
// slice represents.
func (e Entries) Bounds() (uint64, uint64) {
if len(e.ent) == 0 {
return 0, 0
}
last := e.ent[len(e.ent)-1]
return e.ent[0][0].Index, last[len(last)-1].Index+1
}
// Len returns the length of the log span.
func (e Entries) Len() int {
begin, end := e.Bounds()
return end - begin
}
// Iterate visits all entries in this log span.
func (e Entries) Iterate(visit func([]pb.Entry) bool) {
l := len(e.ent)
if l == 0 {
return
}
for i := 0; i + 1 < l; i++ {
// Compute the visible part of this slice, which is not
// overlapped by the next slice.
visible := e.ent[i+1].Index - e.ent[i].Index
if !visit(e.ent[i][:visible:visible]) { // protect from appends
return
}
}
visit(e.ent[l-1])
}
// To converts Entries to the equivalent []pb.Entry.
// Just in case it's needed. For instance:
//
// entries := e.To(make([]pb.Entry, 0, e.Len()))
func (e Entries) To(to []pb.Entry) []pb.Entry {
e.Iterate(func(part []pb.Entry) bool {
to = append(to, part...)
return true
})
return to
}
// Need to also have mutators to append to and truncate this slice.
// Could also add a method to access a log entry by Index. This would
// be a binary search by e.ent[i][0].Index (or a linear search if
// len(e.ent) is low, which is normally true), followed by a direct
// access by index within the found sub-slice.
// Could also add a method to access a [begin, end) sub-slice of this
// slice. It would boil down to finding the begin and end-1 index, and
// then either returning a sub-slice with zero copy if it's in a single
// sub-slice, or returning a copy if it is compound.
// Squash can be used to "compact" the log range to a single slice.
// For example, it could be used in MemoryStorage occasionally, once
// len(e.ent) grows above a threshold and increases the cost of random
// access.
func (e Entries) Squash() Entries {
if len(e.ent) <= 1 {
return e
}
var res Entries
res.buf[0] = e.To(make([]pb.Entry, 0, e.Len()))
res.ent = res.buf[:1]
return res
}
Why something like this struct is useful:
- Avoids some unnecessary allocations. In the
unstable
struct, the overlapped entries on leader changes are short-lived, and will be released to the heap soon anyway. The overlapped entries are also likely still referenced because they were returned throughReady
for being stored. So copying the slice (as of now) actually increases memory consumption. - It's possible to make this struct completely reallocation-free if we always start a new slice when the old slice is filled up.
- It might be easier to track used memory this way, there would be no hidden refs to copies of slices.
- Provides protected / type-safe access to a slice of log entries.
- Can come with extra guarantees that the slices of log entries are verified, e.g. that
Index
is contiguous, andTerm
is non-decreasing.
Alternatively, instead of [][]pb.Entry
could just do all in one []pb.Entry
, with a little “index” on a side. The "index" would be equivalent to the AppendTracker that would be needed for async storage appends.
Say, we have append [1 2 3 4 5]
, then an override [4 5 6 7]
, then append again [8, 9]
. The “logical” evolution of the log is:
1 2 3 4 5
1 2 3 4' 5' 6 7
1 2 3 4' 5' 6 7 8 9
But physically it could be represented as:
1 2 3 4 5 4' 5' 6 7 8 9
And an “index”:
log index 1-3 is at array positions 0-2
log index 4-9 is at array positions 5-10
The unstable
is quickly rotated away, so the backing array will be shrunken, and “index” will be usually 1-2 entries.
hi @pavelkalinnikov, can I pick this up?