raft icon indicating copy to clipboard operation
raft copied to clipboard

A compound entry slice type

Open pav-kv opened this issue 1 year ago • 2 comments

raft keeps entries in memory as a single slice: https://github.com/etcd-io/raft/blob/3e6cb625f59ca9da34d03df484217d874340e794/log_unstable.go#L37

In the happy path (when the leader is stable), it just keeps appending to this slice: https://github.com/etcd-io/raft/blob/3e6cb625f59ca9da34d03df484217d874340e794/log_unstable.go#L201

In the unhappy path (when leader changes), it rewrites a suffix of the log by copying a prefix and appending the suffix: https://github.com/etcd-io/raft/blob/3e6cb625f59ca9da34d03df484217d874340e794/log_unstable.go#L212-L213

As a result, we again have a contiguous slice of entries. We do this copying because we need entries[i] to be immutable (because it may have been sent up to Storage through the Ready struct, and is still in use / in flight).

Same reallocation/copying happens in MemoryStorage: https://github.com/etcd-io/raft/blob/3e6cb625f59ca9da34d03df484217d874340e794/storage.go#L300-L302

Another place where we have to reallocate/copy the slice of entries is the slice method where we combine the entries returned from Storage with the entries returned from the unstable structure: https://github.com/etcd-io/raft/blob/3e6cb625f59ca9da34d03df484217d874340e794/log.go#L494-L497

In all these cases, it is not strictly necessary to reallocate and copy []pb.Entry slices. Instead, it would be acceptable to operate on [][]pb.Entry, or a struct that has [][]pb.Entry inside, but provides an interface of a contiguous slice.

Most usages of []pb.Entry need to be able to iterate it, so it's fine if it's multiple slices instead of one. Some usages may need random access by entry.Index, so it's fine if there are a few slices, and it's good if access by Index is safely encapsulated.


Prototype:

// Entries represents a span of the log. Functionally it is
// equivalent to []pb.Entry.
//
// Internally, it consists of a few []pb.Entry slices, ordered
// by its 0-th entry Index. By properties of the Raft log, it is
// also ordered by the 0-th entry Term. Example:
//
//	Entries:  (-----------------------]
//	          .   .   .       .       .
//	t9: │     .   .   .       (-------]
//	t7: │     .   .   (---]-------]
//	t4: │     .   (---]
//	t3: │     (-----------------]
//	    └───────────────────────────────────► index
//	         10  20  30  40  50  60  70
//
// Most of the time, there will be one (when leader is stable) or
// two (when leader changes) slices in it. The struct is optimized
// for this case, and incurs zero allocations. Occasionally it may
// accumulate more slices (e.g. if the leader is unstable).
type Entries struct {
	buf [3][]pb.Entry // used to back `ent` if len(ent) <= 3
	ent [][]pb.Entry  // Invariant: len(ent[i]) != 0
}

// Bounds returns the [begin, end) range of log indices that this
// slice represents.
func (e Entries) Bounds() (uint64, uint64) {
	if len(e.ent) == 0 {
		return 0, 0
	}
	last := e.ent[len(e.ent)-1]
	return e.ent[0][0].Index, last[len(last)-1].Index+1
}

// Len returns the length of the log span.
func (e Entries) Len() int {
	begin, end := e.Bounds()
	return end - begin
}

// Iterate visits all entries in this log span.
func (e Entries) Iterate(visit func([]pb.Entry) bool) {
	l := len(e.ent)
	if l == 0 {
		return
	}
	for i := 0; i + 1 < l; i++ {
		// Compute the visible part of this slice, which is not
		// overlapped by the next slice.
		visible := e.ent[i+1].Index - e.ent[i].Index
		if !visit(e.ent[i][:visible:visible]) { // protect from appends
			return
		}
	}
	visit(e.ent[l-1])
}

// To converts Entries to the equivalent []pb.Entry.
// Just in case it's needed. For instance:
//
//	entries := e.To(make([]pb.Entry, 0, e.Len()))
func (e Entries) To(to []pb.Entry) []pb.Entry {
	e.Iterate(func(part []pb.Entry) bool {
		to = append(to, part...)
		return true
	})
	return to
}

// Need to also have mutators to append to and truncate this slice.

// Could also add a method to access a log entry by Index. This would
// be a binary search by e.ent[i][0].Index (or a linear search if
// len(e.ent) is low, which is normally true), followed by a direct
// access by index within the found sub-slice.

// Could also add a method to access a [begin, end) sub-slice of this
// slice. It would boil down to finding the begin and end-1 index, and
// then either returning a sub-slice with zero copy if it's in a single
// sub-slice, or returning a copy if it is compound.

// Squash can be used to "compact" the log range to a single slice.
// For example, it could be used in MemoryStorage occasionally, once
// len(e.ent) grows above a threshold and increases the cost of random
// access.
func (e Entries) Squash() Entries {
	if len(e.ent) <= 1 {
		return e
	}
	var res Entries
	res.buf[0] = e.To(make([]pb.Entry, 0, e.Len()))
	res.ent = res.buf[:1]
	return res
}

Why something like this struct is useful:

  • Avoids some unnecessary allocations. In the unstable struct, the overlapped entries on leader changes are short-lived, and will be released to the heap soon anyway. The overlapped entries are also likely still referenced because they were returned through Ready for being stored. So copying the slice (as of now) actually increases memory consumption.
  • It's possible to make this struct completely reallocation-free if we always start a new slice when the old slice is filled up.
  • It might be easier to track used memory this way, there would be no hidden refs to copies of slices.
  • Provides protected / type-safe access to a slice of log entries.
  • Can come with extra guarantees that the slices of log entries are verified, e.g. that Index is contiguous, and Term is non-decreasing.

pav-kv avatar Jun 02 '23 10:06 pav-kv

Alternatively, instead of [][]pb.Entry could just do all in one []pb.Entry, with a little “index” on a side. The "index" would be equivalent to the AppendTracker that would be needed for async storage appends.

Say, we have append [1 2 3 4 5], then an override [4 5 6 7], then append again [8, 9]. The “logical” evolution of the log is:

1 2 3 4  5
1 2 3 4' 5' 6 7
1 2 3 4' 5' 6 7 8 9

But physically it could be represented as:

1 2 3 4 5 4' 5' 6 7 8 9

And an “index”:

log index 1-3 is at array positions 0-2
log index 4-9 is at array positions 5-10

The unstable is quickly rotated away, so the backing array will be shrunken, and “index” will be usually 1-2 entries.

pav-kv avatar Jun 02 '23 11:06 pav-kv

hi @pavelkalinnikov, can I pick this up?

Aditya-Sood avatar Sep 11 '23 07:09 Aditya-Sood