
Release in-memory log space based on entries size

Open pav-kv opened this issue 1 year ago • 8 comments

When entries exit the unstable log structure https://github.com/etcd-io/raft/blob/3e6cb625f59ca9da34d03df484217d874340e794/log_unstable.go#L156

it shrinks itself based on the entries slice length and cap: https://github.com/etcd-io/raft/blob/3e6cb625f59ca9da34d03df484217d874340e794/log_unstable.go#L162-L179

Firstly, the len(u.entries)*lenMultiple < cap(u.entries) check does not necessarily do the right thing. After shifting the slice start in u.entries = u.entries[num:], the first num entries "disappear" from both len and cap of this slice: https://go.dev/play/p/ao1xaL0Edla.

So it's possible that len >= cap/2 holds all the time, even though only a small portion of the backing array is used. We should take the capacity of the entire backing array into account instead.
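
For illustration, a minimal standalone example of the effect (the numbers are arbitrary):

```go
package main

import "fmt"

func main() {
	// A backing array of 32 entries, all in use.
	s := make([]int, 32)
	fmt.Println(len(s), cap(s)) // 32 32

	// Truncate the first 24 entries the way unstable does, by re-slicing.
	// Both len and cap of the view shrink, so len*2 < cap is false and no
	// shrink happens, even though 24 of the 32 slots in the backing array
	// are garbage that stays pinned in memory.
	s = s[24:]
	fmt.Println(len(s), cap(s)) // 8 8
}
```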

Secondly, this heuristic only takes the len/cap of the slice into consideration, but not the size of the entries referenced by it. It could be beneficial to, in addition, shrink the slice if the summed size of the "garbage" entries exceeds half of everything the backing array keeps alive. This would keep the overall memory consumption more controlled. Doing so would require maintaining a running sum of entry sizes in the unstable struct (which we do anyway in other places, for rate limiting purposes).
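
A rough sketch of what such a size-aware condition could look like (the byte counters and names below are hypothetical; unstable would have to maintain them itself):

```go
// Hypothetical size-aware shrink condition. liveBytes is the summed size
// of entries still referenced by the slice; garbageBytes is the summed
// size of entries already dropped from the slice but still pinned by its
// backing array. Neither is available from the slice itself, so both
// would need to be tracked as the log is appended to and truncated.
func shouldShrink(length, capacity int, liveBytes, garbageBytes uint64) bool {
	// Keep the existing len/cap heuristic, and additionally shrink when
	// more than half of the retained bytes are garbage.
	return length*2 < capacity || garbageBytes > liveBytes
}
```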

The same heuristic could be applied in the MemoryStorage.Compact method to smooth out the allocation/copying cost of log truncations. Frequent truncations may incur a quadratic cost here (each Compact call copies the remaining suffix), while the heuristic allows capping the total at O(N).
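
A toy sketch of the amortized variant (entries modeled as plain integers, and all names hypothetical; this is not the MemoryStorage code):

```go
// Hypothetical amortized compaction: instead of copying the live suffix
// on every compaction, defer the copy until the compacted prefix makes
// up at least half of the slice. Each copy then costs no more than the
// number of entries compacted since the previous copy, so the total
// copying over many truncations stays O(N) instead of O(N^2).
type compactedLog struct {
	ents   []uint64 // all retained entries; the first `prefix` are garbage
	prefix int      // number of leading entries logically compacted away
}

func (l *compactedLog) compact(n int) {
	l.prefix += n
	if l.prefix*2 >= len(l.ents) {
		l.ents = append([]uint64(nil), l.ents[l.prefix:]...)
		l.prefix = 0
	}
}
```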

pav-kv avatar Jun 02 '23 12:06 pav-kv

Alternatively, prove that the current strategy in unstable can't result in holding more memory than, say, 2x of a configured max.

pav-kv avatar Jun 02 '23 13:06 pav-kv

But it doesn't look like we're safe as is. Say, the max size is 64 MB. Consider the following sequence of appends / truncations (numbers are entry sizes in MB):

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 8 8 8 8 8 8 16 16 32
<-------------------------------> ....................................
                                  <-------------> ....................
                                                  <---------> ........
                                                              <---> ..
                                                                    <>

First, 17 entries of size 2MB are appended, total is 34 MB (below the limit). The slice len == 17, cap == 32.

Then one 2MB entry is truncated, and a sequence of 8 iterations of (append 4MB, truncate 2 x 2MB) follows. During this sequence, len * 2 >= cap holds, so we're not reallocating the array. At the end of it the slice has len == 8, cap == 16.

Then, a sequence of 4 (append 8MB, truncate 2 x 4MB) follows. Same thing.

Then a sequence of 2 (append 16MB, truncate 2 x 8MB). Eventually we fill up the cap of the array by adding a final 32MB entry.

The total memory usage of this slice by the time we reallocate is ~ 5 * 32MB: each "generation" of entries (the 2s, 4s, 8s, 16s, and the final 32) adds ~32MB that stays reachable through the same backing array. More generally, an example can be constructed with O(B log B) bytes usage for a configured max size B, since there can be O(log B) such generations of O(B) bytes each. We would like to reduce this to O(B), something like 2 * B.

pav-kv avatar Jun 02 '23 13:06 pav-kv

I am taking a look at this issue.

However, from what I've learned, I don't think there is a way to get the length of the entire backing array of a slice from the slice itself. That means we will also have to maintain a backingArrayLength in the unstable struct, and update it manually everywhere we assign a value to entries.

E.g. if we assign entries = nil then we set backingArrayLength = 0, and if we set u.entries = append(u.entries, ents...), then we will have to check whether the array was reallocated; if it was, then backingArrayLength = cap(u.entries), otherwise we leave it unchanged.

We need to do the same kind of bookkeeping to keep an accurate sum size of the "garbage" entries.
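
A minimal, self-contained sketch of that kind of bookkeeping (the types and names are made up for illustration, not the actual raft ones):

```go
// Hypothetical manual tracking of the backing array length, since Go
// does not expose it through the slice itself. entry is a stand-in for
// what would be raftpb.Entry in raft.
type entry struct{ data []byte }

type trackedEntries struct {
	entries         []entry
	backingArrayLen int
}

// appendEntries relies on the fact that the built-in append allocates a
// new backing array exactly when the new length would exceed the cap.
func (t *trackedEntries) appendEntries(ents ...entry) {
	willRealloc := len(t.entries)+len(ents) > cap(t.entries)
	t.entries = append(t.entries, ents...)
	if willRealloc {
		t.backingArrayLen = cap(t.entries)
	}
}

func (t *trackedEntries) reset() {
	t.entries = nil
	t.backingArrayLen = 0
}
```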

I wonder if I am correct and if this approach is expected.

BTW, there's a typo: there are 6 * 8MB entries in the figure. I think there should be only 4 of them.

CaojiamingAlan avatar Jun 07 '23 21:06 CaojiamingAlan

I didn't see https://github.com/etcd-io/raft/issues/69. I think this issue would be easy to solve if we get https://github.com/etcd-io/raft/issues/69 done.

CaojiamingAlan avatar Jun 07 '23 21:06 CaojiamingAlan

hi @pavelkalinnikov @CaojiamingAlan is this issue still open for contribution?

Aditya-Sood avatar Aug 11 '23 06:08 Aditya-Sood

I wonder if I am correct and if this approach is expected.

@CaojiamingAlan Yes, I think you're correct, some kind of backingArrayLength / size tracking is needed as a workaround. The backingArrayLength doesn't have to be baked into the unstable type, you could think of the backing array as a separate "deque" kind of data structure and implement/test it separately, to not overload unstable with the details.
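
For example, a very rough skeleton of such a structure (all of these names are hypothetical; nothing like this exists in the codebase yet):

```go
// Hypothetical standalone queue that owns the shrinking policy, so
// unstable would only see a small interface. Sizes are tracked explicitly
// because Go exposes neither the backing array length nor what it pins.
type entryStub struct{ size uint64 }

type entryQueue struct {
	buf          []entryStub // live entries; buf[0] is the logical head
	backingLen   int         // length of the current backing array
	liveBytes    uint64      // summed size of entries in buf
	garbageBytes uint64      // summed size of dropped entries still pinned
}

// truncateFront drops the first n entries, and reallocates when either
// the len/cap heuristic or the byte-based heuristic says to.
func (q *entryQueue) truncateFront(n int) {
	for _, e := range q.buf[:n] {
		q.garbageBytes += e.size
		q.liveBytes -= e.size
	}
	q.buf = q.buf[n:]
	if len(q.buf)*2 < q.backingLen || q.garbageBytes > q.liveBytes {
		q.buf = append([]entryStub(nil), q.buf...)
		q.backingLen = cap(q.buf)
		q.garbageBytes = 0
	}
}
```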

is this issue still open for contribution?

@Aditya-Sood Still open, would you like to take a stab at it? @CaojiamingAlan Are you working on this?

pav-kv avatar Aug 11 '23 09:08 pav-kv

@pavelkalinnikov yes I'd like to try

if @CaojiamingAlan is unable to reply by Sunday I'll get started

Aditya-Sood avatar Aug 12 '23 05:08 Aditya-Sood

hi @pavelkalinnikov, I'll get started on this, could you assign the issue to me?

Aditya-Sood avatar Aug 14 '23 08:08 Aditya-Sood