codex-research
erasure coding design overview
Just a wip for now, some sections are still missing
I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.
What do you mean by the other way around? Not sure if I follow :)
What I mean is that the original block order (1, 2, 3, etc.) goes vertically in the grid representation, providing protection and also acceleration over adjacent blocks. This is under the assumption that the original block order actually matters for those accessing it, i.e. that it is a typical access pattern.
I see, yes this can make sense but it heavily depends on the placement we choose. I think this needs to be further clarified in the text and worded in such a way that it is clear that placement considerations would affect interleaving and vice-versa.
I think we can leave it as it is for now (with the appropriate clarification), since this is how it is implemented currently, but we can change it if it becomes a bottleneck. This is one part that might also be per-dataset configurable...
If we consider that a slot is a row (row: blocks belonging to different encoding columns to disperse erasures) then the ordering proposed by @dryajov will be beneficial for those reading sequential sections of the dataset. The other configuration (having sequential blocks 1, 2, 3.. in a column) would only provide acceleration benefits during the encoding. Assuming we encode once and read several times, the ordering proposed by @dryajov seems more beneficial to me.
We may have several different layouts. F.e. let's specify one of them, a simplified one with fair distribution of download bandwidth:
User provides S*K blocks and requests K+M encoding, saving data over K+M nodes (thus each node stores S blocks).
For the first K nodes, block number Y on node X = original block number Y*K+X (i.e. blocks 0, 1 .. K-1, K... go onto nodes 0, 1 .. K-1, 0...). In other words, original block number I goes into block number I/K on node I%K.
For the last M nodes, block number Y on node X is part of the recovery group consisting of blocks number Y on all K+M nodes, i.e. original blocks Y*K+i, i=0..K-1, plus M recovery blocks.
Note however that this scheme places all original data on the first K nodes and all recovery data on the last M nodes. If we want to spread recovery data fairly over all nodes, we should reshuffle data between servers in each recovery group (note that a single recovery group still occupies block Y on all K+M servers).
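The naive mapping above can be sketched in a few lines of Python; the function name and the (node, position) tuple convention are ours, not part of the design:

```python
def naive_placement(i, k):
    """Naive layout: original block i lands on node i % k at
    position i // k, so the first K nodes hold all original data.
    (Illustrative sketch; naming is ours.)"""
    return (i % k, i // k)

# With K = 3: blocks 0, 1, 2 go to nodes 0, 1, 2 at position 0,
# blocks 3, 4, 5 go to nodes 0, 1, 2 at position 1, and so on.
print([naive_placement(i, 3) for i in range(6)])
# → [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```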
The simplest way to do that is to shift every next group by one position relative to the previous one:
- blocks `0 .. K-1` are mapped to servers `0 .. K-1`, correspondingly
- blocks `K .. 2*K-1` are mapped to servers `1 .. K`, correspondingly
- blocks `2*K .. 3*K-1` are mapped to servers `2 .. K+1`, correspondingly

and so on.

Recovery blocks are shifted accordingly, filling the remaining servers in each recovery group:
- recovery blocks `0 .. M-1` (i.e. recovering data from original blocks 0..K-1) are mapped to servers `K .. K+M-1`, correspondingly
- recovery blocks `M .. 2*M-1` are mapped to servers `K+1 .. K+M-1, 0`, correspondingly
- recovery blocks `2*M .. 3*M-1` are mapped to servers `K+2 .. K+M-1, 0, 1`, correspondingly
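To make the shift-by-one scheme concrete, here is a small Python sketch that lists, for a given recovery group, which servers receive its data blocks and which receive its recovery blocks (the function name and return convention are ours):

```python
def shifted_placement(group, k, m):
    """Shift-by-one dispersal: data blocks of recovery group `group`
    go to servers group, group+1, ..., group+K-1 (mod N), and its M
    recovery blocks fill the remaining servers, also shifted.
    (Illustrative sketch; naming is ours.)"""
    n = k + m
    data_servers = [(group + j) % n for j in range(k)]
    parity_servers = [(group + k + j) % n for j in range(m)]
    return data_servers, parity_servers

# K = 4, M = 2: group 0 keeps data on servers 0..3, parity on 4..5;
# group 1 is shifted one position to the right, wrapping around.
print(shifted_placement(0, 4, 2))  # → ([0, 1, 2, 3], [4, 5])
print(shifted_placement(1, 4, 2))  # → ([1, 2, 3, 4], [5, 0])
```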
That's a writeup of the arguments for using "layout B", i.e. the layout described in my previous message, put in a form suitable for inclusion in the document. We have improved the layout compared to the previous version, now shifting each next group in the balanced dispersal by K blocks instead of 1.
Preconditions
We assume that:
- The main sources of large data losses are situations where an entire host drops out (temporarily or permanently) or a host loses an entire HDD/SSD
- Thus, each recovery group should include as many hosts as possible without degrading other characteristics
- The dataset provided by the user has some internal structure, and sequential data blocks are often more closely related than more distant ones; f.e. they may be parts of the same file, or of closely related files if the dataset is an archive or a disk image
- Thus, reading R sequential blocks is more likely than reading any other, arbitrarily chosen set of R blocks
- Also, if we are going to lose a subset of the data, it's probably better to lose R sequential blocks (i.e. one file entirely) than R arbitrary blocks (i.e. one block in each of R files)
- Download requests should be fairly distributed among all hosts
Simple layout (reasoning and textual description)
The setting:
- User selects recovery groups consisting of N=K+M blocks where K blocks represent the original data and M blocks are the computed ECC data
- The entire dataset provided by user consists of S such groups
- User selects the number of storage hosts H, and the amount of data stored on each host

We propose here a layout only for the simple case of H=N hosts, each storing S blocks. More complex layouts are TBD.
Based on assumptions above, we choose the following layout:
- In order to maximize recovery chances, we distribute each recovery group over all N hosts. This allows all data to be recovered even if up to M hosts are offline for any reason
- In order to maximize sequential read throughput, we build each group from K sequential original blocks
- In order to evenly distribute download requests between all hosts, we evenly spread ECC blocks over the hosts
- In order to max out sequential read throughput, we ensure that if original block R is placed on host Q, then block R+1 is always placed on host `(Q+1) mod N` (round-robin placement)
Simple layout (strict math definition)
The algorithm describes how user-provided data are placed on hosts, and how ECC data are generated from them and placed as well:
- On input, we have `S*K` user-provided data blocks `INPUT[I], I=0..S*K-1`
- On output, we have N nodes, each storing S blocks: `OUTPUT[I][J], I=0..N-1, J=0..S-1`
First, let's organize data into recovery groups and generate ECC:
- `GROUPED[I][J] = INPUT[I+J*K], I=0..K-1, J=0..S-1` are the original data
- `GROUPED[K .. K+M-1][J] = ECC(GROUPED[0..K-1][J]), J=0..S-1` are the computed ECC data: M recovery blocks computed from the K original blocks of the same recovery group
Now, the naive block dispersal (the first K nodes contain all data blocks):

`OUTPUT[I][J] = GROUPED[I][J], I=0..N-1, J=0..S-1`
And balanced, round-robin block dispersal (data and ECC blocks are evenly distributed between nodes):
`OUTPUT[(I+K*J) mod N][J] = GROUPED[I][J], I=0..N-1, J=0..S-1`
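As a sanity check on these formulas, here is a small Python sketch that builds the GROUPED matrix and applies the balanced, round-robin dispersal. The `('data', …)`/`('ecc', …)` labels are our own illustrative convention, and the ECC contents are not actually computed:

```python
def disperse(s, k, m):
    """Build OUTPUT[node][slot] via the balanced round-robin dispersal
    OUTPUT[(I + K*J) mod N][J] = GROUPED[I][J].
    Cells are labelled ('data', orig_index) or ('ecc', ecc_index, group)
    so the result is easy to inspect. (Sketch; labels are ours.)"""
    n = k + m
    grouped = [[None] * s for _ in range(n)]
    for j in range(s):                        # recovery group j
        for i in range(k):
            grouped[i][j] = ("data", i + j * k)   # GROUPED[I][J] = INPUT[I+J*K]
        for i in range(k, n):
            grouped[i][j] = ("ecc", i - k, j)     # M recovery blocks per group
    output = [[None] * s for _ in range(n)]
    for j in range(s):
        for i in range(n):
            output[(i + k * j) % n][j] = grouped[i][j]
    return output

out = disperse(s=3, k=4, m=2)
# Each node now holds a mix of data and ECC blocks, instead of the
# first K nodes holding all the data.
```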

Images are courtesy of @leobago
@Bulat-Ziganshin looks good!
> Also, if we are going to lose subset of data, it's probably better to lose R sequential blocks (i.e. one file entirely) than arbitrary R blocks (i.e. one block in each of R files)
Just to clarify, we're always treating a dataset as a single unit, this means that there isn't any "preference" in which blocks are lost, except for preventing entire columns from being lost.
Also, to clarify on the terminology, what precisely do you mean by "recovery group"? As far as I can infer, it would be what we refer to the "column" in the original writeup, i.e. an entire codeword?
@leobago the graphic seems to be inverted, K would be the column and S would be the row, consequently, M is also stacked in the column and additional parity columns are simply appended to the original dataset, which end up in the S dimension. This doesn't change with the reshuffling proposed by @Bulat-Ziganshin.
As the third alternative (layout "C"), @dryajov proposed a "diagonal layout" which allows appending ECC blocks to the datastore, simplifies indexing, and makes it easy to add recovery on top of pure data or to replace the recovery scheme with a different one.
I will show it with an example first. Let's say we have 3 groups of 5 blocks each and want to add 2 parity blocks to each group. So:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 => Data blocks
15 16 17 18 19 20 => Parity blocks
Their layout on 5+2=7 nodes will be:
0 7 14 => Node 0
1 8 15 => Node 1
2 9 16 => Node 2
3 10 17 => Node 3
4 11 18 => Node 4
5 12 19 => Node 5
6 13 20 => Node 6
Dmitry's idea is to place recovery groups on the "diagonals" of this matrix, i.e.:
0 8 16 3 11 19 6 => Recovery group 0
7 15 2 10 18 5 13 => Recovery group 1
14 1 9 17 4 12 20 => Recovery group 2
i.e. each group includes exactly one block on each line (i.e. each node), and shifts one position right on each next line.
Mathematically, this means that group #I includes exactly one block from each server #J: the block placed at column (I+J) mod S. This block's absolute number is ((I+J) mod S) * N + J.
So, the recovery group #I includes blocks with absolute numbers ((I+J) mod S) * N + J for each J in 0..N-1 (N=K+M is 7=5+2, and S=3 in this example).
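The formula can be checked directly against the worked example with a few lines of Python (the function name is ours):

```python
def recovery_group(i, s, k, m):
    """Absolute block numbers in recovery group i under the diagonal
    layout: one block per node j, taken from column (i + j) mod S.
    (Sketch mirroring the formula in the text.)"""
    n = k + m
    return [((i + j) % s) * n + j for j in range(n)]

# The worked example: S = 3 groups, K = 5 data + M = 2 parity, N = 7 nodes.
print(recovery_group(0, 3, 5, 2))  # → [0, 8, 16, 3, 11, 19, 6]
print(recovery_group(1, 3, 5, 2))  # → [7, 15, 2, 10, 18, 5, 13]
print(recovery_group(2, 3, 5, 2))  # → [14, 1, 9, 17, 4, 12, 20]
```

Together the three groups cover every one of the 21 blocks exactly once, matching the diagonals listed above.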
The "layout B" has an advantage for sequential reading when some nodes are lost: with K+M encoding, we can receive one block per node from any K nodes and perform a single decoding operation to recover K sequential data blocks (because they belong to a single recovery group). In layouts A and C, K sequential blocks are distributed among different recovery groups.
F.e. in the picture below we can read any 4 of the blocks [1, 2, 3, 4, P1, P2] and recover the first 4 blocks of the file using a single decode operation. With layout A or C, this requires reading more data and performing multiple decoding operations.

The "layout B"
I'm not sure if I follow which layout is which at this point :). It would be helpful if we can write a short summary of each for reference so we can all follow, maybe we can even do this in a separate writeup (in discussions) and reference it here.