
erasure coding design overview

dryajov opened this issue 3 years ago • 13 comments

Just a WIP for now; some sections are still missing.

dryajov avatar May 28 '22 18:05 dryajov

I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.

cskiraly avatar May 31 '22 12:05 cskiraly

> I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.

What do you mean by the other way around? Not sure if I follow :)

dryajov avatar May 31 '22 15:05 dryajov

> I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.
>
> What do you mean by the other way around? Not sure if I follow :)

What I mean is that the original block order (1, 2, 3, etc.) goes vertically in the grid representation, forming protection and also acceleration over adjacent blocks. This with the assumption that original block order actually matters for those accessing it, i.e. it is a typical access pattern.

cskiraly avatar Jun 02 '22 00:06 cskiraly

> I still think the ordering, at the block level (not at the symbol level) should be the other way around, although admittedly it only matters if one considers partial access and sequential data access patterns. In that case it allows for the parallel acceleration of sequential reads that you would not have with this ordering.
>
> What do you mean by the other way around? Not sure if I follow :)
>
> What I mean is that the original block order (1, 2, 3, etc.) goes vertically in the grid representation, forming protection and also acceleration over adjacent blocks. This with the assumption that original block order actually matters for those accessing it, i.e. it is a typical access pattern.

I see, yes, this can make sense, but it heavily depends on the placement we choose. I think this needs to be further clarified in the text and worded in such a way that it is clear that placement considerations would affect interleaving and vice versa.

dryajov avatar Jun 02 '22 01:06 dryajov

I think we can leave it as it is for now (with the appropriate clarification), since this is how it is implemented currently, but we can change it if it becomes a bottleneck. This is one part that might also be per-dataset configurable...

dryajov avatar Jun 02 '22 01:06 dryajov

If we consider that a slot is a row (row: blocks belonging to different encoding columns to disperse erasures) then the ordering proposed by @dryajov will be beneficial for those reading sequential sections of the dataset. The other configuration (having sequential blocks 1, 2, 3.. in a column) would only provide acceleration benefits during the encoding. Assuming we encode once and read several times, the ordering proposed by @dryajov seems more beneficial to me.

leobago avatar Jun 02 '22 08:06 leobago

We may have several different layouts. For example, let's specify a simplified one with fair distribution of download bandwidth:

The user provides S*K blocks and requests K+M encoding, saving data over K+M nodes (thus each node stores S blocks).

For the first K nodes, block number Y on node X = original block number Y*K+X (i.e. blocks 0, 1 .. K-1, K... go into nodes 0, 1 .. K-1, 0...). In other words, original block number I goes into block number I/K on node I%K.

For the last M nodes, block number Y on node X is part of the recovery group consisting of the blocks number Y on all K+M nodes, i.e. original blocks Y*K+i, i=0..K-1, plus M recovery blocks.

Note however that this scheme places all original data on the first K nodes and all recovery data on the last M nodes. If we want to spread recovery data fairly over all nodes, we should reshuffle data between servers in each recovery group (note that a single recovery group still occupies block Y on all K+M servers).

The simplest way to do that is to shift every next group by one position relative to previous one:

  • blocks 0 .. K-1 are mapped to servers 0 .. K-1, correspondingly
  • blocks K .. 2*K-1 are mapped to servers 1 .. K, correspondingly
  • blocks 2*K .. 3*K-1 are mapped to servers 2 .. K+1, correspondingly, and so on.

Recovery blocks are shifted accordingly, filling remaining servers in each recovery group:

  • recovery blocks 0 .. M-1 (i.e. recovering data from original blocks 0..K-1) are mapped to servers K .. K+M-1, correspondingly
  • recovery blocks M .. 2*M-1 are mapped to servers K+1 .. K+M-1, 0 correspondingly
  • recovery blocks 2*M .. 3*M-1 are mapped to servers K+2 .. K+M-1, 0, 1 correspondingly
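As an illustrative sketch only (not part of the proposal; the function names `server_for_data` and `server_for_parity` are made up), the shift-by-one placement above can be written as:

```python
# Shift-by-one dispersal: recovery group g occupies servers g..g+K-1 (mod N)
# for data blocks and g+K..g+K+M-1 (mod N) for recovery blocks, so recovery
# data is spread fairly over all N = K + M servers.

def server_for_data(block, K, M):
    """Server holding original data block `block` (0-based)."""
    N = K + M
    g, i = divmod(block, K)      # recovery group and offset inside it
    return (g + i) % N

def server_for_parity(block, K, M):
    """Server holding recovery block `block` (0-based)."""
    N = K + M
    g, j = divmod(block, M)      # recovery group and offset inside it
    return (g + K + j) % N
```

With K=3, M=2 this reproduces the lists above: data blocks 0..2 land on servers 0..2, data blocks 3..5 on servers 1..3, and recovery blocks wrap around past server N-1 to server 0.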

Bulat-Ziganshin avatar Jun 02 '22 23:06 Bulat-Ziganshin

Here's a writeup of the arguments for using "layout B", i.e. the layout described in my previous message, in a form suitable for inclusion in the document. We have improved the layout compared to the previous version, now shifting each next group in the balanced dispersal by K blocks instead of 1.

Preconditions

We assume that:

  • The main source of large data losses is situations where an entire host drops out (temporarily or permanently) or a host loses an entire HDD/SSD
    • Thus, each recovery group should include as many hosts as possible without degrading other characteristics
  • The dataset provided by the user has some internal structure, and sequential data blocks are often more closely related than more distant blocks; e.g. they may be parts of the same file or of closely related files, if the dataset is an archive or a disk image
    • Thus, reading R sequential blocks is more likely than reading other, arbitrarily chosen sets of R blocks
    • Also, if we are going to lose a subset of the data, it's probably better to lose R sequential blocks (i.e. one file entirely) than arbitrary R blocks (i.e. one block in each of R files)
  • The download requests should be fairly distributed among all hosts

Simple layout (reasoning and textual description)

The setting:

  • The user selects recovery groups consisting of N=K+M blocks, where K blocks represent the original data and M blocks are the computed ECC data
  • The entire dataset provided by the user consists of S such groups
  • The user selects the number of storage hosts H, and the amount of data stored on each host

We propose here a layout only for the simple case of H=N hosts, each storing S blocks. More complex layouts are TBD.

Based on assumptions above, we choose the following layout:

  • In order to maximize recovery chances, we distribute each recovery group over all N hosts. This allows recovering all data even if up to M hosts are offline for any reason
  • In order to maximize sequential read throughput, we form each group from K sequential original blocks
  • In order to evenly distribute download requests between all hosts, we evenly spread ECC blocks over hosts
  • In order to max out sequential read throughput, we ensure that if original block R is placed on host Q, then block R+1 is always placed on host (Q+1) mod N (round-robin placement)

Simple layout (strict math definition)

The algorithm describes how user-provided data are placed on hosts, and how ECC data are generated from them and placed as well:

  • On input, we have S*K user-provided data blocks INPUT[I], I=0..S*K-1
  • On output, we have N nodes each storing S blocks OUTPUT[I][J], I=0..N-1, J=0..S-1

First, let's organize data into recovery groups and generate ECC:

  • GROUPED[I][J] = INPUT[I+J*K], I=0..K-1, J=0..S-1 are the original data
  • GROUPED[K .. K+M-1][J] = ECC(GROUPED[0..K-1][J]), J=0..S-1 are the computed ECC data - M recovery blocks computed from K original blocks of the same recovery group

Now, the naive block dispersal (the first K nodes contain all data blocks):

  • OUTPUT[I][J] = GROUPED[I][J], I=0..N-1, J=0..S-1

And balanced, round-robin block dispersal (data and ECC blocks are evenly distributed between nodes):

  • OUTPUT[(I+K*J) mod N][J] = GROUPED[I][J], I=0..N-1, J=0..S-1
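The grouping and the balanced dispersal above can be sketched in Python (an illustrative sketch only; `layout_b` is a made-up name, and a tagged tuple stands in for a real computed ECC block):

```python
# Sketch of the strict math definition: group S*K input blocks into S
# recovery groups of N = K + M blocks, then place them with the balanced,
# round-robin dispersal OUTPUT[(I + K*J) mod N][J] = GROUPED[I][J].
# A tagged tuple ("ecc", i, j) stands in for a real computed ECC block.

def layout_b(input_blocks, K, M):
    N = K + M
    S = len(input_blocks) // K
    # GROUPED[I][J]: block I of recovery group J
    grouped = [[None] * S for _ in range(N)]
    for j in range(S):
        for i in range(K):
            grouped[i][j] = input_blocks[i + j * K]   # original data
        for i in range(K, N):
            grouped[i][j] = ("ecc", i - K, j)         # placeholder ECC block
    # Balanced dispersal: shift each next group by K positions
    output = [[None] * S for _ in range(N)]
    for j in range(S):
        for i in range(N):
            output[(i + K * j) % N][j] = grouped[i][j]
    return output
```

For example, with K=2, M=1 and input blocks [0, 1, 2, 3], original blocks 0, 1, 2, 3 land on nodes 0, 1, 2, 0 respectively, so consecutive original blocks always sit on consecutive nodes, matching the round-robin property above.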

Block dispersal

[figures: naive vs. balanced block dispersal; images courtesy of @leobago]

Bulat-Ziganshin avatar Jun 09 '22 03:06 Bulat-Ziganshin

@Bulat-Ziganshin looks good!

> Also, if we are going to lose a subset of the data, it's probably better to lose R sequential blocks (i.e. one file entirely) than arbitrary R blocks (i.e. one block in each of R files)

Just to clarify, we're always treating a dataset as a single unit; this means that there isn't any "preference" in which blocks are lost, except for preventing entire columns from being lost.

Also, to clarify the terminology, what precisely do you mean by "recovery group"? As far as I can infer, it would be what we refer to as the "column" in the original writeup, i.e. an entire codeword?

dryajov avatar Jun 09 '22 13:06 dryajov

@leobago the graphic seems to be inverted: K would be the column and S would be the row; consequently, M is also stacked in the column, and additional parity columns are simply appended to the original dataset, ending up in the S dimension. This doesn't change with the reshuffling proposed by @Bulat-Ziganshin.

dryajov avatar Jun 09 '22 13:06 dryajov

As a third alternative (layout "C"), @dryajov proposed a "diagonal layout", which allows appending ECC blocks to the datastore, simplifies indexing, and makes it easier to add recovery on top of pure data or to replace the recovery scheme with a different one.

I will show it with an example first. Suppose we have 3 groups of 5 blocks each and want to add 2 parity blocks to each group. So:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14  => Data blocks
15 16 17 18 19 20  => Parity blocks

Their layout on 5+2=7 nodes will be:

0  7 14  => Node 0
1  8 15  => Node 1
2  9 16  => Node 2
3 10 17  => Node 3
4 11 18  => Node 4
5 12 19  => Node 5
6 13 20  => Node 6

Dmitry's idea is to place recovery groups on the "diagonals" of this matrix, i.e.:

0   8   16  3   11  19  6   => Recovery group 0
7   15  2   10  18  5   13  => Recovery group 1
14  1   9   17  4   12  20  => Recovery group 2

i.e. each group includes exactly one block on each line (i.e. each node), and shifts one position right on each next line.

Mathematically, this means that group #I includes exactly one block from each server #J: the block placed at column (I+J) mod S. This block's absolute number is ((I+J) mod S) * N + J.

So, the recovery group #I includes blocks with absolute numbers ((I+J) mod S) * N + J for each J in 0..N-1 (N=K+M is 7=5+2, and S=3 in this example).
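The group indexing above, as a sketch (the function name `diagonal_group` is made up):

```python
# Diagonal ("layout C") recovery groups: the block stored at column c on
# node j has absolute number c*N + j, and group i takes, from each node j,
# the block at column (i + j) mod S.

def diagonal_group(i, N, S):
    """Absolute block numbers in recovery group i under the diagonal layout."""
    return [((i + j) % S) * N + j for j in range(N)]
```

For the example above (N=7, S=3), this reproduces recovery groups 0, 1, and 2 exactly as listed.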

Bulat-Ziganshin avatar Jun 16 '22 10:06 Bulat-Ziganshin

The "layout B" has an advantage for sequential reading when some nodes are lost: if we have K+M encoding, then we can receive one block per node from any K nodes and perform a single decoding operation to recover K sequential data blocks (because they belong to a single recovery group). In layouts A and C, K sequential blocks are distributed among different recovery groups.

E.g. in the picture below, we can read any 4 blocks of [1, 2, 3, 4, P1, P2] and recover the first 4 blocks of the file using a single decode operation. With layout A or C, this requires reading more data and performing multiple decoding operations.

[image: a recovery group row with data blocks 1-4 and parity blocks P1, P2]
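As a runnable toy version of this argument (a single-parity XOR code, i.e. M=1, stands in for a real K+M erasure code; all names and values here are made up for illustration):

```python
from functools import reduce
from operator import xor

# One layout-B "row": K = 4 sequential data blocks plus a single XOR parity
# block, one block per node. Losing any one node still lets us recover the
# 4 sequential blocks with a single decode (here: one XOR pass).
K = 4
data = [0b0011, 0b0101, 0b1111, 0b1001]   # sequential blocks of the file
parity = reduce(xor, data)                 # P1 = XOR of the whole group

row = data + [parity]                      # as stored, one block per node
lost = 2                                   # suppose node 2 is offline
survivors = [b for i, b in enumerate(row) if i != lost]

recovered = reduce(xor, survivors)         # single decoding operation
assert recovered == data[lost]             # the sequential block is restored
```

Because the surviving blocks all belong to one recovery group, one decode yields a contiguous run of the file; in layouts A and C the same read would touch several groups.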

Bulat-Ziganshin avatar Jul 21 '22 20:07 Bulat-Ziganshin

> The "layout B"

I'm not sure I follow which layout is which at this point :). It would be helpful if we could write a short summary of each for reference so we can all follow; maybe we can even do this in a separate writeup (in discussions) and reference it here.

dryajov avatar Jul 21 '22 23:07 dryajov