
Consider replacing single record range deletes with multiple single key deletes

Open · alecgrieser opened this issue on Dec 14, 2021 · 0 comments

At the moment, when a single record is deleted, we typically end up issuing one range delete that covers the whole record. See this implementation in KeyValueUnsplitter:

https://github.com/FoundationDB/fdb-record-layer/blob/ceaaeb538d3328ad6e3b32f03c34c3764a0195b6/fdb-record-layer-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/SplitHelper.java#L239-L250

Note that the previousSizeInfo contains enough information to know how many keys (and which ones) were used to store the previous record, which means that in many cases it knows enough to reconstruct which split points existed for the record. However, even when that information is known, the delete logic still issues one range delete instead of one delete per split point.
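To make the two strategies concrete, here is a minimal sketch using only the public FDB Java API. The record key prefix and the list of split suffixes are illustrative assumptions (the real bookkeeping lives in SplitHelper and previousSizeInfo), but it shows the shape of the change: one tr.clear(Range) today versus one tr.clear(key) per stored key.

```java
import com.apple.foundationdb.Range;
import com.apple.foundationdb.Transaction;
import com.apple.foundationdb.tuple.Tuple;

import java.util.List;

public class RecordDeleteSketch {
    // Current behavior: one range clear that covers every key stored under the record's prefix.
    static void clearWholeRecord(Transaction tr, byte[] recordKeyPrefix) {
        tr.clear(Range.startsWith(recordKeyPrefix));
    }

    // Proposed alternative: when the previously written split points are known, clear each
    // stored key individually. The split suffixes are passed in here for illustration; in the
    // Record Layer they would come from previousSizeInfo rather than being recomputed like this.
    static void clearRecordByKeys(Transaction tr, byte[] recordKeyPrefix, List<Long> splitSuffixes) {
        for (long suffix : splitSuffixes) {
            byte[] splitKey = Tuple.from(suffix).pack(recordKeyPrefix);
            tr.clear(splitKey);
        }
    }
}
```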

The current approach has some advantages over issuing multiple single-key deletes, including:

  1. Simplicity. Because all of the keys for the record are stored under a single prefix, it is easier to be confident in this logic than to be sure we have correctly handled every subtlety of how split records work.
  2. Fewer JNI calls. Each clear is its own JNI hop, so issuing a single clear keeps the number of hops down.
  3. Less data in the final commit. Each delete ends up in the final commit as its own mutation, so doing fewer clears means that the final commit message can be smaller.
  4. Less work in the resolver. The single range delete shows up in the FDB transaction resolver as one mutation range rather than as one range for each key.

However, single-key deletes have one advantage: they are easier for the FDB storage engine to optimize. With the current FDB ssd engine (based on SQLite), the difference is usually fairly small: a range delete is applied to the B-tree either by deleting the relevant keys directly or by inserting a range tombstone into the B-tree page covering the relevant range, to be "lazily deleted" later (potentially recursively). For a single-record clear, the range should typically cover one or more keys on a single leaf page in the B-tree and thus shouldn't need a range tombstone at all, though it may require one in some circumstances. (A similar analysis applies to the Redwood storage engine in development.)

For the in-development RocksDB storage engine (or LSM storage engines more generally), the story is a little different, as every range clear necessitates either iterating over the range and issuing a single-key clear for each key read, or inserting a range tombstone. At read time, range tombstones need to be consulted and applied to avoid reading extraneous pages, and they also need to be merged in during compactions. That can be fairly CPU intensive, so minimizing the number of range tombstones would be preferable. In theory, there are FDB server optimizations that could support this (if the storage engine could somehow perform the delete key-by-key more aggressively, at least in some circumstances). There are also possible FDB client fixes (e.g., for any range delete that intersects with a read range in the read-your-writes cache, the FDB client could automatically translate the range delete into a sequence of single-key deletes). Finally, the Record Layer could modify some of its delete paths to decrease the number of range deletes it issues, the most avoidable of which are single record deletes.

Note: at the FDB protocol level, there actually isn't a "single key clear"; all clears are range clears. At the protocol level, the "clear key" operation is translated into a range clear over a single-key range (i.e., to clear a single key, the range starting at that key and ending at that key plus \0 is cleared). So, for any storage engine to optimize the single-key range clear case, it would need to either change the protocol to introduce a separate ClearKey mutation type or translate the range back into its original single key.
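For illustration, here is a hedged sketch of that translation using the standard FDB Java bindings. The method name is made up; in real code tr.clear(key) is the natural call, and the binding and protocol do this expansion for you.

```java
import com.apple.foundationdb.Transaction;
import com.apple.foundationdb.tuple.ByteArrayUtil;

public class SingleKeyClearSketch {
    // Clearing one key is equivalent to clearing the range [key, key + 0x00).
    // This only demonstrates the protocol-level equivalence described above.
    static void clearSingleKeyAsRange(Transaction tr, byte[] key) {
        byte[] end = ByteArrayUtil.join(key, new byte[]{0x00});
        tr.clear(key, end);
    }
}
```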

As a note: one slight optimization over naïvely issuing a series of single-key clears would be to issue (1) one write conflict range over the whole record and (2) one clear for each key. This would decrease the commit size relative to the naïve approach (the final commit would contain one mutation range for each key and one conflict range for each record, instead of one mutation range and one conflict range for each key), though it would still be bigger than the current delete, and it would mean that the same data is sent to the resolver as with the current approach. However, it does come at the cost of one additional JNI hop over the naïve approach.
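A sketch of that variant, again using only the standard FDB Java bindings. The per-mutation NEXT_WRITE_NO_WRITE_CONFLICT_RANGE option is one possible way to suppress the per-key conflict ranges; it is an assumption here, and it costs an extra JNI call per mutation, so the accounting above may have a different mechanism in mind.

```java
import com.apple.foundationdb.Range;
import com.apple.foundationdb.Transaction;

import java.util.List;

public class ConflictRangeDeleteSketch {
    static void clearRecordKeysWithSingleConflictRange(Transaction tr, byte[] recordKeyPrefix,
                                                       List<byte[]> recordKeys) {
        for (byte[] key : recordKeys) {
            // Ask the client not to add a write conflict range for the next mutation.
            tr.options().setNextWriteNoWriteConflictRange();
            tr.clear(key);
        }
        // One conflict range covering everything under the record's prefix.
        Range recordRange = Range.startsWith(recordKeyPrefix);
        tr.addWriteConflictRange(recordRange.begin, recordRange.end);
    }
}
```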
