
No-chop merges

ctubbsii opened this issue 5 years ago · 4 comments

Related to #1050 Supersedes https://issues.apache.org/jira/browse/ACCUMULO-3235 and https://issues.apache.org/jira/browse/ACCUMULO-4118

Merges currently require a special "chop" major compaction, which can be slow. The idea is simple: in the metadata entries for files, we track the range in use for that file with a notation.

This additional metadata would be used to merge tablets without re-writing data. Only this extra metadata would be updated (to a narrower range when split, and possibly collapsed if merging adjacent ranges). If a range is missing for a file, the tablet's range will be used instead. This simplifies things for backwards compatibility, and means we don't need to include a redundant range if the tablet isn't splitting/merging.

In order to achieve this, we will need a mechanism to serialize the range information into the file field of the tablet metadata, and we will need to ensure that all reads of the file are confined to the range serialized. Some pitfalls to look out for: files opened multiple times with different ranges, use of block cache, offline file reads, handling duplicate entries for a file with different ranges as the result of a merge, garbage collection of files, etc.
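To make the range notation concrete, here is a hypothetical sketch of what a file entry carrying an optional range might look like. The class name, serialization format, and field layout are all invented for illustration; they are not the actual Accumulo metadata schema, which the rest of this thread works out.

```java
// Hypothetical sketch only: the class name and the ";(start,end]" notation are
// invented, not Accumulo's real metadata format.
public class RangedFileRef {
    private final String path;     // e.g. a path like ".../tables/2/t-0001/F0001.rf"
    private final String startRow; // null = no lower bound (whole file from the start)
    private final String endRow;   // null = no upper bound

    public RangedFileRef(String path, String startRow, String endRow) {
        this.path = path;
        this.startRow = startRow;
        this.endRow = endRow;
    }

    /** Serialize; a missing range means the tablet's own extent applies. */
    public String serialize() {
        if (startRow == null && endRow == null) {
            return path; // backwards compatible: plain path, no redundant range
        }
        return path + ";(" + (startRow == null ? "-inf" : startRow) + ","
            + (endRow == null ? "+inf" : endRow) + "]";
    }
}
```

Note how the no-range case degrades to today's plain path, matching the backwards-compatibility point above.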

ctubbsii avatar Aug 15 '19 20:08 ctubbsii

This is a very involved change, but with high potential benefit.

ctubbsii avatar Aug 15 '19 21:08 ctubbsii

Do you think we could do some small change for this for 2.1? I hate to start a change as complex as this now but maybe we can do something for 2.1.

milleruntime avatar May 25 '22 11:05 milleruntime

> Do you think we could do some small change for this for 2.1? I hate to start a change as complex as this now but maybe we can do something for 2.1.

I don't think so. Not for 2.1. Changing this will not be small. There are too many things to consider, and I would like to release 2.1 this century :laughing:. There are already several big feature changes that need to be wrapped up for 2.1. This should be pushed to 3.0.

ctubbsii avatar May 25 '22 12:05 ctubbsii

@ctubbsii - Can you go ahead and assign this issue for me to work on? This issue looks like something interesting for me to dive into for the 3.0 release after I work on #2820 as it is also related to the metadata table.

cshannon avatar Aug 10 '22 17:08 cshannon

I have started looking into this and doing research on how to go about implementing it. I plan to try to break this up into sub-tasks to implement smaller changes at a time. Some things of course won't be possible in isolation, but I will break it up the best I can. Also, because of the scope of the issue I don't see it being ready for 3.0.0, so should we un-assign this from 3.0.0? I could see some smaller changes being done for 3.0.0, but probably not the entire feature.

Also, one question I had as I've been diving into this: is the goal here to ONLY support no-chop merges, or is there a desire to preserve the current behavior of merges (maybe a flag or something to specify the type)?

I could see scenarios where a user would want to actually do a merge the current way (and take the performance penalty that comes with it), but I'm thinking we probably don't need to preserve the current merge strategy as an option, because a user could simply run a compaction after a merge and accomplish the same thing if desired.

cshannon avatar Feb 04 '23 14:02 cshannon

There is no need to retain the current behavior. Users can always compact the range afterwards if they want, and doing it in that order means it is at least available for query faster, even though the IO performance would be similar.

ctubbsii avatar Feb 06 '23 19:02 ctubbsii

@ctubbsii, @keith-turner, @dlmarion - As a first step to this issue, I have been looking into what it would take to create an iterator to read an RFile that was fenced off by a range or ranges, and I wanted to get some feedback on which approach to proceed with, as I have come across some issues/concerns with each approach I've looked at. Below are the two main ideas I've looked into so far.

1. We could create a new RFile reader/iterator (for the purposes of this discussion, call it FencedRFileReader) that can handle multiple ranges to fence what's returned.

The idea here is that the new FencedRFileReader iterator would take an existing RFile reader as the source, plus a list of 1 or more ranges (or no ranges, meaning the whole file), and then transparently handle iterating, seeking, etc. over the file, skipping rows not in a range. There are a couple of ways I thought of to do this:

  • One way is to have FencedRFileReader extend SeekingFilter or something similar, and then internally the iterator can just handle advancing between rows and ranges by overriding getNextKeyHint(). Essentially it would keep track of the current range and then transparently seek to the next range when each range is exhausted during next(). It would of course need to appropriately handle the other methods as well.

  • Another option is for FencedRFileReader to extend HeapIterator like the original RFile reader does. The idea I had here is to first create a separate RFile reader per range that fences off an RFile by a single range (i.e. RangedRFileReader). Then the new FencedRFileReader could add each single RangedRFileReader as a source, and since it's a HeapIterator it should handle things automatically across multiple sources.

A problem I see with this approach (an iterator that handles more than one range) is that I'm not sure it would work easily, because of having to implement methods in FileSKVIterator such as getFirstKey() and getLastKey(). There are places in the code, like CompactableUtils, that use getFirstKey()/getLastKey() and rely on them for decisions, and I am not sure if this would break now that there would be multiple ranges. I would need to dig into this more, but maybe someone who knows more about the use cases for those methods can comment. I also wonder how things like sampling would work.

A plus side is that this approach lends itself easily to still using a single file entry per file in the metadata table; we can just extend the value to also contain a list of ranges for the file.
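A minimal sketch of the multi-range idea in approach 1, using plain Strings and a SortedMap in place of Accumulo Keys and the RFile reader. The real FencedRFileReader would use seek() to jump between ranges rather than scanning; this only illustrates the "one reader, many ranges" shape.

```java
import java.util.*;

// Illustrative stand-in for approach 1: one reader handling multiple sorted,
// non-overlapping ranges. Strings stand in for Accumulo Keys so the sketch
// is self-contained.
public class MultiRangeFence {
    // Returns the keys of 'data' covered by any [startInclusive, endExclusive) range.
    public static List<String> fenced(SortedMap<String, String> data, List<String[]> ranges) {
        List<String> hits = new ArrayList<>();
        for (String[] r : ranges) {
            // analogous to seeking the underlying reader to the range start,
            // then iterating until that range is exhausted
            hits.addAll(data.subMap(r[0], r[1]).keySet());
        }
        return hits;
    }
}
```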

2. A second approach could be to only create an RFile reader/iterator that handles a single range and return a new reader for each range when using FileManager to open the list of RFiles.

So instead of having a single iterator per RFile that handles multiple ranges like described in approach 1, we would just return multiple RFile readers, one for every range specified.

The problem I see here is that in all of the places where we use a list of files, each file is uniquely identified by TabletFile/StoredTabletFile, and the key is just the Path of the file. This obviously would not work in this case, because the path would now be duplicated, as there would be multiple readers for the same file. We'd need to update those classes, or add a new class, to use a Range as well as the Path to uniquely identify the file. Updating TabletFile to optionally take a Range in addition to the Path, and using that for comparison/equality, may be good enough here.

A plus side is that methods in FileSKVIterator such as getFirstKey()/getLastKey() may be easy to implement, as each reader covers just a subset of the file; and since each reader handles a single continuous range, hopefully code that calls those methods would work without modification (or with less modification).

Also, I was thinking this approach may be easier if there was a separate file entry in the metadata table for each range associated with an RFile, instead of just a single entry like today, and we just update StoredTabletFile to take a Range as part of the comparison/equality; but we could probably still make it work with a single entry.

Thoughts/Comments?

cshannon avatar Feb 12 '23 18:02 cshannon

My immediate curiosity is: What are we doing now for fencing off a single range for the tablet? Of the two options, I am not sure which is the smaller change, but I think it probably makes sense to keep the most minimal change that is closest to what we're already doing today. I suspect this means something closer to the first option, and the second option seems riskier/more complicated, but I do not know without digging into the code myself.

ctubbsii avatar Feb 14 '23 17:02 ctubbsii

@cshannon I like option number 2 that you mentioned. It's hard to say for sure w/o trying it, but that one feels like it might be the best. I looked around the code and MultiIterator seems like it would be part of the solution. Interestingly, MultiIterator does the fencing and combines multiple iterators. We may want to create a standalone fencing iterator for efficiency.

Not sure where this would go in the code exactly, but maybe we could do something like the following for an rfile iterator with a list of ranges. This could be one way to implement option 2.

  SortedKeyValueIterator<Key,Value> fenceIterator(SortedKeyValueIterator<Key,Value> iter,
      List<Range> ranges, IteratorEnvironment env) {
    if (ranges.isEmpty()) {
      return iter;
    } else if (ranges.size() == 1) {
      return new FencedIterator(iter, ranges.get(0));
    } else {
      // deepCopy needs an IteratorEnvironment; FencedIterator is hypothetical at this point
      var fencedIters = ranges.stream()
          .map(range -> (SortedKeyValueIterator<Key,Value>) new FencedIterator(iter.deepCopy(env), range))
          .collect(Collectors.toList());
      return new MultiIterator(fencedIters, false);
    }
  }
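Since FencedIterator does not exist yet, here is a hypothetical, dependency-free sketch of the single-range fencing idea it would implement: clamp the seek start into the fence, and report exhaustion once the fence's end is passed. Plain Strings and a NavigableMap stand in for Accumulo Keys and the underlying reader, and the class name is invented.

```java
import java.util.*;

// Stand-in sketch for a single-range fencing iterator. The "fence" is the
// half-open range [fenceStart, fenceEnd).
class FencedCursor {
    private final NavigableMap<String, String> source; // stands in for the RFile reader
    private final String fenceStart, fenceEnd;
    private Iterator<Map.Entry<String, String>> it;
    private Map.Entry<String, String> top;

    FencedCursor(NavigableMap<String, String> source, String fenceStart, String fenceEnd) {
        this.source = source;
        this.fenceStart = fenceStart;
        this.fenceEnd = fenceEnd;
    }

    /** Seek to the requested start, clamped into the fence. */
    void seek(String requestedStart) {
        String start = requestedStart.compareTo(fenceStart) > 0 ? requestedStart : fenceStart;
        it = source.tailMap(start, true).entrySet().iterator();
        advance();
    }

    private void advance() {
        top = it.hasNext() ? it.next() : null;
        if (top != null && top.getKey().compareTo(fenceEnd) >= 0)
            top = null; // walked past the fence: behave as exhausted
    }

    boolean hasTop() { return top != null; }
    String topKey() { return top.getKey(); }
    void next() { advance(); }
}
```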

keith-turner avatar Feb 26 '23 00:02 keith-turner

Thanks @keith-turner and @ctubbsii for the feedback. I didn't get any time last week, but I should have a chance later this week to poke around a bit and try some prototyping on the multi-range RFile iterator.

cshannon avatar Mar 01 '23 23:03 cshannon

I bumped this to 3.1.0 - it is a new capability and may fit better there rather than a deprecation clean-up release (3.0.0). If this is ready before the 3.0.0 is released, we can move it back if appropriate.

EdColeman avatar Mar 08 '23 19:03 EdColeman

Just a quick update: I've been working on a prototype the last couple of weeks (trying various different things, so it is taking a while) and I should have something by next week to push up as a draft PR to get some feedback on. I have a basic iterator currently working with a few tests, but it still needs work and I'm sure has lots of issues (and not everything is implemented yet). Also, while working on testing, one thing I noticed is that the index probably needs fencing as well, so I need to figure out the best way to fence off the index and handle operations like getLastKey(), getFirstKey(), etc.

cshannon avatar Mar 11 '23 19:03 cshannon

> and handle operations like getLastKey(), getFirstKey() etc.

@cshannon where are you seeing those methods? I was looking around in the code and saw them in FileSKVIterator.

keith-turner avatar Mar 13 '23 02:03 keith-turner

@keith-turner - Right, that's the interface where they are, and the current RFile reader iterator implements that interface, so we would need to implement this or make sure the code that calls the fencing iterator doesn't care about those methods/interface. If we only care about the start/end of a file it should not be a big deal, but I need to dig into it more. For the draft PR they can stay unimplemented either way.

I was looking at those methods because I was thinking we could/should try to implement that interface as well for the fenced reader, to make it easy to replace the iterator where it's used in the code; but maybe we don't do that (it may not make sense), or we just don't implement certain methods (unsupported methods seem pretty common). I'll have to play around with it a bit more.

This should be a lot easier to discuss once I push up what I have.

cshannon avatar Mar 13 '23 10:03 cshannon

I've started looking at how to persist ranges to the file field in the metadata table. The main thing I've noticed digging into it so far is that I will need to find a good way to persist the ranges as a String. It looks like all the values in the metadata table are converted to string representations and encoded as UTF-8, and right now Range encoding only supports bytes. This should be doable; I will just need to write the logic for it.

Currently DataFileValue contains size, number of entries, and an optional time field, all comma separated and encoded as a String. My current plan is to encode the collection of Ranges for a file and append it as the 4th item in the value. To be backwards compatible, the DataFileValue decoding logic will check if there are 4 items when splitting on the comma, and if there's a 4th item it will know that range information exists. Right now time is optional and is not encoded if -1 (not set), so going forward we will have to write it as -1 when it's not set but range information is, so that we can parse properly. Obviously if no range information exists, this just means the entire file is valid, so we don't need a range (or the range is just one with infinite start/stop), which keeps it backwards compatible.
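The backwards-compatible parse described above can be sketched in a standalone class. The real DataFileValue lives in Accumulo and is not extended like this yet; the range encoding itself ("rangeSpec" here) is a placeholder.

```java
// Hedged sketch of the "optional 4th comma-separated item" idea, not the
// real DataFileValue. rangeSpec is a placeholder for the encoded ranges.
public class DataFileValueSketch {
    public final long size, numEntries, time; // time == -1 means unset
    public final String rangeSpec;            // null means the whole file is valid

    public DataFileValueSketch(long size, long numEntries, long time, String rangeSpec) {
        this.size = size;
        this.numEntries = numEntries;
        this.time = time;
        this.rangeSpec = rangeSpec;
    }

    public DataFileValueSketch(String encoded) {
        String[] parts = encoded.split(",");
        size = Long.parseLong(parts[0]);
        numEntries = Long.parseLong(parts[1]);
        time = parts.length >= 3 ? Long.parseLong(parts[2]) : -1;
        // a 4th item means range info exists (and time was written explicitly, possibly -1)
        rangeSpec = parts.length >= 4 ? parts[3] : null;
    }

    public String encode() {
        if (rangeSpec == null)
            return time == -1 ? size + "," + numEntries : size + "," + numEntries + "," + time;
        // with a range present, always write time (as -1 if unset) so parsing is unambiguous
        return size + "," + numEntries + "," + time + "," + rangeSpec;
    }
}
```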

cshannon avatar Mar 19 '23 16:03 cshannon

@ctubbsii and @keith-turner - In regards to my recent comment about storing ranges in the file field, how do you feel about encoding a list of ranges using JSON, or maybe Base64? A list of ranges will be fairly complex with the different fields, so to make creating a string representation easier I was thinking of using JSON, or just encoding using the existing binary logic and then encoding that as Base64.

cshannon avatar Mar 19 '23 16:03 cshannon

> encoding a list of ranges using JSON

:+1:

dlmarion avatar Mar 20 '23 14:03 dlmarion

I would favor using JSON, and converting the old binary fields to also use the same JSON, so we're not mixing and matching formats. Any binary data would necessarily need to be Base64 encoded in order to be placed in the JSON blob, since JSON doesn't support binary data directly.
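As a sketch of the Base64 point, here is how binary row boundaries could be embedded in JSON. Accumulo already uses Gson for external compaction metadata; the JSON here is hand-assembled only to keep the example dependency-free, and the field names are invented.

```java
import java.util.Base64;

// Sketch only: binary row boundaries must be Base64-encoded before they can
// live inside a JSON value, since JSON has no binary type. Field names are
// hypothetical, not Accumulo's actual schema.
public class RangeJson {
    public static String toJson(byte[] startRow, byte[] endRow) {
        var enc = Base64.getEncoder();
        return "{\"startRow\":"
            + (startRow == null ? "null" : "\"" + enc.encodeToString(startRow) + "\"")
            + ",\"endRow\":"
            + (endRow == null ? "null" : "\"" + enc.encodeToString(endRow) + "\"")
            + "}";
    }
}
```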

I'm curious how you intend to handle the situation where ranges are contiguous or overlapping, whether you'd want to try to coalesce these during a merge to reduce the number of ranges, or leave them as-is.

ctubbsii avatar Mar 20 '23 20:03 ctubbsii

Thanks for the feedback, I'll start exploring the JSON option and see how it goes. I may have some follow-up questions on how to handle things as I start working on it, but that's a good starting point.

I was also thinking it would be nice to write the entire DataFileValue as a JSON object (the 3 existing fields plus the new list of Ranges), but then it wouldn't be backwards compatible: you'd need some new way to detect the new format, and code to handle both the old and new ways of storing the data, which adds complexity and could be tricky. So I'm not sure if that's the best thing to do, or whether to instead only serialize the Ranges as a JSON-encoded String and append it as the 4th item in the comma-separated representation of DataFileValue.

In terms of contiguous/overlapping ranges, I plan to collapse them when storing, to reduce the number of ranges. It makes sense to keep as few as possible, as it should improve performance (fewer ranges to track/manage) and reduces the data stored in metadata. There's already a very nice method called mergeOverlapping in Range to handle this, and I'm using it in my PR for the ranged file reader when constructing the iterator for fencing. We may actually be able to remove that call from the ranged reader if we use it when storing in metadata, as it would in theory be redundant: the stored ranges would already be merged together and not need to be checked again.
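Range.mergeOverlapping does this coalescing for real Ranges; the same idea, shown on simple [start, end] int intervals so the sketch is self-contained, looks like:

```java
import java.util.*;

// Self-contained illustration of the coalescing step. Accumulo's
// Range.mergeOverlapping is the real implementation on Range objects.
public class MergeOverlapping {
    public static List<int[]> merge(List<int[]> ranges) {
        List<int[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt(r -> r[0]));
        List<int[]> out = new ArrayList<>();
        for (int[] r : sorted) {
            int[] last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && r[0] <= last[1])
                last[1] = Math.max(last[1], r[1]); // overlapping or touching: extend
            else
                out.add(new int[] {r[0], r[1]});   // disjoint: start a new interval
        }
        return out;
    }
}
```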

cshannon avatar Mar 20 '23 21:03 cshannon

> you'd need some new way to detect the new format and code to handle the old and new ways of storing the data which adds complexity and could be tricky

We can also just rewrite the metadata on upgrade.

> or instead only serialize the Ranges as a JSON encoded String and appending as the 4th item

Before the discussion about JSON, that's close to what I was expecting... some kind of binary representation as another item in the list. With the discussion of JSON, I think it's less confusing to not mix-and-match the list and the JSON. Just pick one or the other. I think either way, you're going to have to be able to detect the old and the new format, or have an upgrade step.

ctubbsii avatar Mar 23 '23 16:03 ctubbsii

My initial plan was to just append the ranges as binary, as I figured that would be simplest since the encoder for a Range is currently binary. But when I started to look more into it last week, I looked at Ample and noticed that all values are actually encoded as Strings for the different columns. The current pattern is that TabletMutatorBase will convert/encode everything as a String value, if it's not already a String, before adding the mutation. External compaction metadata actually uses JSON to encode already, so that pattern exists too.

So, because the established pattern is to store everything in TabletMetadata as String-encoded values, and because there's already precedent for using JSON, that's why I brought up that option.

I agree it is less confusing to not mix/match, so I plan to just follow your suggestion and encode the entire DataFileValue as JSON, and then either detect the legacy vs new format or just update the values on upgrade.

cshannon avatar Mar 24 '23 11:03 cshannon

I talked to @keith-turner about this quite a bit today, and we came up with a bit of an alternative strategy to what I was trying in my original two draft PRs (#3246 and #3286), where I was trying to handle multiple ranges per file with fencing while still storing a single file metadata entry per RFile.

After talking through everything I am going to try the following in one or more new PRs to handle both the reading/fencing case and then the storing of metadata ranges.

  1. After going through the scenarios for how fencing off RFiles might be used with merges, splits, scans, etc., we think it might be better to treat each range as its own file (basically a variation of option 2 I detailed in my post here). The idea is that if we can treat each range as its own file, the rest of the code wouldn't need as much modification, as it's still just dealing with file abstractions.
  2. We would only need to create a fenced RFile iterator that handles a single range (we wouldn't need an iterator that handles multiple ranges anymore). It's to be determined whether the fencing iterator can just implement SortedKeyValueIterator or needs to also implement FileSKVIterator. We may also need to fence the index as well.
  3. For storing files and ranges in the metadata table (DataFileValue), we realized that it may be better to associate a file metadata entry per range, rather than trying to store multiple ranges for a single file entry. This should work better because, after thinking about how the metadata is used for splits, etc., we realized that the current DataFileValue fields of size, numEntries, and time really should be associated per range and not per file. To accomplish this, we think it could work to change the DataFile column qualifier (StoredTabletFile) to also include an optional range, instead of just the URI, to make it unique per range; you'd end up with 1 to many entries per file stored (still just 1 entry if there is no range, or the range covers the entire file).
  4. The code (file operations, scans, etc.) that deals with StoredTabletFiles would hopefully not need a lot of modification if we can encapsulate the fencing and range handling in the iterator, and encapsulate the range in StoredTabletFile, so that ranges are just treated like normal files. In other words (for example), if we do it right, the code that iterates/scans over 10 unique files vs 10 "files" that are really 10 unique ranges of 1 file would hopefully be identical, as the scanning code wouldn't know or care about the difference.
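Point 3 above (making the file identity include an optional range) could look roughly like the following. StoredTabletFile is the real Accumulo class; this standalone stand-in only demonstrates the equality/comparison change, and the range representation is a placeholder.

```java
import java.util.Comparator;
import java.util.Objects;

// Hypothetical stand-in for a path-plus-range file identity. One physical
// RFile can then appear as several distinct "files" in collections keyed
// on this identity.
public class RangedFileKey implements Comparable<RangedFileKey> {
    final String path;
    final String rangeSpec; // null = whole file; format is a placeholder

    RangedFileKey(String path, String rangeSpec) {
        this.path = path;
        this.rangeSpec = rangeSpec;
    }

    @Override public boolean equals(Object o) {
        return o instanceof RangedFileKey k
            && path.equals(k.path) && Objects.equals(rangeSpec, k.rangeSpec);
    }

    @Override public int hashCode() { return Objects.hash(path, rangeSpec); }

    @Override public int compareTo(RangedFileKey o) {
        int c = path.compareTo(o.path);
        // same path: order by range, with "no range" (whole file) sorting first
        return c != 0 ? c : Objects.compare(rangeSpec, o.rangeSpec,
            Comparator.nullsFirst(Comparator.naturalOrder()));
    }
}
```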

Anyway, I'm going to work on it and see how it goes. It may not work as well in practice, or could run into some roadblocks, but if it works it could make things a lot cleaner. As I said, I'll do the work in new PR(s) and keep the current ones open so we can compare the difference.

cshannon avatar Apr 14 '23 20:04 cshannon

Very nice summary @cshannon. One additional thing we realized is that having a virtual file per range will help spur compactions when there are lots of ranges for a single file. For example, if a tablet has 30 ranges for a single file, then when treating each range as a file, the compaction algorithm would probably compact this and prioritize it (by default, compactions prioritize jobs with more files). This is nice because we would probably want a single file with lots of ranges to compact, and with each range being treated as a file this will happen w/o specialized code. In addition to avoiding specialized code for the compaction case, it also avoids the specialized config that would go with it. The existing compaction ratio config will work nicely with virtual files per range.
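The ratio logic being referred to can be illustrated roughly as follows. This mirrors the spirit of the default planner's check (a set of files is worth compacting when the largest file times the ratio is at most the set's total size), not its exact code.

```java
import java.util.*;

// Rough illustration of why many equal-sized virtual files trigger
// compaction under a largest*ratio <= total check. Not Accumulo's
// actual planner code.
public class RatioCheck {
    public static boolean wouldCompact(List<Long> fileSizes, double ratio) {
        long total = fileSizes.stream().mapToLong(Long::longValue).sum();
        long largest = Collections.max(fileSizes);
        return largest * ratio <= total;
    }
}
```

With 30 same-sized virtual ranges the check passes easily, while a tablet with one big file and one small one does not, which matches the "lots of ranges gets prioritized" point above.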

keith-turner avatar Apr 14 '23 22:04 keith-turner