accumulo Modify the DefaultCompactionPlanner to set priorities at a table name level

We currently create three compaction services for a running instance of Accumulo: one for root, one for meta, and a default compaction service.

If the DefaultCompactionPlanner is modified to use table names in addition to file sizes, then a single compaction service could be used instead of creating three different compaction services.

Priority levels would be something like root:1, meta:2, accumulo*:3, user:4 where 1 is the highest priority.

This would allow a single compaction service to be used when starting up an instance of Accumulo.

Dec 06 '23 22:12 ddanielr

Implementing this will require modifying CompactionJobPrioritizer.java. Currently this class creates a 16-bit priority, using the first bit for the compaction type and the remaining 15 bits for the number of files. We could change it to use the 16 bits in the following way:

First 2 bits are the table type: accumulo.root, accumulo.metadata, accumulo.*, or user table
Third bit is the compaction type
Remaining 13 bits are the file count. The max checks would need to be adjusted to 2^13.
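
The layout above could be sketched roughly as follows. This is not the actual CompactionJobPrioritizer code: the class name, the enum, the `userInitiated` flag, and the choice that a larger value sorts first (so accumulo.root gets the largest table code) are all assumptions made for illustration.

```java
class TableAwarePriority {
  // Hypothetical 2-bit table-type codes. This sketch assumes a larger
  // priority value is scheduled first, so accumulo.root gets the largest code.
  enum TableType {
    USER(0), SYSTEM(1), METADATA(2), ROOT(3);

    final int code;

    TableType(int code) {
      this.code = code;
    }
  }

  // Largest file count that fits in the remaining 13 bits.
  static final int MAX_FILES = (1 << 13) - 1;

  static TableType classify(String tableName) {
    if (tableName.equals("accumulo.root")) {
      return TableType.ROOT;
    }
    if (tableName.equals("accumulo.metadata")) {
      return TableType.METADATA;
    }
    if (tableName.startsWith("accumulo.")) {
      return TableType.SYSTEM;
    }
    return TableType.USER;
  }

  // Bits 15-14: table type, bit 13: compaction type, bits 12-0: file count.
  static short createPriority(String tableName, boolean userInitiated, int fileCount) {
    int files = Math.min(Math.max(fileCount, 0), MAX_FILES);
    int bits = (classify(tableName).code << 14) | ((userInitiated ? 1 : 0) << 13) | files;
    return (short) bits;
  }

  // A Java short is signed, so compare priorities through their unsigned value.
  static int sortKey(short priority) {
    return Short.toUnsignedInt(priority);
  }

  public static void main(String[] args) {
    short root = createPriority("accumulo.root", false, 5);
    short user = createPriority("mytable", true, 5000);
    System.out.println(sortKey(root) > sortKey(user));
  }
}
```

Because the table type occupies the top bits, any job for a higher-ranked table outranks every job for a lower-ranked one, regardless of compaction type or file count.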

Dec 07 '23 15:12 keith-turner

Is there a reason to limit the size to 16 bits? I'm thinking that we might want more room than 2 bits for table type to allow for expansion. One example would be to include accumulo.fate (or whatever it gets called), and we could find in the future that we want an accumulo.hosting table, or something else outside of the metadata that provides explicit elasticity information. These are things that may want a finer level of control than being lumped into accumulo.*.

Something like a byte for table type, a byte for compaction type, and 2 bytes for count would fit into 32 bits and allow for growth. File counts could be capped at less than the full 2^16 if necessary.
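
A minimal sketch of that 32-bit layout, to make the field boundaries concrete. The class and method names are hypothetical, and the ordering (larger unsigned value sorts first) is an assumption rather than anything Accumulo defines.

```java
class WidePriority {
  // Bits 31-24: table type, bits 23-16: compaction type, bits 15-0: file count.
  static int pack(int tableType, int compactionType, int fileCount) {
    int table = tableType & 0xFF;
    int kind = compactionType & 0xFF;
    // Cap the count at what 16 bits can hold; a lower cap could be used instead.
    int files = Math.min(Math.max(fileCount, 0), 0xFFFF);
    return (table << 24) | (kind << 16) | files;
  }

  static int tableType(int priority) {
    return priority >>> 24;
  }

  static int compactionType(int priority) {
    return (priority >>> 16) & 0xFF;
  }

  static int fileCount(int priority) {
    return priority & 0xFFFF;
  }

  // Table types of 128 and above set the sign bit, so compare as unsigned.
  static int compare(int a, int b) {
    return Integer.compareUnsigned(a, b);
  }

  public static void main(String[] args) {
    int p = pack(3, 1, 500);
    System.out.println(tableType(p) + " " + compactionType(p) + " " + fileCount(p));
  }
}
```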

Dec 07 '23 15:12 EdColeman

Is there a reason to limit the size to 16 bits?

I think it was mainly driven by this map. We used to have a long for the priority, but @dlmarion and I realized that an extremely high-cardinality priority stored as a long could cause an OOME. We could not think of a good reason to have an extremely high-cardinality priority, so we switched the long to a short.

we could find in the future that we want an accumulo.hosting table, or something else outside of the metadata that provides explicit elasticity information. These are things that may want a finer level of control than being lumped into accumulo.*.

The priority is computed by a pluggable component, the CompactionPlanner, so other situations could be handled: a planner could compute the priority any way it likes.

Something like a byte for table type, a byte for compaction type, and 2 bytes for count would fit into 32 bits and allow for growth. File counts could be capped at less than the full 2^16 if necessary.

I think many situations like this can be handled with the short. You may need to do things like take the log or sqrt of a value you are putting in the short, to compress the information while still maintaining sort order. The priority does not need to be super precise for things like file counts; just getting things in the neighborhood is usually good enough. For example, if you wanted to reduce the bits for file count to increase the bits for table types, you could put log2(fileCount) in a 4-bit integer and still get good prioritization on file counts.
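
As a rough illustration of the log2 idea, the sketch below buckets the file count into 4 bits and leaves 11 bits of a short for table type. The class name, the field widths other than the 4-bit count, and the larger-sorts-first ordering are assumptions for the example only.

```java
class LogScaledPriority {
  // floor(log2(fileCount)) clamped to 0..15 so it fits in 4 bits.
  // The mapping is non-decreasing, so sort order on file count is preserved
  // up to ties within a bucket.
  static int log2Bucket(int fileCount) {
    if (fileCount <= 1) {
      return 0;
    }
    int log2 = 31 - Integer.numberOfLeadingZeros(fileCount);
    return Math.min(log2, 15);
  }

  // Hypothetical layout: bits 15-5 table type, bit 4 compaction type,
  // bits 3-0 log2 bucket of the file count. Returned as an int in 0..65535.
  static int pack(int tableType, boolean userInitiated, int fileCount) {
    int table = tableType & 0x7FF;
    return (table << 5) | ((userInitiated ? 1 : 0) << 4) | log2Bucket(fileCount);
  }

  public static void main(String[] args) {
    for (int files : new int[] {1, 2, 10, 100, 1000, 10000}) {
      System.out.println(files + " files -> bucket " + log2Bucket(files));
    }
  }
}
```

With this scheme 1000 and 1023 files land in the same bucket, which matches the point that file-count priority only needs to be approximately right.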

Dec 07 '23 16:12 keith-turner

Repointed this to elasticity because 3.1 still has the concept of internal compactions.

Jan 08 '24 21:01 ddanielr