drill
drill copied to clipboard
DRILL-6806: Moving code for a HashAgg partition into separate class.
Goal
The goal of this change is to move code for managing a partition in HashAgg into a seperate class, similar to the HashPartition class in HashJoinBatch. This has several advantages:
- Ease of testing: We can test basic partition logic like spilling independently of the rest of the HashAgg operator.
- Reducing generated code: Currently HashAggTemplate contains the bulk of the code for HashAgg and it is a generated class. The larger the HashAggTemplate class is, the more space that the generated code for HashAgg consumes. Additionally generated code is more difficult to debug during testing, so moving code out of HashAggTemplate into a non-generated class simplifies debugging.
- Memory Calculations: Moving the code for a partition into a separate class enables us to simplify and unit test the memory calculations in HashAgg.
The original design doc describing the motivation behind the change and how it relates to memory calculations is here.
Implementation HashAggPartition
In order to achieve this I introduced some interfaces:
- HashAggPartition: This is the interface for a partition. It is useful to provide an interface for the partition so that we can test other pieces of operator logic with a mock partition.
- HashAggMemoryCalculator: This currently encapsulates the logic for reserving and releasing space space in HashAgg's allocator at various points in the operator's execution. This interface was created so that the HashAggPartition could rely on a clear calculator interface that could be swapped out with different implementations for testing.
- HashAggBatchAllocator: This interface includes the method that is used to allocate space for an outgoing batch. This interface is implemented by HashAggTemplate and is required by the HashAggPartition class. The interface is necessary because HashAggTemplate is generated at runtime, while the HashAggPartition class is not generated. So HashAggPartition cannot use methods implemented by HashAggTemplate unless they are included in an interface.
- HashAggBatchHolder: The BatchHolder class is an inner class of HashAggTemplate and is generated. Since the code for HashAggPartition is not generated it can only interact with a BatchHolder through an interface. So the HashAggBatchHolder interface was added.
Then the relevant code for HashAggPartition, and HashAggMemoryCalculator calculator were moved out of HashAggTemplate and into implementation classes HashAggPartitionImpl and HashAggMemoryCalculatorImpl.
The rest of the code changes focus on replacing the usages of the following arrays with a single HashAggPartition array in HashAggTemplate:
- hashTables
- batchHolders
- outBatchIndex
- writers
- spilledBatchesCount
- spillFiles
Next Steps
This change is a first step to moving the code for a partition into a separate HashAggPartition class. I did not move all of it in this change in order to make review easier and to minimize the risk for bugs. A follow up change will complete moving all the code relevant to a partition into HashAggPartition. The main body of code that remains to be moved is most of the HashAggTemplate.checkGroupAndAggrValues function.
Additionally some debugging statements which printed out various partition stats were removed in this change. After the code for HashAggTemplate.checkGroupAndAggrValues is moved into HashAggPartition, I will make the stats available to print for debugging on the HashAggPartition class itself.
@Ben-Zvi Please review. This also fixes the code generation errors that I introduced in a previous commit.
@Ben-Zvi Added description of the changes
@ilooner Could we proceed with this PR?
@vdiravka still need to address @Ben-Zvi 's comments. I will try to get it done early next week.
@vdiravka Got overloaded with some deadlines for another project. I won't be able to finish this for this release.