Fix Compiled Parallel Aggregations
Closes #1099.
Parallel static aggregations fail in compiled mode. A simple procedure to reproduce follows:
CREATE TABLE test (a INT, b INT);
INSERT INTO test (a, b) VALUES (1, 2);
SELECT SUM(a), SUM(b) FROM test;
Executing the final query produces the following error messages:
[2020-09-17 18:31:34.843] [execution_logger] [error] ERROR IN MODULE:
Stored value type does not match pointer operand type!
store %1* %16, %"class.terrier::execution::sql::IntegerSumAggregate"** %5
%"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
store %1* %22, %"class.terrier::execution::sql::IntegerSumAggregate"** %8
%"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
store %1* %5, %"class.terrier::execution::sql::IntegerSumAggregate"** %2
%"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
store %1* %7, %"class.terrier::execution::sql::IntegerSumAggregate"** %3
%"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
store %1* %31, %"class.terrier::execution::sql::IntegerSumAggregate"** %10
%"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
store %1* %35, %"class.terrier::execution::sql::IntegerSumAggregate"** %12
%"class.terrier::execution::sql::IntegerSumAggregate"*
The majority of this error message is produced by the `Verifier::visitStoreInst()` function in `Verifier.cpp` within LLVM. As the message suggests, we generate IR with a type mismatch between the stored value and the location (address) at which it is stored. The (simplified) syntax for an LLVM IR store looks like:

store <ty> <value>, <ty>* <pointer>

and the LLVM type system (enforced in part by `Verifier.cpp`) requires that the type of the `<pointer>` operand be a pointer to the type of the `<value>` operand.
The issue arises in the `MergeAggregates` function. We produce the following bytecode for `MergeAggregates`:
Function 0 <MergeAggregates>:
Frame size 48 bytes (2 parameters, 6 locals)
param queryState: offset=0 size=8 align=8 type=*struct{*ExecutionContext,struct{IntegerSumAggregate,IntegerSumAggregate}}
param pipelineState: offset=8 size=8 align=8 type=*struct{struct{IntegerSumAggregate,IntegerSumAggregate}}
local tmp1: offset=16 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp2: offset=24 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp3: offset=32 size=8 align=8 type=*IntegerSumAggregate
local tmp4: offset=40 size=8 align=8 type=*IntegerSumAggregate
0x00000000 Lea local=&tmp1 local=queryState i32=8
0x00000010 IntegerSumAggregateMerge local=tmp1 local=pipelineState
0x0000001c Lea local=&tmp2 local=queryState i32=8
0x0000002c Lea local=&tmp3 local=tmp2 i32=16
0x0000003c Lea local=&tmp4 local=pipelineState i32=16
0x0000004c IntegerSumAggregateMerge local=tmp3 local=tmp4
0x00000058 Return
The error occurs at the final `Lea` bytecode: `Lea &tmp4, pipelineState, 16`. Within the LLVM engine, the `Lea` TPL instruction is implemented in terms of a `store` in LLVM IR.
We have that the `pipelineState` type looks like:
struct {
struct {
IntegerSumAggregate,
IntegerSumAggregate
}
}
Meanwhile, the type of local `tmp4` is simply `*IntegerSumAggregate`.
To construct the `store` instruction from the bytecode, we perform the following procedure (see the sketch after this list):

- Get the offset from the 2nd (0-indexed) operand to the `Lea` instruction (16)
- Compute the element within `pipelineState` that contains this offset; `pipelineState` is a structure with only a single member (itself a structure), so the result of this computation is the sole member of `pipelineState`, with type `struct {IntegerSumAggregate, IntegerSumAggregate}`
- Compute the index of this element within the top-level structure (`pipelineState`); this calculation results in 0 as the index
- Compute a pointer to the element at the index computed above; this results in a pointer with type `*struct {IntegerSumAggregate, IntegerSumAggregate}`
- Create a `store` instruction that stores this computed pointer to `&tmp4`
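To see the second and third steps concretely, here is a small, standalone sketch (hypothetical: the 16-byte `{i64, i64}` body is a stand-in for `IntegerSumAggregate`, chosen so the second aggregate sits at offset 16) showing that `llvm::StructLayout::getElementContainingOffset(16)` on the outer structure returns element 0, whose type is the entire nested structure:

```cpp
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::LLVMContext ctx;
  llvm::Module module("lea_demo", ctx);
  const llvm::DataLayout &dl = module.getDataLayout();

  // Stand-in for IntegerSumAggregate: any 16-byte body gives the same layout.
  llvm::Type *i64 = llvm::Type::getInt64Ty(ctx);
  llvm::Type *agg_body[] = {i64, i64};
  auto *agg = llvm::StructType::create(ctx, agg_body, "IntegerSumAggregate");

  // struct {IntegerSumAggregate, IntegerSumAggregate}
  llvm::Type *inner_body[] = {agg, agg};
  auto *inner = llvm::StructType::create(ctx, inner_body, "inner");

  // pipelineState: struct { struct {IntegerSumAggregate, IntegerSumAggregate} }
  llvm::Type *outer_body[] = {inner};
  auto *outer = llvm::StructType::create(ctx, outer_body, "pipelineState");

  // Offset 16 names the SECOND aggregate, but the outer struct's only element
  // spans bytes [0, 32), so the containing element is index 0 -- the whole
  // nested struct, not the lone aggregate the Lea was meant to address.
  const llvm::StructLayout *layout = dl.getStructLayout(outer);
  llvm::outs() << layout->getElementContainingOffset(16) << "\n";  // prints 0
  return 0;
}
```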
The resulting `store` IR therefore looks like (in pseudo-IR):
store *struct{IntegerSumAggregate, IntegerSumAggregate} val, **IntegerSumAggregate ptr
`Verifier::visitStoreInst()` contains an assertion that verifies (essentially) the following type constraint:

typeof(*ptr) == typeof(val)

We fail this assertion because we attempt to store a pointer to the structure `struct {IntegerSumAggregate, IntegerSumAggregate}` to a location that expects a pointer to a lone `IntegerSumAggregate`.
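To make the constraint concrete, here is a minimal, hypothetical sketch against LLVM's C++ API (not NoisePage code, and assuming a typed-pointer LLVM, as was current when this PR was opened): it emits one well-typed store and one mismatched store, and `verifyFunction()` flags the latter. Note that an assertions-enabled LLVM build would already abort inside `CreateStore()`; when assertions are disabled, the mismatch survives until verification, which matches what we observe.

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::LLVMContext ctx;
  llvm::Module module("store_demo", ctx);
  llvm::IRBuilder<> builder(ctx);

  auto *fn = llvm::Function::Create(
      llvm::FunctionType::get(builder.getVoidTy(), /*isVarArg=*/false),
      llvm::Function::ExternalLinkage, "demo", module);
  builder.SetInsertPoint(llvm::BasicBlock::Create(ctx, "entry", fn));

  // Well-typed: typeof(*ptr) == typeof(val), i.e. store an i64 through an i64*.
  llvm::Value *slot = builder.CreateAlloca(builder.getInt64Ty());
  builder.CreateStore(builder.getInt64(42), slot);

  // Mismatched: store an i32 through an i64*. The verifier reports
  // "Stored value type does not match pointer operand type!".
  builder.CreateStore(builder.getInt32(7), slot);
  builder.CreateRetVoid();

  // verifyFunction() returns true (and prints diagnostics) if the IR is broken.
  return llvm::verifyFunction(*fn, &llvm::errs()) ? 1 : 0;
}
```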
Some preliminary results from the benchmarks supported by OLTPBench follow. I'm about to run through the TPC-H benchmarks with the setup suggested by @thepinetree.
Benchmarks were run on Ubuntu 20.04 server instances (`c5.9xlarge`) on AWS with the following specs:
- 18 X 3404.27 MHz CPUs
- L1 Data 32 KiB (x18)
- L1 Instruction 32 KiB (x18)
- L2 Unified 1024 KiB (x18)
- L3 Unified 25344 KiB (x1)
Results for the benchmarks are averaged over 5 runs.
Average Bytecode Count
Benchmark | Master | Optimization Removed |
---|---|---|
TPC-C | 124 | 141 |
Smallbank | 58 | 67 |
TATP | 197 | 227 |
Benchmarks (Interpreted)
Benchmark | Master | Optimization Removed |
---|---|---|
TPC-C | 337 tps | 337 tps |
Smallbank | 595 tps | 590 tps |
TATP | 442 tps | 426 tps |
Benchmarks (Compiled)
Benchmark | Master | Optimization Removed |
---|---|---|
TPC-C | 337 tps | 327 tps |
Smallbank | 594 tps | 588 tps |
TATP | 545 tps | 537 tps |
Results from running TPC-H benchmarks comparing the current state of `master` and this PR branch. I ran these benchmarks using the "custom" TPC-H setup found in the latest commit on @thepinetree's branch. The scale factor is 1. Each query is executed a minimum of 100 times in each benchmark. Runtimes are averaged over 10 runs; the standard deviation among individual measurements is included in parentheses following the average runtime.
Benchmarks were run on the same Ubuntu 20.04 server instances (`c5.9xlarge`) on AWS, with the same specs as listed above.
Interpreted
Master | PR | Diff |
---|---|---|
77392ms (759) | 77182ms (570) | -0.27% |
Compiled
Master | PR | Diff |
---|---|---|
56000ms (42) | 55961ms (87) | -0.07% |
In both interpreted and compiled execution modes, we see a very slight decrease in the overall runtime of the benchmark. Obviously it makes no sense that adding additional bytecode to be interpreted / compiled makes the system run faster; I think the takeaway from these results is that the overhead of introducing the additional bytecode (rather, of not eliding the `Lea` in circumstances in which it is possible) has such a negligible impact on overall performance that the difference is lost in the noise of the system, despite the fact that Google Benchmark works hard to give us reproducible results and all of the benchmarks were run multiple times independently.
Interested to hear what others' takes may be. I also think it may be worthwhile at this point to mark this "ready for CI" to determine if perfguard uncovers anything different, although I suppose it has also been showing spurious changes in benchmark performance, so it is difficult to arrive at a definitive answer.
The other thing to consider here is the fact that, currently, we don't have any automatic testing mechanism to verify this functionality. For instance, we can now run queries like
CREATE TABLE test (a INT, b INT);
INSERT INTO test (a, b) VALUES (1, 2);
SELECT SUM(a), SUM(b) FROM test;
where we couldn't before, but this behavior is not verified by existing SQL tests or unit tests, because neither form of test is run in anything other than interpreted execution mode. We have issues open for both: #1223 for the former and #1165 for the latter.
From what I can tell looking at the test infrastructures and the corresponding implementation details (in the DBMS itself), support for both of these will require some minor refactors:
- Currently, SQL tracefile tests are all run against a DBMS server instance that, while restarted between individual tracefile tests, is always restarted with the same set of server options (see e.g. here). We could refactor this to accept a set of configurations under which the DBMS should be run, or, perhaps more expediently, just add another `for`-loop around here to run the entirety of each tracefile test under whatever configurations we see fit.
- For compiler unit tests, we just hard-code the execution mode to `Interpret` (here); a sketch of a possible refactor follows this list. As Prashanth points out in the associated issue, it shouldn't be difficult to refactor the structure of the tests to execute all of the existing tests in multiple execution modes, but it does appear it will require some changes within the execution engine itself to account for the fact that bytecode handlers are loaded from a file whose location may vary (also pointed out in the associated issue); a stopgap may be to add a post-build, pre-test step that copies the file to the test directory. There is also the question of whether `compiler_test.cpp` should be "broken up" in some way: it is currently over 3,600 lines (the largest unit test file), which makes it somewhat harder to navigate.
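To illustrate the second point, here is a rough sketch of what mode-parameterized compiler tests could look like using a gtest value-parameterized fixture; `ExecutionMode` and the commented-out `CompileAndRun` helper are illustrative stand-ins, not the actual NoisePage test API:

```cpp
#include "gtest/gtest.h"

// Hypothetical stand-in for the engine's execution-mode enum.
enum class ExecutionMode { Interpret, Compiled };

class CompilerParamTest : public ::testing::TestWithParam<ExecutionMode> {};

TEST_P(CompilerParamTest, ParallelStaticAggregate) {
  const ExecutionMode mode = GetParam();
  // Hypothetical helper: compile/run the plan under `mode` and check results,
  // instead of hard-coding Interpret as the current tests do.
  // auto result = CompileAndRun("SELECT SUM(a), SUM(b) FROM test", mode);
  // EXPECT_EQ(result.rows, expected_rows);
  (void)mode;  // placeholder so the sketch compiles as-is
}

// Every existing TEST_P body runs once per execution mode.
INSTANTIATE_TEST_SUITE_P(AllModes, CompilerParamTest,
                         ::testing::Values(ExecutionMode::Interpret,
                                           ExecutionMode::Compiled));
```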
Nice analysis! Just a few points:
- Can you provide performance + avg. bytecode counts for TPC-H queries individually?
- Overhead in other parts of the system is most likely hiding any potential overhead incurred by this change; TPC-H SF1 should complete far faster than 56s. Can you hand-code a TPL program that stresses your modification and compare with master?
- If you want to add more testing, pull the codegen harness from upstream TPL here. It runs codegen tests in all permutations of parallelism and execution modes.
Thanks for the feedback @pmenon!
- Absolutely, should have done this in the first place, I'll run those and post them here.
- Perhaps my explanation above wasn't all that clear: these results aren't really directly comparable to TPC-H at scale factor 1 because in the benchmark setup I'm running each query (at least) 100 times - I'm not sure how many times each query is run in vanilla TPC-H? So the total runtime isn't really useful apart from the relative comparison between master and the PR branch. That said, I can absolutely work up a TPL program that stresses this change, but I'm not entirely sure what we would hope to learn from it... My thinking is that I can write an adversarial test program that defines an arbitrarily-deeply nested structure and performs a load from the 0-offset position, and we'd surely see the impact of removing the optimization (1 additional bytecode for each level) but what do we learn from this other than that it is possible to construct adversarial input? I would think that the more relevant question would be: "how often does this type of structure arise naturally in queries that we care about?" which is approximately what the benchmarks above are attempting to determine. If I'm misinterpreting your request or missing the utility of it, let me know! Either way I'll work on the relevant TPL.
- Will do.
Bytecode counts for individual TPC-H queries. Need to qualify these results slightly because they are computed based on the query plans that @malin1993ml manually constructed (see this branch for reference).
Query Number | Master Bytecode Count | PR Bytecode Count | Percentage Increase |
---|---|---|---|
1 | 837 | 931 | +11% |
4 | 912 | 1036 | +13% |
5 | 1790 | 2032 | +13% |
6 | 316 | 355 | +12% |
7 | 1814 | 2032 | +12% |
11 | 1749 | 1988 | +13% |
16 | 1145 | 1268 | +10% |
18 | 1643 | 1863 | +13% |
19 | 677 | 748 | +10% |
The percentage increase in the number of bytecode instructions is relatively consistent across the individual queries (or at least, more consistent than I anticipated). Furthermore, the increases are also larger in magnitude than I expected, at least based on the previous runtime benchmark results: a ~10% increase in the number of bytecode instructions seems less trivial than I have been making it out to be. This is likely the type of result you were expecting to see, @pmenon, or at least the kind that you'd want to be aware of before anything gets merged.
I am not super familiar with the internals of the interpreter, so I can't say with any authority how worrying a 10% increase in bytecode instructions is in that context. However, I can definitely say that such an increase is worrying in the context of machine code. Each of these additional bytecodes corresponds to an additional `Lea` in the LLVM IR; there is of course the question of whether LLVM might be able to optimize any of these additional instructions away (do we run any optimization passes in non-adaptive compiled mode?), so this might not translate directly to a 10% increase in machine instructions, but it is still reason enough to warrant a second look.
Is eliding this optimization prohibitively expensive? Should I revisit the solution? The alternative that comes to mind is somehow manually "collapsing" the inconsistencies at the LLVM type-system level during bytecode generation, but I haven't actually pursued this yet to see how feasible it would be.
Proposing a new approach to solving this problem. While the above does constitute a fix, it does not actually address the root cause. Namely, it is not the fact that we are eliding `Lea`s that is causing the issue, but rather the fact that the `Lea`s we do have do not function properly when confronted with nested structures, which happen to arise in parallel aggregations (though it is interesting that we don't see this phenomenon arise in other scenarios). An example follows.
Per usual, the setup queries:
CREATE TABLE test (a INT, b INT);
INSERT INTO test (a, b) VALUES (0, 1), (1, 2);
And the query of interest:
SELECT SUM(a), SUM(b) FROM test;
For this query, we generate the following bytecode for `MergeAggregates`:
Function 0 <MergeAggregates>:
Frame size 48 bytes (2 parameters, 6 locals)
param queryState: offset=0 size=8 align=8 type=*struct{*ExecutionContext,struct{IntegerSumAggregate,IntegerSumAggregate}}
param pipelineState: offset=8 size=8 align=8 type=*struct{struct{IntegerSumAggregate,IntegerSumAggregate}}
local tmp1: offset=16 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp2: offset=24 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp3: offset=32 size=8 align=8 type=*IntegerSumAggregate
local tmp4: offset=40 size=8 align=8 type=*IntegerSumAggregate
0x00000000 Lea local=&tmp1 local=queryState i32=8
0x00000010 IntegerSumAggregateMerge local=tmp1 local=pipelineState
0x0000001c Lea local=&tmp2 local=queryState i32=8
0x0000002c Lea local=&tmp3 local=tmp2 i32=16
0x0000003c Lea local=&tmp4 local=pipelineState i32=16
0x0000004c IntegerSumAggregateMerge local=tmp3 local=tmp4
0x00000058 Return
So far, so good. As we have seen before, it is the final `Lea` instruction that leads to LLVM type system errors upon translation to IR, but it is more than this: it also computes the incorrect result. To see why, we need to look at the translation of the `Lea` bytecode to its corresponding IR representation (a `store` instruction) in `LLVMEngine::DefineFunction()`. The relevant part of the function is quite concise (I've simplified it a little bit, but not much):
llvm::Type *pointee_type = args[1]->getType()->getPointerElementType();
int64_t offset = llvm::cast<llvm::ConstantInt>(args[2])->getSExtValue();
if (auto struct_type = llvm::dyn_cast<llvm::StructType>(pointee_type)) {
  // Find the element of the (outermost) structure that contains the offset...
  const uint32_t elem_index = dl.getStructLayout(struct_type)->getElementContainingOffset(offset);
  // ...then store a pointer to that element into the destination local.
  llvm::Value *addr = ir_builder->CreateStructGEP(args[1], elem_index);
  ir_builder->CreateStore(addr, args[0]);
}
So now apply this to the final `Lea` from the bytecode above. We are attempting to compute the address of the second `IntegerSumAggregate` member of the `pipelineState` local, where `pipelineState` looks like:
struct {
struct {
IntegerSumAggregate,
IntegerSumAggregate
}
}
The value of `offset` is 16 (an argument to the instruction). The index we compute at `elem_index = ...` is 0, though, and this is where the problem arises, because now we simply compute a pointer to index 0 within the outermost structure of `pipelineState`, which actually constitutes a pointer to the first `IntegerSumAggregate` member, rather than the second (as far as the subsequent `IntegerSumAggregateMerge` call is concerned).
So, I argue, we actually have a logic bug here in the translation of the `Lea` bytecode to the corresponding LLVM IR `store`.
Accordingly, the fix should occur at the level of bytecode -> IR translation, rather than at the level of bytecode generation (as I was previously pursuing). I wrote a proof-of-concept fix that rewrites the logic here to account for nested structures. One complication that arises, however, is the fact that some of the types that are non-structural at the level of bytecode are nonetheless represented as structural types in the LLVM IR (e.g. SQL `Integer` types). This makes the code to resolve nested structures slightly more complex, because we lose some of the needed type information when we go from AST types to LLVM types. To account for this, I added AST type information for locals to the `FunctionLocalsMap` that we instantiate when defining each function. Of course, there may be a more elegant solution available that I have not yet considered. A simplified sketch of the approach follows.
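This sketch is hypothetical: it mirrors the free variables `args`, `dl`, and `ir_builder` from the snippet above, and `target_pointee_type` stands for the destination local's pointee type as recovered from the AST type information just mentioned.

```cpp
// Sketch: instead of a single getElementContainingOffset() lookup, descend
// through nested struct types, consuming the offset one level at a time,
// until we reach an element of the type the Lea destination expects.
llvm::Value *addr = args[1];
llvm::Type *pointee = addr->getType()->getPointerElementType();
int64_t remaining = llvm::cast<llvm::ConstantInt>(args[2])->getSExtValue();

while (pointee != target_pointee_type) {
  auto *struct_type = llvm::cast<llvm::StructType>(pointee);
  const llvm::StructLayout *layout = dl.getStructLayout(struct_type);
  const uint32_t elem_index = layout->getElementContainingOffset(remaining);
  remaining -= layout->getElementOffset(elem_index);
  addr = ir_builder->CreateStructGEP(addr, elem_index);  // same form as above
  pointee = struct_type->getElementType(elem_index);
}
ir_builder->CreateStore(addr, args[0]);
```

Each iteration peels off exactly one level of nesting, which is where the linear-in-nesting-depth behavior claimed below comes from.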
Any way it is implemented, this new approach has a number of benefits over the old one:
- It doesn't impact interpreted query execution at all (the generated bytecode is not modified).
- It doesn't impact the execution time of the compiled query (the exact same IR will be generated); instead, only the time required for compilation itself will be impacted, and this difference should be negligible because the number of iterations required to resolve the desired address for the `store` instruction scales linearly with the nesting depth of structures in the TPL bytecode.
> So, I argue, we actually have a logic bug here in the translation of the Lea bytecode to the corresponding LLVM IR store.
I disagree with this. `Lea` is meant to be a simple (potentially scaled) pointer adjustment. It is expected that the higher levels perform the nested drill-down using multiple `Lea`s. The optimization during bytecode generation broke this contract; the logic during LLVM codegen is correct.
Two things:
- Did you collect the numbers from a smaller targeted benchmark to ascertain the potential overhead from the first (technically correct) solution? I'm curious as to what they are.
- While it's possible to retain the bytecode optimization by tweaking LLVM code generation, we'd need to be able to navigate down arbitrary nestings that mix real TPL structs and arrays, and "synthetic" structs like `sql::Integer`. If you have a branch I can take a look at, we can work on it.
Came back to this PR today in light of the fact that it is blocking a number of other PRs, namely #1470 and #1402. I sat down and fired up the usual query to break the DBMS and dump the bytecode so I could look at it:
CREATE TABLE test (a INT, b INT);
INSERT INTO test (a, b) VALUES (1, 2);
SELECT SUM(a), SUM(b) FROM test;
Imagine my surprise then when the final query returned the correct results...
?column? | ?column?
----------+----------
1 | 2
(1 row)
So someone beat me to the punch (not that it was that difficult; this PR has been open since September 2020...). The question was: who, and how?
Further investigation reveals that it was none other than @lmwnshn and his PR #1482. Now, hats off to Wan for his PR and the fact that it gets these queries to run successfully, but unfortunately it really circumvents the issue, rather than addressing it directly.
Prior to Wan's PR, we generated the following bytecode for `MergeAggregates` for the query above:
Function 0 <MergeAggregates>:
Frame size 48 bytes (2 parameters, 6 locals)
param queryState: offset=0 size=8 align=8 type=*struct{*ExecutionContext,struct{IntegerSumAggregate,IntegerSumAggregate}}
param pipelineState: offset=8 size=8 align=8 type=*struct{struct{IntegerSumAggregate,IntegerSumAggregate}}
local tmp1: offset=16 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp2: offset=24 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp3: offset=32 size=8 align=8 type=*IntegerSumAggregate
local tmp4: offset=40 size=8 align=8 type=*IntegerSumAggregate
0x00000000 Lea local=&tmp1 local=queryState i32=8
0x00000010 IntegerSumAggregateMerge local=tmp1 local=pipelineState
0x0000001c Lea local=&tmp2 local=queryState i32=8
0x0000002c Lea local=&tmp3 local=tmp2 i32=16
0x0000003c Lea local=&tmp4 local=pipelineState i32=16
0x0000004c IntegerSumAggregateMerge local=tmp3 local=tmp4
0x00000058 Return
And after Wan's PR, this bytecode becomes:
Function 0 <MergeAggregates>:
Frame size 64 bytes (2 parameters, 8 locals)
param queryState: offset=0 size=8 align=8 type=*struct{*ExecutionContext,struct{IntegerSumAggregate,IntegerSumAggregate}}
param pipelineState: offset=8 size=8 align=8 type=*struct{TableVectorIterator,bool,[7]int8,struct{IntegerSumAggregate,IntegerSumAggregate}}
local tmp1: offset=16 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp2: offset=24 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp3: offset=32 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp4: offset=40 size=8 align=8 type=*IntegerSumAggregate
local tmp5: offset=48 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp6: offset=56 size=8 align=8 type=*IntegerSumAggregate
0x00000000 Lea local=&tmp1 local=queryState i32=8
0x00000010 Lea local=&tmp2 local=pipelineState i32=4312
0x00000020 IntegerSumAggregateMerge local=tmp1 local=tmp2
0x0000002c Lea local=&tmp3 local=queryState i32=8
0x0000003c Lea local=&tmp4 local=tmp3 i32=16
0x0000004c Lea local=&tmp5 local=pipelineState i32=4312
0x0000005c Lea local=&tmp6 local=tmp5 i32=16
0x0000006c IntegerSumAggregateMerge local=tmp4 local=tmp6
0x00000078 Return
The salient point here is the change in the way we compute the address of the second `IntegerSumAggregate` in the pipeline state. In brief, prior to Wan's PR, we elided the fourth `Lea` in the second set of `Lea` instructions when computing this address, because the member we are loading is located at a 0 offset from the base of the structure. Eliding this `Lea` was what led to the incorrect result and the type system error at the LLVM layer. However, with Wan's PR, we have the `TableVectorIterator` stashed in the `pipelineState` structure, and importantly it is located before the nested structure with the two `IntegerSumAggregate`s. Because of this, the nested structure is no longer located at a 0 offset from the start of the structure in which it is embedded, the logic in codegen (`VisitMemberExpr`) cannot elide the `Lea`, and everything works as intended.
Upside: we can now execute compiled parallel aggregations successfully, but there is likely still a logic bug (or oversight) in code generation that we don't hit under any of the current tests.
at times like this, github needs a 💩 react
Per the discussion in standup, although the updates by @lmwnshn fixed the bug as we currently experience it, I am still going to look at addressing the root cause. High-level plan now is:
- Hand code some TPL that triggers the bug
- Fix the bug
- Profit