Fix Compiled Parallel Aggregations
Closes #1099.
Parallel static aggregations fail in compiled mode. A simple procedure to reproduce follows:
CREATE TABLE test (a INT, b INT);
INSERT INTO test (a, b) VALUES (1, 2);
SELECT SUM(a), SUM(b) FROM test;
Executing the final query produces the following error messages:
[2020-09-17 18:31:34.843] [execution_logger] [error] ERROR IN MODULE:
Stored value type does not match pointer operand type!
store %1* %16, %"class.terrier::execution::sql::IntegerSumAggregate"** %5
%"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
store %1* %22, %"class.terrier::execution::sql::IntegerSumAggregate"** %8
%"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
store %1* %5, %"class.terrier::execution::sql::IntegerSumAggregate"** %2
%"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
store %1* %7, %"class.terrier::execution::sql::IntegerSumAggregate"** %3
%"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
store %1* %31, %"class.terrier::execution::sql::IntegerSumAggregate"** %10
%"class.terrier::execution::sql::IntegerSumAggregate"*Stored value type does not match pointer operand type!
store %1* %35, %"class.terrier::execution::sql::IntegerSumAggregate"** %12
%"class.terrier::execution::sql::IntegerSumAggregate"*
The majority of this error message is produced by the `Verifier::visitStoreInst()` function in `Verifier.cpp` within LLVM. As the message suggests, we generate IR with a type mismatch between the stored value and the location (address) at which it is stored. The (simplified) syntax for an LLVM IR store looks like:

store <ty> <value>, <ty>* <pointer>

and the LLVM type system (enforced in part by `Verifier.cpp`) requires that the type of the `<pointer>` operand be a pointer to the type of the `<value>` operand.
The issue arises in the `MergeAggregates` function. We produce the following bytecode for `MergeAggregates`:
Function 0 <MergeAggregates>:
Frame size 48 bytes (2 parameters, 6 locals)
param queryState: offset=0 size=8 align=8 type=*struct{*ExecutionContext,struct{IntegerSumAggregate,IntegerSumAggregate}}
param pipelineState: offset=8 size=8 align=8 type=*struct{struct{IntegerSumAggregate,IntegerSumAggregate}}
local tmp1: offset=16 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp2: offset=24 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp3: offset=32 size=8 align=8 type=*IntegerSumAggregate
local tmp4: offset=40 size=8 align=8 type=*IntegerSumAggregate
0x00000000 Lea local=&tmp1 local=queryState i32=8
0x00000010 IntegerSumAggregateMerge local=tmp1 local=pipelineState
0x0000001c Lea local=&tmp2 local=queryState i32=8
0x0000002c Lea local=&tmp3 local=tmp2 i32=16
0x0000003c Lea local=&tmp4 local=pipelineState i32=16
0x0000004c IntegerSumAggregateMerge local=tmp3 local=tmp4
0x00000058 Return
The error occurs at the final `Lea` bytecode: `Lea &tmp4, pipelineState, 16`. Within the LLVM engine, the `Lea` TPL instruction is implemented in terms of a `store` in LLVM IR.
We have that the `pipelineState` type looks like:
struct {
struct {
IntegerSumAggregate,
IntegerSumAggregate
}
}
Meanwhile, the type of local `tmp4` is simply `*IntegerSumAggregate`.
To construct the `store` instruction from the bytecode, we perform the following procedure (see the sketch after this list):

- Get the offset from the 2nd (0-indexed) operand to the `Lea` instruction (16)
- Compute the element within `pipelineState` that contains this offset; `pipelineState` is a structure with only a single member (itself a structure), so the result of this computation is the sole member of `pipelineState`, with type `struct {IntegerSumAggregate, IntegerSumAggregate}`
- Compute the index of this element within the top-level structure (`pipelineState`); this calculation results in 0 as the index
- Compute a pointer to the element at the index computed above; this results in a pointer with type `*struct {IntegerSumAggregate, IntegerSumAggregate}`
- Create a `store` instruction that stores this computed pointer to `&tmp4`
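To see the second and third steps concretely, here is a small, standalone sketch (hypothetical: the 16-byte `{i64, i64}` body is a stand-in for `IntegerSumAggregate`, chosen so the second aggregate sits at offset 16) showing that `llvm::StructLayout::getElementContainingOffset(16)` on the outer structure returns element 0, whose type is the entire nested structure:

```cpp
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::LLVMContext ctx;
  llvm::Module module("lea_demo", ctx);
  const llvm::DataLayout &dl = module.getDataLayout();

  // Stand-in for IntegerSumAggregate: any 16-byte body gives the same layout.
  llvm::Type *i64 = llvm::Type::getInt64Ty(ctx);
  llvm::Type *agg_body[] = {i64, i64};
  auto *agg = llvm::StructType::create(ctx, agg_body, "IntegerSumAggregate");

  // struct {IntegerSumAggregate, IntegerSumAggregate}
  llvm::Type *inner_body[] = {agg, agg};
  auto *inner = llvm::StructType::create(ctx, inner_body, "inner");

  // pipelineState: struct { struct {IntegerSumAggregate, IntegerSumAggregate} }
  llvm::Type *outer_body[] = {inner};
  auto *outer = llvm::StructType::create(ctx, outer_body, "pipelineState");

  // Offset 16 names the SECOND aggregate, but the outer struct's only element
  // spans bytes [0, 32), so the containing element is index 0 -- the whole
  // nested struct, not the lone aggregate the Lea was meant to address.
  const llvm::StructLayout *layout = dl.getStructLayout(outer);
  llvm::outs() << layout->getElementContainingOffset(16) << "\n";  // prints 0
  return 0;
}
```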
The resulting `store` IR therefore looks like (in pseudo-IR):
store *struct{IntegerSumAggregate, IntegerSumAggregate} val, **IntegerSumAggregate ptr
`Verifier::visitStoreInst()` contains an assertion that verifies (essentially) the following type constraint:

typeof(*ptr) == typeof(val)

We fail this assertion because we attempt to store a pointer to the structure `struct {IntegerSumAggregate, IntegerSumAggregate}` to a location that expects a pointer to a lone `IntegerSumAggregate`.
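To make the constraint concrete, here is a minimal, hypothetical sketch against LLVM's C++ API (not NoisePage code, and assuming a typed-pointer LLVM, as was current when this PR was opened): it emits one well-typed store and one mismatched store, and `verifyFunction()` flags the latter. Note that an assertions-enabled LLVM build would already abort inside `CreateStore()`; when assertions are disabled, the mismatch survives until verification, which matches what we observe.

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::LLVMContext ctx;
  llvm::Module module("store_demo", ctx);
  llvm::IRBuilder<> builder(ctx);

  auto *fn = llvm::Function::Create(
      llvm::FunctionType::get(builder.getVoidTy(), /*isVarArg=*/false),
      llvm::Function::ExternalLinkage, "demo", module);
  builder.SetInsertPoint(llvm::BasicBlock::Create(ctx, "entry", fn));

  // Well-typed: typeof(*ptr) == typeof(val), i.e. store an i64 through an i64*.
  llvm::Value *slot = builder.CreateAlloca(builder.getInt64Ty());
  builder.CreateStore(builder.getInt64(42), slot);

  // Mismatched: store an i32 through an i64*. The verifier reports
  // "Stored value type does not match pointer operand type!".
  builder.CreateStore(builder.getInt32(7), slot);
  builder.CreateRetVoid();

  // verifyFunction() returns true (and prints diagnostics) if the IR is broken.
  return llvm::verifyFunction(*fn, &llvm::errs()) ? 1 : 0;
}
```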
Some preliminary results from the benchmarks supported by OLTPBench follow. I'm about to run through the TPC-H benchmarks with the setup suggested by @thepinetree.
Benchmarks were run on Ubuntu 20.04 server instances (`c5.9xlarge`) on AWS with the following specs:
- 18 X 3404.27 MHz CPUs
- L1 Data 32 KiB (x18)
- L1 Instruction 32 KiB (x18)
- L2 Unified 1024 KiB (x18)
- L3 Unified 25344 KiB (x1)
Results for the benchmarks are averaged over 5 runs.
Average Bytecode Count
Benchmark | Master | Optimization Removed |
---|---|---|
TPC-C | 124 | 141 |
Smallbank | 58 | 67 |
TATP | 197 | 227 |
Benchmarks (Interpreted)
Benchmark | Master | Optimization Removed |
---|---|---|
TPC-C | 337 tps | 337 tps |
Smallbank | 595 tps | 590 tps |
TATP | 442 tps | 426 tps |
Benchmarks (Compiled)
Benchmark | Master | Optimization Removed |
---|---|---|
TPC-C | 337 tps | 327 tps |
Smallbank | 594 tps | 588 tps |
TATP | 545 tps | 537 tps |
Results from running TPC-H benchmarks comparing the current state of `master` and this PR branch. I ran these benchmarks using the "custom" TPC-H setup found in the latest commit on @thepinetree's branch. The scale factor is 1. Each query is executed a minimum of 100 times in each benchmark. Runtimes are averaged over 10 runs; the standard deviation among individual measurements is included in parentheses following the average runtime.
Benchmarks were run on the same Ubuntu 20.04 server instances (`c5.9xlarge`) on AWS, with the same specs as listed above.
Interpreted
Master | PR | Diff |
---|---|---|
77392ms (759) | 77182ms (570) | -0.27% |
Compiled
Master | PR | Diff |
---|---|---|
56000ms (42) | 55961ms (87) | -0.07% |
In both interpreted and compiled execution modes, we see a very slight decrease in the overall runtime of the benchmark. Obviously it makes no sense that adding additional bytecode to be interpreted / compiled makes the system run faster; I think the takeaway from these results is that the overhead of introducing the additional bytecode (rather, of not eliding the `Lea` in circumstances in which it is possible) has such a negligible impact on overall performance that the difference is lost in the noise of the system, despite the fact that Google Benchmark works hard to give us reproducible results and all of the benchmarks were run multiple times independently.
Interested to hear what others' takes may be. I also think it may be worthwhile at this point to mark this "ready for CI" to determine if perfguard uncovers anything different, although I suppose it has also been showing spurious changes in benchmark performance, so it is difficult to arrive at a definitive answer.
The other thing to consider here is the fact that, currently, we don't have any automatic testing mechanism to verify this functionality. For instance, we can now run queries like
CREATE TABLE test (a INT, b INT);
INSERT INTO test (a, b) VALUES (1, 2);
SELECT SUM(a), SUM(b) FROM test;
where we couldn't before, but this behavior is not verified by existing SQL tests or unit tests, because neither form of test is run in anything other than interpreted execution mode. We have issues open for both: #1223 for the former and #1165 for the latter.
From what I can tell looking at the test infrastructures and the corresponding implementation details (in the DBMS itself), support for both of these will require some minor refactors:
- Currently, SQL tracefile tests are all run against a DBMS server instance that, while restarted between individual tracefile tests, is always restarted with the same set of server options (see e.g. here). We could refactor this to accept a set of configurations under which the DBMS should be run, or, perhaps more expediently, just add another `for`-loop around here to run the entirety of each tracefile test under whatever configurations we see fit.
- For compiler unit tests, we just hard-code the execution mode to `Interpret` (here); a sketch of a possible refactor follows this list. As Prashanth points out in the associated issue, it shouldn't be difficult to refactor the structure of the tests to execute all of the existing tests in multiple execution modes, but it does appear it will require some changes within the execution engine itself to account for the fact that bytecode handlers are loaded from a file whose location may vary (also pointed out in the associated issue); a stopgap may be to add a post-build, pre-test step that copies the file to the test directory. There is also the question of whether `compiler_test.cpp` should be "broken up" in some way: it is currently over 3,600 lines (the largest unit test file), which makes it somewhat harder to navigate.
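To illustrate the second point, here is a rough sketch of what mode-parameterized compiler tests could look like using a gtest value-parameterized fixture; `ExecutionMode` and the commented-out `CompileAndRun` helper are illustrative stand-ins, not the actual NoisePage test API:

```cpp
#include "gtest/gtest.h"

// Hypothetical stand-in for the engine's execution-mode enum.
enum class ExecutionMode { Interpret, Compiled };

class CompilerParamTest : public ::testing::TestWithParam<ExecutionMode> {};

TEST_P(CompilerParamTest, ParallelStaticAggregate) {
  const ExecutionMode mode = GetParam();
  // Hypothetical helper: compile/run the plan under `mode` and check results,
  // instead of hard-coding Interpret as the current tests do.
  // auto result = CompileAndRun("SELECT SUM(a), SUM(b) FROM test", mode);
  // EXPECT_EQ(result.rows, expected_rows);
  (void)mode;  // placeholder so the sketch compiles as-is
}

// Every existing TEST_P body runs once per execution mode.
INSTANTIATE_TEST_SUITE_P(AllModes, CompilerParamTest,
                         ::testing::Values(ExecutionMode::Interpret,
                                           ExecutionMode::Compiled));
```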
Nice analysis! Just a few points:
- Can you provide performance + avg. bytecode counts for TPC-H queries individually?
- Overhead in other parts of the system is most likely hiding any potential overhead incurred by this change; TPC-H SF1 should complete far faster than 56s. Can you hand-code a TPL program that stresses your modification and compare with master?
- If you want to add more testing, pull the codegen harness from upstream TPL here. It runs codegen tests in all permutations of parallelism and execution modes.
Thanks for the feedback @pmenon!
- Absolutely, should have done this in the first place, I'll run those and post them here.
- Perhaps my explanation above wasn't all that clear: these results aren't really directly comparable to TPC-H at scale factor 1 because in the benchmark setup I'm running each query (at least) 100 times - I'm not sure how many times each query is run in vanilla TPC-H? So the total runtime isn't really useful apart from the relative comparison between master and the PR branch. That said, I can absolutely work up a TPL program that stresses this change, but I'm not entirely sure what we would hope to learn from it... My thinking is that I can write an adversarial test program that defines an arbitrarily-deeply nested structure and performs a load from the 0-offset position, and we'd surely see the impact of removing the optimization (1 additional bytecode for each level) but what do we learn from this other than that it is possible to construct adversarial input? I would think that the more relevant question would be: "how often does this type of structure arise naturally in queries that we care about?" which is approximately what the benchmarks above are attempting to determine. If I'm misinterpreting your request or missing the utility of it, let me know! Either way I'll work on the relevant TPL.
- Will do.
Bytecode counts for individual TPC-H queries. Need to qualify these results slightly because they are computed based on the query plans that @malin1993ml manually constructed (see this branch for reference).
Query Number | Master Bytecode Count | PR Bytecode Count | Percentage Increase |
---|---|---|---|
1 | 837 | 931 | +11% |
4 | 912 | 1036 | +13% |
5 | 1790 | 2032 | +13% |
6 | 316 | 355 | +12% |
7 | 1814 | 2032 | +12% |
11 | 1749 | 1988 | +13% |
16 | 1145 | 1268 | +10% |
18 | 1643 | 1863 | +13% |
19 | 677 | 748 | +10% |
The percentage increase in the number of bytecode instructions is relatively consistent across the individual queries (or at least, more consistent than I anticipated). Furthermore, the increases are also larger in magnitude than I expected, at least based on the previous runtime benchmark results: a ~10% increase in the number of bytecode instructions seems less trivial than I have been making it out to be. This is likely the type of result you were expecting to see, @pmenon, or at least the kind that you'd want to be aware of before anything gets merged.
I am not super familiar with the internals of the interpreter, so I can't say with any authority how worrying a 10% increase in bytecode instructions is in that context. However, I can definitely say that such an increase is worrying in the context of machine code. Each of these additional bytecodes corresponds to an additional `Lea` in the LLVM IR; there is of course the question of whether LLVM might be able to optimize any of these additional instructions away (do we run any optimization passes in non-adaptive compiled mode?), so this might not translate directly to a 10% increase in machine instructions, but it is still reason enough to warrant a second look.
Is eliding this optimization prohibitively expensive? Should I revisit the solution? The alternative that comes to mind is somehow manually "collapsing" the inconsistencies at the LLVM type-system level during bytecode generation, but I haven't actually pursued this yet to see how feasible it would be.
Proposing a new approach to solving this problem. While the above does constitute a fix, it does not actually address the root cause. Namely, it is not the fact that we are eliding `Lea`s that is causing the issue, but rather the fact that the `Lea`s we do have do not function properly when confronted with nested structures, which happen to arise in parallel aggregations (though it is interesting that we don't see this phenomenon arise in other scenarios). An example follows.
Per usual, the setup queries:
CREATE TABLE test (a INT, b INT);
INSERT INTO test (a, b) VALUES (0, 1), (1, 2);
And the query of interest:
SELECT SUM(a), SUM(b) FROM test;
For this query, we generate the following bytecode for `MergeAggregates`:
Function 0 <MergeAggregates>:
Frame size 48 bytes (2 parameters, 6 locals)
param queryState: offset=0 size=8 align=8 type=*struct{*ExecutionContext,struct{IntegerSumAggregate,IntegerSumAggregate}}
param pipelineState: offset=8 size=8 align=8 type=*struct{struct{IntegerSumAggregate,IntegerSumAggregate}}
local tmp1: offset=16 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp2: offset=24 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp3: offset=32 size=8 align=8 type=*IntegerSumAggregate
local tmp4: offset=40 size=8 align=8 type=*IntegerSumAggregate
0x00000000 Lea local=&tmp1 local=queryState i32=8
0x00000010 IntegerSumAggregateMerge local=tmp1 local=pipelineState
0x0000001c Lea local=&tmp2 local=queryState i32=8
0x0000002c Lea local=&tmp3 local=tmp2 i32=16
0x0000003c Lea local=&tmp4 local=pipelineState i32=16
0x0000004c IntegerSumAggregateMerge local=tmp3 local=tmp4
0x00000058 Return
So far, so good. As we have seen before, it is the final `Lea` instruction that leads to LLVM type system errors upon translation to IR, but it is more than this: it also computes the incorrect result. To see why, we need to look at the translation of the `Lea` bytecode to its corresponding IR representation (a `store` instruction) in `LLVMEngine::DefineFunction()`. The relevant part of the function is quite concise (I've simplified it a little bit, but not much):
llvm::Type *pointee_type = args[1]->getType()->getPointerElementType();
int64_t offset = llvm::cast<llvm::ConstantInt>(args[2])->getSExtValue();
if (auto struct_type = llvm::dyn_cast<llvm::StructType>(pointee_type)) {
  // Find the element of the (outermost) structure that contains the offset...
  const uint32_t elem_index = dl.getStructLayout(struct_type)->getElementContainingOffset(offset);
  // ...then store a pointer to that element into the destination local.
  llvm::Value *addr = ir_builder->CreateStructGEP(args[1], elem_index);
  ir_builder->CreateStore(addr, args[0]);
}
So now apply this to the final `Lea` from the bytecode above. We are attempting to compute the address of the second `IntegerSumAggregate` member of the `pipelineState` local, where `pipelineState` looks like:
struct {
struct {
IntegerSumAggregate,
IntegerSumAggregate
}
}
The value of `offset` is 16 (an argument to the instruction). The index we compute at `elem_index = ...` is 0, though, and this is where the problem arises, because now we simply compute a pointer to index 0 within the outermost structure of `pipelineState`, which actually constitutes a pointer to the first `IntegerSumAggregate` member, rather than the second (as far as the subsequent `IntegerSumAggregateMerge` call is concerned).
So, I argue, we actually have a logic bug here in the translation of the `Lea` bytecode to the corresponding LLVM IR `store`.
Accordingly, the fix should occur at the level of bytecode -> IR translation, rather than at the level of bytecode generation (as I was previously pursuing). I wrote a proof-of-concept fix that rewrites the logic here to account for nested structures. One complication that arises, however, is the fact that some of the types that are non-structural at the level of bytecode are nonetheless represented as structural types in the LLVM IR (e.g. SQL `Integer` types). This makes the code to resolve nested structures slightly more complex, because we lose some of the needed type information when we go from AST types to LLVM types. To account for this, I added AST type information for locals to the `FunctionLocalsMap` that we instantiate when defining each function. Of course, there may be a more elegant solution available that I have not yet considered. A simplified sketch of the approach follows.
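This sketch is hypothetical: it mirrors the free variables `args`, `dl`, and `ir_builder` from the snippet above, and `target_pointee_type` stands for the destination local's pointee type as recovered from the AST type information just mentioned.

```cpp
// Sketch: instead of a single getElementContainingOffset() lookup, descend
// through nested struct types, consuming the offset one level at a time,
// until we reach an element of the type the Lea destination expects.
llvm::Value *addr = args[1];
llvm::Type *pointee = addr->getType()->getPointerElementType();
int64_t remaining = llvm::cast<llvm::ConstantInt>(args[2])->getSExtValue();

while (pointee != target_pointee_type) {
  auto *struct_type = llvm::cast<llvm::StructType>(pointee);
  const llvm::StructLayout *layout = dl.getStructLayout(struct_type);
  const uint32_t elem_index = layout->getElementContainingOffset(remaining);
  remaining -= layout->getElementOffset(elem_index);
  addr = ir_builder->CreateStructGEP(addr, elem_index);  // same form as above
  pointee = struct_type->getElementType(elem_index);
}
ir_builder->CreateStore(addr, args[0]);
```

Each iteration peels off exactly one level of nesting, which is where the linear-in-nesting-depth behavior claimed below comes from.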
Any way it is implemented, this new approach has a number of benefits over the old one:
- It doesn't impact interpreted query execution at all (the generated bytecode is not modified).
- It doesn't impact the execution time of the compiled query (the exact same IR will be generated); instead, only the time required for compilation itself will be impacted, and this difference should be negligible because the number of iterations required to resolve the desired address for the `store` instruction scales linearly with the nesting depth of structures in the TPL bytecode.
> So, I argue, we actually have a logic bug here in the translation of the Lea bytecode to the corresponding LLVM IR store.
I disagree with this. `Lea` is meant to be a simple (potentially scaled) pointer adjustment. It is expected that the higher levels perform the nested drill-down using multiple `Lea`s. The optimization during bytecode generation broke this contract; the logic during LLVM codegen is correct.
Two things:
- Did you collect the numbers from a smaller targeted benchmark to ascertain the potential overhead from the first (technically correct) solution? I'm curious as to what they are.
- While it's possible to retain the bytecode optimization by tweaking LLVM code generation, we'd need to be able to navigate down arbitrary nestings that mix real TPL structs and arrays, and "synthetic" structs like `sql::Integer`. If you have a branch I can take a look at, we can work on it.
Came back to this PR today in light of the fact that it is blocking a number of other PRs, namely #1470 and #1402. I sat down and fired up the usual query to break the DBMS and dump the bytecode so I could look at it:
CREATE TABLE test (a INT, b INT);
INSERT INTO test (a, b) VALUES (1, 2);
SELECT SUM(a), SUM(b) FROM test;
Imagine my surprise then when the final query returned the correct results...
?column? | ?column?
----------+----------
1 | 2
(1 row)
So someone beat me to the punch (not that it was that difficult; this PR has been open since September 2020...). The question was: who, and how?
Further investigation reveals that it was none other than @lmwnshn and his PR #1482. Now, hats off to Wan for his PR and the fact that it gets these queries to run successfully, but unfortunately it really circumvents the issue, rather than addressing it directly.
Prior to Wan's PR, we generated the following bytecode for `MergeAggregates` for the query above:
Function 0 <MergeAggregates>:
Frame size 48 bytes (2 parameters, 6 locals)
param queryState: offset=0 size=8 align=8 type=*struct{*ExecutionContext,struct{IntegerSumAggregate,IntegerSumAggregate}}
param pipelineState: offset=8 size=8 align=8 type=*struct{struct{IntegerSumAggregate,IntegerSumAggregate}}
local tmp1: offset=16 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp2: offset=24 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp3: offset=32 size=8 align=8 type=*IntegerSumAggregate
local tmp4: offset=40 size=8 align=8 type=*IntegerSumAggregate
0x00000000 Lea local=&tmp1 local=queryState i32=8
0x00000010 IntegerSumAggregateMerge local=tmp1 local=pipelineState
0x0000001c Lea local=&tmp2 local=queryState i32=8
0x0000002c Lea local=&tmp3 local=tmp2 i32=16
0x0000003c Lea local=&tmp4 local=pipelineState i32=16
0x0000004c IntegerSumAggregateMerge local=tmp3 local=tmp4
0x00000058 Return
And after Wan's PR, this bytecode becomes:
Function 0 <MergeAggregates>:
Frame size 64 bytes (2 parameters, 8 locals)
param queryState: offset=0 size=8 align=8 type=*struct{*ExecutionContext,struct{IntegerSumAggregate,IntegerSumAggregate}}
param pipelineState: offset=8 size=8 align=8 type=*struct{TableVectorIterator,bool,[7]int8,struct{IntegerSumAggregate,IntegerSumAggregate}}
local tmp1: offset=16 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp2: offset=24 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp3: offset=32 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp4: offset=40 size=8 align=8 type=*IntegerSumAggregate
local tmp5: offset=48 size=8 align=8 type=*struct{IntegerSumAggregate,IntegerSumAggregate}
local tmp6: offset=56 size=8 align=8 type=*IntegerSumAggregate
0x00000000 Lea local=&tmp1 local=queryState i32=8
0x00000010 Lea local=&tmp2 local=pipelineState i32=4312
0x00000020 IntegerSumAggregateMerge local=tmp1 local=tmp2
0x0000002c Lea local=&tmp3 local=queryState i32=8
0x0000003c Lea local=&tmp4 local=tmp3 i32=16
0x0000004c Lea local=&tmp5 local=pipelineState i32=4312
0x0000005c Lea local=&tmp6 local=tmp5 i32=16
0x0000006c IntegerSumAggregateMerge local=tmp4 local=tmp6
0x00000078 Return
The salient point here is the change in the way we compute the address of the second `IntegerSumAggregate` in the pipeline state. In brief, prior to Wan's PR, we elided the fourth `Lea` in the second set of `Lea` instructions when computing this address, because the member we are loading is located at a 0 offset from the base of the structure. Eliding this `Lea` was what led to the incorrect result and the type system error at the LLVM layer. However, with Wan's PR, we have the `TableVectorIterator` stashed in the `pipelineState` structure, and importantly it is located before the nested structure with the two `IntegerSumAggregate`s. Because of this, the nested structure is no longer located at a 0 offset from the start of the structure in which it is embedded, the logic in codegen (`VisitMemberExpr`) cannot elide the `Lea`, and everything works as intended.
Upside: we can now execute compiled parallel aggregations successfully, but there is likely still a logic bug (or oversight) in code generation that we don't hit under any of the current tests.
at times like this, github needs a 💩 react
Per the discussion in standup, although the updates by @lmwnshn fixed the bug as we currently experience it, I am still going to look at addressing the root cause. High-level plan now is:
- Hand code some TPL that triggers the bug
- Fix the bug
- Profit