Assertion failure in Join when ancestors in other domains

Open JustusAdam opened this issue 4 years ago • 7 comments

Setup

I am trying to run a query that computes an average. The graph and the operators are generated from a different language by a compiler, but in SQL it would look something like this:

SELECT sum(x) / count(*)
FROM Tab

Error

The query itself runs fine, but I wanted to test how it would perform if count(*) and sum(x) were computed in different domains. So I hacked the domain assignment to force each of these operators into its own domain.

When I do that, however, the join after the two calculations tries to access a nonexistent index in its right ancestor. I expanded the error message (see below), which now says that the right ancestor with id 4 was short: in the generate_row function, the join tries to access index 2 in the other slice, which only has two elements.

This is the error message for the two-domain case. With four domains it is the same except for the id, which differs because more ingress/egress operators are generated.

'right (4) was short', noria-server/dataflow/src/ops/join.rs:181:21
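
To make the shape of this failure concrete, here is a minimal sketch of a join assembling an output row from a left and a right input row. It is not Noria's code: the `Emit` type, the signature of `generate_row`, and the error text are assumptions made for illustration. It only shows how a right-hand record with two columns fails when the join expects to read column index 2.

```rust
// Hypothetical, simplified model of a join emitting an output row.
// None of these types or signatures are taken from Noria's source.

/// Where an output column comes from: (take from the left input?, column index).
type Emit = (bool, usize);

fn generate_row(left: &[i64], right: &[i64], emit: &[Emit]) -> Result<Vec<i64>, String> {
    emit.iter()
        .map(|&(from_left, col)| {
            let (side, row) = if from_left { ("left", left) } else { ("right", right) };
            // `get` returns None instead of panicking when the row is too short.
            row.get(col).copied().ok_or_else(|| {
                format!("{} was short: wanted column {}, row has {}", side, col, row.len())
            })
        })
        .collect()
}

fn main() {
    let emit = [(true, 0), (true, 1), (false, 2)];
    // A right row with three columns can be indexed at 2.
    println!("{:?}", generate_row(&[1, 10], &[1, 0, 4], &emit));
    // A right row with only two columns cannot, which mirrors the reported error.
    println!("{:?}", generate_row(&[1, 10], &[1, 4], &emit));
}
```

In this model the failure depends only on the length of the record that reaches the join, not on which domain produced it.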

Questions

Is there something I am missing about domains? Can I not just make any operator into its own domain? Are there any invariants around what can go in a domain and what can't?

Runtime graphs

Here are the dot graphs for two domains and four domains and, for good measure, the original (working) single domain.

The relevant operators here are ohua.generated/op_s_acc_0_0 (count(*)) and ohua.generated/op_s_acc_1_0 (sum(x)) and the join afterwards. (The rest is just generated code that does some column renaming)

How to reproduce

I uploaded a branch (join-after-domain-error-reproduction) to my fork that should contain the complete state of the system necessary (including generated operators) to reproduce the error.

In the udf-benchmarks directory, run cargo run --bin features avg-split-domain/two-domains.toml

This will run the two-domain scenario. For one or four domains, use the one-domain.toml or four-domains.toml config respectively.

JustusAdam avatar Oct 29 '19 11:10 JustusAdam

I'm confused.. There's no join in the query you gave? The query looks like it'd hit the same issue as #137, no?

jonhoo avatar Oct 29 '19 13:10 jonhoo

As to your question about domain assignment, you can move most operators into arbitrary domains, as long as you do so before you call migrate (because it adds a bunch of necessary internal operators at domain boundaries). Joins are "special" in that they always require that their inputs are materialized within the same domain as themselves, so moving them may not achieve the effect that you want.

jonhoo avatar Oct 29 '19 13:10 jonhoo

You are right. So basically I give it a different description of this query and it generates one similar to #137 but without the extra views for the two different aggregations.

Interesting. What do you think, would it work if I inserted just an Identity in between the join and its ancestors?

JustusAdam avatar Oct 30 '19 14:10 JustusAdam

I think you'll have to specifically write the query such that the aggregations are done separately and then join them together, as in #137. I'm not sure what purpose the Identity would serve?
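
As a rough illustration of "aggregate separately, then join", the sketch below computes the sum and the count independently and then joins the two results on a key before dividing. It is a plain single-process model, not Noria dataflow code. The `key` column is an assumption added so the two aggregates have something to join on, and `Row`, `sum_by_key`, `count_by_key`, and `average` are hypothetical names.

```rust
use std::collections::HashMap;

// Toy, single-process model of "aggregate separately, then join".
// The `key` column is an assumption made for illustration; it is not
// taken from the query in this issue or from #137.

/// (key, x) rows of the hypothetical table `Tab`.
type Row = (u32, i64);

/// First aggregation: sum(x) per key.
fn sum_by_key(rows: &[Row]) -> HashMap<u32, i64> {
    let mut out = HashMap::new();
    for &(k, x) in rows {
        *out.entry(k).or_insert(0) += x;
    }
    out
}

/// Second aggregation: count(*) per key.
fn count_by_key(rows: &[Row]) -> HashMap<u32, i64> {
    let mut out = HashMap::new();
    for &(k, _) in rows {
        *out.entry(k).or_insert(0) += 1;
    }
    out
}

/// Inner join of the two aggregates on the key, followed by the division.
fn average(rows: &[Row]) -> HashMap<u32, f64> {
    let sums = sum_by_key(rows);
    let counts = count_by_key(rows);
    sums.iter()
        .filter_map(|(k, s)| counts.get(k).map(|c| (*k, *s as f64 / *c as f64)))
        .collect()
}

fn main() {
    let tab: [Row; 3] = [(0, 2), (0, 4), (0, 9)];
    println!("{:?}", average(&tab));
}
```

Because the two aggregations do not depend on each other, they are the natural candidates for running in separate domains, with only the join needing both results.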

jonhoo avatar Oct 31 '19 15:10 jonhoo

The dot graphs for your multi-domain assignments look correct, and I would expect them to work. The error you get seems to indicate that you receive a record of incorrect length; are you sure that the Ohua-generated operators always produce the right output records?

@jonhoo My understanding (from looking at the graphs) is that @JustusAdam wrote the join-based version of the query (as per #137), and that he wants the aggregations to be in different domains for parallel processing. The join input materialization will use extra space (and some compute), but that's fine for his purpose.

@JustusAdam There's no need for an identity node, and it won't change anything -- the join merely forces the automatically-generated "ingress" node to be (partially) materialized, as indicated by the 3/4 symbol in the top right corner. If you added an identity node, that would get materialized instead.
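
The behaviour described above can be pictured with a toy model. This is a sketch under stated assumptions, not Noria's implementation: `Kind`, `Node`, `Ingress`, and `place_ingress` are hypothetical names. It places an ingress node on every edge that crosses a domain boundary and marks that ingress as materialized when the child is a join.

```rust
// Toy model only: the types below are hypothetical and do not mirror
// Noria's internal data structures.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Aggregate,
    Join,
}

#[derive(Debug, Clone, Copy)]
struct Node {
    kind: Kind,
    domain: usize,
}

#[derive(Debug, PartialEq)]
struct Ingress {
    parent: usize,
    child: usize,
    domain: usize,
    materialized: bool,
}

/// For every edge (parent, child) that crosses a domain boundary, place an
/// ingress node in the child's domain. If the child is a join, that ingress
/// is the join's direct input, so in this model it is the node that gets
/// materialized.
fn place_ingress(nodes: &[Node], edges: &[(usize, usize)]) -> Vec<Ingress> {
    edges
        .iter()
        .filter(|&&(p, c)| nodes[p].domain != nodes[c].domain)
        .map(|&(p, c)| Ingress {
            parent: p,
            child: c,
            domain: nodes[c].domain,
            materialized: nodes[c].kind == Kind::Join,
        })
        .collect()
}

fn main() {
    // count(*) in domain 0, sum(x) in domain 1, the join in domain 1.
    let nodes = [
        Node { kind: Kind::Aggregate, domain: 0 },
        Node { kind: Kind::Aggregate, domain: 1 },
        Node { kind: Kind::Join, domain: 1 },
    ];
    let edges = [(0, 2), (1, 2)];
    for ingress in place_ingress(&nodes, &edges) {
        println!("{:?}", ingress);
    }
}
```

In this model, an identity node inserted between the ingress and the join would simply become the join's direct input in place of the ingress, so the amount of materialized state would not change.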

ms705 avatar Oct 31 '19 16:10 ms705

Ah, good to know.

I am fairly confident that it produces the right output, because the one-domain version works just fine. But I will run a trace over it anyway to figure out if it produces bogus output at any point.

JustusAdam avatar Nov 01 '19 09:11 JustusAdam

Also I am sorry for oversimplifying the query. Yes @ms705 is correct, I am generating the join-based query from #137.

JustusAdam avatar Nov 01 '19 09:11 JustusAdam