typedb
Revise how reasoner planner does answer count estimates
What is the goal of this PR?
This PR introduces a 'connectivity-graph' abstraction to simplify how the reasoner planner's answer count estimator models the effect of constraints in a conjunction on each other.
What are the changes implemented in this PR?
The reasoner planner was originally designed to estimate the answer count of a conjunction of constraints as a greedy cover of the per-variable estimates given by the individual constraints. We would then scale down each estimate based on the variables the constraint shared with other constraints. E.g., if a constraint A had an estimate of A.x for a variable $x, and a constraint B had an estimate of B.x such that B.x < A.x, we would scale the model of A by a factor (B.x/A.x), applying the selectivity of B to A.
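The scaling step above can be sketched as follows. This is an illustrative Python sketch, not the actual implementation; the names (`scale_model`, the dict layout) are assumptions for the example.

```python
def scale_model(model, var, tighter_estimate):
    """Scale every estimate in a constraint's model by the selectivity
    (tighter_estimate / model[var]) implied by another constraint on `var`.
    Hypothetical representation: a model is a dict of estimates."""
    factor = tighter_estimate / model[var]
    if factor >= 1:
        return dict(model)  # the other constraint is not more selective
    return {v: est * factor for v, est in model.items()}

# Constraint A estimates 100 answers for $x; constraint B only 20.
a_model = {"x": 100, "y": 40, "edges": 400}
scaled = scale_model(a_model, "x", 20)
print(scaled)  # {'x': 20.0, 'y': 8.0, 'edges': 80.0}
```

Note that the whole model is scaled, not just the shared variable: B's selectivity on $x also reduces A's estimates for its other variables and edges.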
The drawback of the greedy cover is that these connectivity constraints were local, and did not propagate across multiple resolvables.[1]
In the new implementation, we create a "connectivity-graph" representing the expected answer graph. The graph reflects the structure of the query, with each vertex (or edge) labelled with the expected number of corresponding vertices (or edges) in the answer graph. Further, we do a minimal-spanning-tree style estimate to replace the greedy set-cover - by construction of the connectivity graph, the connectivity constraints apply across multiple resolvables.
Constructing the connectivity graph
The AnswerCountEstimator has a model for each constraint in the conjunction. Each of these can be mapped to a connectivity subgraph, where the model gives us the labels of the vertices and edges. E.g., the model for a has concludable will have an estimate for each variable and for the number of has edges between them. In the corresponding connectivity (sub)graph, these become the labels of the vertices and edges respectively.
When we combine two connectivity (sub)graphs, we "scale" down the variable labels on each so they agree on the estimates of common variables. We then propagate this by scaling down the number of edges connected to the variable. If the number of edges is less than the vertex on the other end, we scale it down to the number of edges and propagate it further.
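The scale-and-propagate step can be sketched as below. This is a minimal illustration assuming a dict-based graph representation; the function name, the `adjacency` layout, and the numbers are all hypothetical.

```python
def propagate(vertices, edges, start, new_count, adjacency, seen=None):
    """Cap a vertex label at `new_count`, scale its incident edges by the
    same factor, and recurse: if an edge now carries fewer answers than
    the vertex at its other end, cap that vertex too."""
    seen = set() if seen is None else seen
    factor = new_count / vertices[start]
    if factor >= 1:
        return  # nothing to scale down
    vertices[start] = new_count
    for edge, other in adjacency[start]:
        if edge in seen:
            continue  # do not rescale the edge we arrived through
        seen.add(edge)
        edges[edge] *= factor
        if edges[edge] < vertices[other]:
            propagate(vertices, edges, other, edges[edge], adjacency, seen)

# Chain $x - $y - $z; another constraint lowers the estimate of $x to 25.
vertices = {"x": 100, "y": 50, "z": 50}
edges = {("x", "y"): 100.0, ("y", "z"): 50.0}
adjacency = {"x": [(("x", "y"), "y")],
             "y": [(("x", "y"), "x"), (("y", "z"), "z")],
             "z": [(("y", "z"), "y")]}
propagate(vertices, edges, "x", 25, adjacency)
print(vertices)  # the cap on $x has propagated to $y and $z
```

The point of the propagation is exactly what the greedy cover missed: tightening one vertex tightens its edges, which may in turn tighten vertices several resolvables away.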
Estimating answer count from the graph
The answer count of a conjunction is the number of distinct combinations of answers to the variables in the conjunction.
If $v(x)$ is the label of the vertex corresponding to $x$ in the graph, and $e(x, y)$ is the label of the (undirected) edge between $x$ and $y$, we define the directed connectivity from $x$ to $y$ as $c(x, y) = e(x, y)/v(x)$.
First, consider the simplified problem of finding the answers to all variables (as opposed to some subset) for a query with a tree structure.
We could start at any vertex $x$ and compute the product of connectivities for each edge in the tree rooted at $x$ (with each edge directed from root to leaf).
$ answers = v(x) \times \prod_{y \in adj(x)} { c(x,y) \prod_{z \in adj(y)} { c(y,z) \cdots } } $
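The tree-rooted product can be sketched as below, using the numbers from footnote [1] (15 each of $x, $y, $z, each connected one-to-one). The function and the dict layout are illustrative assumptions, not the actual implementation.

```python
def tree_estimate(root, v, c, children):
    """answers = v(root) times the connectivity of every edge,
    walking the tree from root to leaves."""
    total = v[root]
    stack = [root]
    while stack:
        x = stack.pop()
        for y in children.get(x, []):
            total *= c[(x, y)]  # c(x, y) = e(x, y) / v(x)
            stack.append(y)
    return total

# Footnote [1]'s example: 15 of each variable, one-to-one connections.
v = {"x": 15, "y": 15, "z": 15}
c = {("x", "y"): 1.0, ("y", "z"): 1.0}  # each $x meets one $y; each $y one $z
children = {"x": ["y"], "y": ["z"]}
print(tree_estimate("x", v, c, children))  # 15.0, not the greedy cover's 225
```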
Two complications arise:
- When the query is not tree-structured, neither is our connectivity graph, and there is no unique tree. Ideally, we would use the tree which yields the minimum value of $answers$. This corresponds to the minimum-weight directed spanning tree. Since we are taking products of connectivities, the edge weights have to be the logs of the actual connectivities of the directed edges.
- When we are only querying the answer count of a subset of the variables, we cannot simply multiply the connectivities, because the estimate of how many vertices of a variable $x$ we are connected to might exceed the number of unique vertices possible for $x$ (i.e. $v(x)$) (see [2]). For this, we have to cap the product we're computing at each vertex so that it does not exceed $v(x)$.
Note that the minimum spanning tree for the set of all variables may not be the same as the one for a subset of the variables. Ideally, we would find a combined solution to both problems.
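A small sketch of why the spanning-tree weights must be logs of the connectivities: a minimum-weight spanning tree minimizes a *sum* of edge weights, while we want to minimize a *product* of connectivities, and taking logs turns one into the other. The edge values below are illustrative, not taken from the planner.

```python
import math

# Illustrative directed connectivities for a triangle $x, $y, $z.
connectivities = {("x", "y"): 0.5, ("y", "z"): 4.0, ("x", "z"): 2.0}
log_weights = {e: math.log(c) for e, c in connectivities.items()}

# Two candidate spanning trees rooted at $x.
tree_a = [("x", "y"), ("y", "z")]
tree_b = [("x", "y"), ("x", "z")]

def product(tree):
    return math.prod(connectivities[e] for e in tree)

def log_sum(tree):
    return sum(log_weights[e] for e in tree)

# Minimizing the sum of logs picks the same tree as minimizing the product.
assert (product(tree_a) < product(tree_b)) == (log_sum(tree_a) < log_sum(tree_b))
print(product(tree_a), product(tree_b))  # 2.0 1.0 -> tree_b wins either way
```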
[1]: Say we had 15 each of $x, $y, and $z, where each $x is connected to exactly one unique $y through a constraint ($x, $y), and each $y to one unique $z through ($y, $z). The greedy cover would pick both ($x, $y) and ($y, $z), but with an estimate of 15 from ($x, $y) and 15 from ($y, $z), it would yield an answer count of 225. The correct approach would see that each $x is connected to one $y, and that $y in turn to one $z, for a total answer count of 15.
[2]: When querying the answers to {$x, $y} for a query graph ($x - $y - $z), we may have one $x connected to five $y, each connected to one $z. The minimum spanning tree would have a weight of 5, whereas the number of answers cannot exceed 1 (the only pair of $x, $y which exists in the database).
PR Review Checklist
Do not edit the content of this comment. The PR reviewer should simply update this comment by ticking each review item below, as they get completed.
Trivial Change
- [ ] This change is trivial and does not require a code or architecture review.
Code
- [ ] Packages, classes, and methods have a single domain of responsibility.
- [ ] Packages, classes, and methods are grouped into cohesive and consistent domain model.
- [ ] The code is canonical and the minimum required to achieve the goal.
- [ ] Modules, libraries, and APIs are easy to use, robust (foolproof and not error-prone), and tested.
- [ ] Logic and naming have a clear narrative that communicates the accurate intent and responsibility of each module (e.g. method, class, etc.).
- [ ] The code is algorithmically efficient and scalable for the whole application.
Architecture
- [ ] Any required refactoring is completed, and the architecture does not introduce technical debt incidentally.
- [ ] Any required build and release automations are updated and/or implemented.
- [ ] Any new components follow a consistent style with respect to the pre-existing codebase.
- [ ] The architecture intuitively reflects the application domain, and is easy to understand.
- [ ] The architecture has a well-defined hierarchy of encapsulated components.
- [ ] The architecture is extensible and scalable.