mimir
mimir copied to clipboard
Implement/optimize possible-existential queries
An existential query returns simply true or false – true if the select query has at least one result, false otherwise. For example, the following query returns true if there is at least one row in R where A = 5.
EXISTS SELECT * FROM R WHERE A = 5
When combined with SELECT POSSIBLE
queries (see #178), the resulting query essentially tests to see if there is any possible circumstance under which the query could return some value. Such queries are similar to why/why-not provenance, but not quite the same. Why/why-not provenance allows a wide variety of database modifications to serve as causal explanations (adding/removing tuples, changing values, etc...). Here, there is only one type of modification to explore: The value assigned to a variable (i.e., one emitted by a VGTerm) may be replaced by any other value from the same variable's domain.
This limitation should, in principle, make it possible to evaluate possible-existential queries without going through the full complexity of query evaluation, as any such query can be reduced to an equivalent boolean expression over variables (instantiated by VGTerms). The resulting problem reduces to 3SAT in general, but information about the VGTerm's models can be used to make evaluation more efficient. For example:
- Variables with a finite, discrete domain bound the search space
- Uncorrelated variables (i.e., variables with a different model) can be decomposed/factorized out into separate expressions
In Mimir, uncertainty is created by a type of operator we call a VGTerm (essentially a skolem term), that conceptually emits 'variables' that act as a placeholders for values assigned by an externally defined probability distribution (that we call a model). For example, let's say you have a table R(A, B). The probabilistic repair-key operator might be implemented as:
SELECT A, NTH(B, VGTerm("UniqueIdentifier", A)) FROM R GROUP BY A
(NTH returns the N'th input row to the aggregate, and the VGTerm creates variables unique for each distinct 'A'). Missing value imputation for B would be implemented as
SELECT A, CASE WHEN B IS NULL THEN VGTerm("UniqueIdentifier", ROWID) ELSE B END AS B FROM R
Now let's say you want:
EXISTS SELECT POSSIBLE * FROM ... WHERE A = 5 AND B = 3
where "..." is one of the two examples (repair key or missing value imputation). The former case reduces to the single-atom boolean expression:
{{"UniqueIdentifier", 5}} = 3
Where {{ • }} is a variable generated by a VGTerm. To solve this, we need to interrogate the model backing this variable and evaluate whether it could potentially be 3. The latter case is trickier... If you have <5,3> in R, then clearly the existential is always true. Otherwise, it resolves to a disjunction of the form:
{{"UniqueIdentifier", ROWID-1}} = 3 V {{"UniqueIdentifier", ROWID-2}} = 3 V ...
Where ROWID-1, ROWID-2, ... are obtained from the query SELECT ROWID FROM R WHERE A = 3 AND B IS NULL; A similar interrogation of the works here (although in general, we may need to settle for only probabilistic guarantees that the result is possible).
To summarize, in Mimir, any non-determinism in the probabilistic result set can be traced back to a finite, enumerable set of variables emitted by the VGTerms -- and in most of the practical examples that I've run into, this finite set is super-small.