bayeslite
bayeslite copied to clipboard
ESTIMATE * FROM PAIRWISE results in exception
I ran this command trying to figure out what fields "FROM PAIRWISE" returns by default, when not extended with any model estimators:
ESTIMATE * FROM PAIRWISE satellites_cc
This results in an exception:
File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/bayesdb.py", line 158, in execute
File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/bql.py", line 59, in execute_phrase
File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/bql.py", line 644, in execute_wound
File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/bayesdb.py", line 176, in sql_execute
sqlite3.OperationalError: near "AS": syntax error
I suppose the fact that "*" cannot be used related to the fact that "expression" instead of "columns" is used in the documentation:
ESTIMATE <expression> FROM PAIRWISE <generator>
However, something like "*" here would be helpful. The alternative that I used instead to reveal the returned fields is:
ESTIMATE "foo" FROM PAIRWISE satellites_cc
which I don't think should be necessary.
Also, perhaps the BQL documentation should define the fields that are returned and their relationship to the fields in the data table, e.g.:
Returns a table with three fields:
1. rowid0: The value of field "rowid" in the data table for the first row.
2. rowid1: The value of field "rowid" in the data table for the second row.
3. value: The result of evaluating the expression on the two rows.
Also, the equivalent for columns also is not supported, e.g.:
ESTIMATE * from PAIRWISE COLUMNS OF satellites_cc
File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/bayesdb.py", line 158, in execute
File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/bql.py", line 56, in execute_phrase
File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/compiler.py", line 229, in compile_query
File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/compiler.py", line 723, in compile_estpaircols
IndexError: tuple index out of range
For the pairwise columns case, knowing that the output fields are name0 and name1 are important if the user is interested in only a subset of the columns:
estimate dependence probability as dp from pairwise columns of satellites_cc
where (name0 = "Anticipated_Lifetime" OR name0 = "Contractor")
AND (name1 = "Purpose" OR name1 = "Eccentricity"
The first symptom, that ESTIMATE * FROM PAIRWISE t raises sqlite3.OperationalError, indicates a bug -- the BQL->SQL compiler should never generate broken SQL, but obviously it is generating broken SQL. The bayeslite sql_trace should reveal what SQL was generated. Can you find that?
The second part, of what * means in ESTIMATE ... FROM PAIRWISE t, is...well, it has never come up before, partly because (a) there's only one thing you typically do with it, namely SIMILARITY, and nobody's tried to explore that before; but also because (b) the number of ways you can use SIMILARITY has a combinatorial explosion: nrows * 2^ncols.
As a postscript: note that "foo" is how you quote identifiers. What you wanted to do is write strings, which are written 'foo' instead. This didn't fail immediately for two reasons:
- The BQL->SQL compiler isn't equipped to detect this mistake at the moment -- that is the subject of issue #38.
- sqlite3 intentionally lets you conflate the two where unambiguous, likely for historical compatibility with other broken SQL engines -- but this may come back and bite you later, if a column is ever added to the table you're selecting from whose name happens to coincide with the text of a string you were trying to write.
The fact that the space of possible model estimator expressions for use in ESTIMATE is large is interesting, and suggests a possible tension between the type of analysis this tool is supporting and this SQL-based framework. Here we have a combinatorial explosion in the space of similarity estimators. However, one could even imagine an infinite set of model estimators, parameterized by some real number, such as the indicator function of whether the entropy of the predictive distribution is greater than some number.
Are there other projects where the semantics of SELECT * are similarly troubled by an infinite space of possible things to select from?
Um, what does combinatorial explosion have to do with it? Standard SQL permits an infinite number of possible expressions for SELECT ... FROM, and also defines that SELECT * FROM produces the columns. Methinks the real issue with defining ESTIMATE * FROM PAIRWISE is that the point of the PAIRWISE is to introduce implicit arguments for functions, so there is no sensible meaning for "all the ground information which you may recombine with function applications".
What about having ESTIMATE * FROM PAIRWISE COLUMNS defined as producing the two column IDs and their statistical types, and having ESTIMATE * FROM PAIRWISE (ROWS) defined as producing the row IDs? Something like this would help the user quickly understand the type of collections FROM PAIRWISE returns, and also give access to important information like the statistical types in the case of columns. The import of having the row and column IDs beyond making it clear what the command is doing is less obvious. I'm guessing these row and column IDs are linked to other tables?