bayeslite icon indicating copy to clipboard operation
bayeslite copied to clipboard

ESTIMATE * FROM PAIRWISE results in exception

Open marcoct opened this issue 10 years ago • 5 comments

I ran this command trying to figure out what fields "FROM PAIRWISE" returns by default, when not extended with any model estimators:

ESTIMATE * FROM PAIRWISE satellites_cc

This results in an exception:

  File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/bayesdb.py", line 158, in execute
  File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/bql.py", line 59, in execute_phrase
  File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/bql.py", line 644, in execute_wound
  File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/bayesdb.py", line 176, in sql_execute
sqlite3.OperationalError: near "AS": syntax error

I suppose the fact that "*" cannot be used related to the fact that "expression" instead of "columns" is used in the documentation:

ESTIMATE <expression> FROM PAIRWISE <generator>

However, something like "*" here would be helpful. The alternative that I used instead to reveal the returned fields is:

ESTIMATE "foo" FROM PAIRWISE satellites_cc

which I don't think should be necessary.

Also, perhaps the BQL documentation should define the fields that are returned and their relationship to the fields in the data table, e.g.:

Returns a table with three fields: 
1. rowid0: The value of field "rowid" in the data table for the first row.
2. rowid1: The value of field "rowid" in the data table for the second row.
3. value: The result of evaluating the expression on the two rows.

Also, the equivalent for columns also is not supported, e.g.:

ESTIMATE * from PAIRWISE COLUMNS OF satellites_cc
  File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/bayesdb.py", line 158, in execute
  File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/bql.py", line 56, in execute_phrase
  File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/compiler.py", line 229, in compile_query
  File "build/bdist.macosx-10.9-x86_64/egg/bayeslite/compiler.py", line 723, in compile_estpaircols
IndexError: tuple index out of range

marcoct avatar Oct 18 '15 21:10 marcoct

For the pairwise columns case, knowing that the output fields are name0 and name1 are important if the user is interested in only a subset of the columns:

estimate dependence probability as dp from pairwise columns of satellites_cc
where (name0 = "Anticipated_Lifetime" OR name0 = "Contractor") 
AND (name1 = "Purpose" OR name1 = "Eccentricity"

marcoct avatar Oct 18 '15 23:10 marcoct

The first symptom, that ESTIMATE * FROM PAIRWISE t raises sqlite3.OperationalError, indicates a bug -- the BQL->SQL compiler should never generate broken SQL, but obviously it is generating broken SQL. The bayeslite sql_trace should reveal what SQL was generated. Can you find that?

The second part, of what * means in ESTIMATE ... FROM PAIRWISE t, is...well, it has never come up before, partly because (a) there's only one thing you typically do with it, namely SIMILARITY, and nobody's tried to explore that before; but also because (b) the number of ways you can use SIMILARITY has a combinatorial explosion: nrows * 2^ncols.

As a postscript: note that "foo" is how you quote identifiers. What you wanted to do is write strings, which are written 'foo' instead. This didn't fail immediately for two reasons:

  1. The BQL->SQL compiler isn't equipped to detect this mistake at the moment -- that is the subject of issue #38.
  2. sqlite3 intentionally lets you conflate the two where unambiguous, likely for historical compatibility with other broken SQL engines -- but this may come back and bite you later, if a column is ever added to the table you're selecting from whose name happens to coincide with the text of a string you were trying to write.

riastradh-probcomp avatar Oct 19 '15 16:10 riastradh-probcomp

The fact that the space of possible model estimator expressions for use in ESTIMATE is large is interesting, and suggests a possible tension between the type of analysis this tool is supporting and this SQL-based framework. Here we have a combinatorial explosion in the space of similarity estimators. However, one could even imagine an infinite set of model estimators, parameterized by some real number, such as the indicator function of whether the entropy of the predictive distribution is greater than some number.

Are there other projects where the semantics of SELECT * are similarly troubled by an infinite space of possible things to select from?

marcoct avatar Oct 19 '15 16:10 marcoct

Um, what does combinatorial explosion have to do with it? Standard SQL permits an infinite number of possible expressions for SELECT ... FROM, and also defines that SELECT * FROM produces the columns. Methinks the real issue with defining ESTIMATE * FROM PAIRWISE is that the point of the PAIRWISE is to introduce implicit arguments for functions, so there is no sensible meaning for "all the ground information which you may recombine with function applications".

axch avatar Oct 19 '15 21:10 axch

What about having ESTIMATE * FROM PAIRWISE COLUMNS defined as producing the two column IDs and their statistical types, and having ESTIMATE * FROM PAIRWISE (ROWS) defined as producing the row IDs? Something like this would help the user quickly understand the type of collections FROM PAIRWISE returns, and also give access to important information like the statistical types in the case of columns. The import of having the row and column IDs beyond making it clear what the command is doing is less obvious. I'm guessing these row and column IDs are linked to other tables?

marcoct avatar Oct 19 '15 22:10 marcoct