ibis icon indicating copy to clipboard operation
ibis copied to clipboard

bug: `.sql()` sometimes execute the query in order to get the schema

Open chloeh13q opened this issue 1 year ago • 1 comments

What happened?

The .sql() method calls _get_schema_using_query() underneath the hood, which uses the query to get the schema if a schema is not passed as an argument. The implementation of _get_schema_using_query() differs across backends, but for the most part, if the backend provides a way to analyze the query, that's what we use. If the backend doesn't, a lot of times we create a new view/table, execute the query, and then drop the view/table so that the method has no side effect. In some backends, however, we're just executing the query. Examples: Flink, PySpark.

What version of ibis are you using?

main

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

chloeh13q avatar May 08 '24 22:05 chloeh13q

It's unclear if any of the backends are actually executing the queries (which would be a bug), or if they're returning a deferred object with schema information for us to extract.

Probably the first step here is to add a unit-test that tries to run a .sql() call that would be expensive to actually execute and then ensure that it returns in a "short" amount of time.

gforsyth avatar May 10 '24 17:05 gforsyth

Investigated this, and here are the results:

backend cheap? description
bigquery ✔️ uses dry_run=True which doesn't execute the query
clickhouse ✔️ creates a view then gets the view's schema
datafusion ✔️ creates a view then gets the view's schema
druid ✔️ doesn't support .sql
duckdb ✔️ calls DESCRIBE on the query, which doesn't execute the query
exasol ✔️ creates a view then gets the view's schema
flink ✔️ sql_query doesn't execute the query, it constructs a Java object using a private constructor that only sets its inputs as attributes on a class.
impala ✔️ creates a view then gets the view's schema
mssql ✔️ uses a special function sp_describe_first_result_set which gets metadata using static analysis
mysql EXECUTES
oracle ✔️ creates a view then gets the view's schema
polars uses eager=None which can be True at init-time (not sure what that means), so will address this in a PR to explicitly be False.
postgres ✔️ creates a view then gets the view's schema
pyspark ✔️ we use SparkSession.sql to get type information, and this method doesn't execute the query before returning
risingwave ✔️ creates a view then gets the view's schema
snowflake ✔️ executes a LIMIT 0 version of the query and returns the DESCRIBE RESULT output
sqlite ✔️ creates a view then gets the view's schema
trino ✔️ creates a prepared statement from a query and gets the DESCRIBE OUTPUT output; prepared statements do not execute the query

PR incoming for polars and mysql.

cpcloud avatar Jun 02 '24 13:06 cpcloud