bug: `.sql()` sometimes execute the query in order to get the schema
What happened?
The .sql() method calls _get_schema_using_query() underneath the hood, which uses the query to get the schema if a schema is not passed as an argument. The implementation of _get_schema_using_query() differs across backends, but for the most part, if the backend provides a way to analyze the query, that's what we use. If the backend doesn't, a lot of times we create a new view/table, execute the query, and then drop the view/table so that the method has no side effect. In some backends, however, we're just executing the query. Examples: Flink, PySpark.
What version of ibis are you using?
main
What backend(s) are you using, if any?
No response
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
It's unclear if any of the backends are actually executing the queries (which would be a bug), or if they're returning a deferred object with schema information for us to extract.
Probably the first step here is to add a unit-test that tries to run a .sql() call that would be expensive to actually execute and then ensure that it returns in a "short" amount of time.
Investigated this, and here are the results:
| backend | cheap? | description |
|---|---|---|
| bigquery | ✔️ | uses dry_run=True which doesn't execute the query |
| clickhouse | ✔️ | creates a view then gets the view's schema |
| datafusion | ✔️ | creates a view then gets the view's schema |
| druid | ✔️ | doesn't support .sql |
| duckdb | ✔️ | calls DESCRIBE on the query, which doesn't execute the query |
| exasol | ✔️ | creates a view then gets the view's schema |
| flink | ✔️ | sql_query doesn't execute the query, it constructs a Java object using a private constructor that only sets its inputs as attributes on a class. |
| impala | ✔️ | creates a view then gets the view's schema |
| mssql | ✔️ | uses a special function sp_describe_first_result_set which gets metadata using static analysis |
| mysql | ⛔ | EXECUTES |
| oracle | ✔️ | creates a view then gets the view's schema |
| polars | ❓ | uses eager=None which can be True at init-time (not sure what that means), so will address this in a PR to explicitly be False. |
| postgres | ✔️ | creates a view then gets the view's schema |
| pyspark | ✔️ | we use SparkSession.sql to get type information, and this method doesn't execute the query before returning |
| risingwave | ✔️ | creates a view then gets the view's schema |
| snowflake | ✔️ | executes a LIMIT 0 version of the query and returns the DESCRIBE RESULT output |
| sqlite | ✔️ | creates a view then gets the view's schema |
| trino | ✔️ | creates a prepared statement from a query and gets the DESCRIBE OUTPUT output; prepared statements do not execute the query |
PR incoming for polars and mysql.