jupysql icon indicating copy to clipboard operation
jupysql copied to clipboard

Use an alternative output format when using DuckDB

Open edublancas opened this issue 2 years ago • 4 comments

DuckDB provides its own output format that we can leverage instead of converting things into ResultSet so if we're using a DuckDB connection we should use such format instead, see https://github.com/ploomber/jupysql/issues/451#issuecomment-1528826483 for details

edublancas avatar May 29 '23 16:05 edublancas

As a DuckDB user, this would make jupysql much more useful. The DuckDB DuckDBPyRelation is a lazy table representation that provides a bunch of useful integrations, including building further lazy queries using the relational (non-SQL) interface and also converting to multiple output formats (beyond just Pandas and Polars dataframes).

Perhaps this could be implemented via keeping the initial default the ResultSet, and then allowing users to opt into producing DuckDB relations by default, similar to how you can set the default output to Pandas or Polars dataframes. eg something like:

%config SqlMagic.autoduckdb = True

This would result in an out of the box experience that's consistent with other back ends, but allow users to opt into the much more useful DuckDBPyRelation as the default output.

ned2 avatar Oct 02 '23 04:10 ned2

Hey @edublancas, I've made a proof of concept for my above suggestion here. If you think this proposal has legs, I'd be up for working it into a PR.

I'm not really that familiar with the jupysql codebase though, so I'm sure I'm overlooking some things. Would be keen to hear alternative suggestions!

ned2 avatar Oct 02 '23 05:10 ned2

@ned2: thanks for working on this! I ran some quick tests, and here are my thoughts.

You're on the right track; I think the best place to apply this change is in run.py. This is the line where we execute the user's query. So we need to modify both run.py and raw_execute.

The critical consideration here is that there are two ways for users to connect to duckdb: sqlalchemy (via duckdb-engine) or with a native duckdb.connect() connection. So, we need a way for both scenarios to behave consistently. I think let's tackle the case where users have a native connection because it's simpler, then, we can modify the sqlalchemy case.

So to fix this, we need to create a new method similar to DBAPIConnection.raw_execute but that calls sql (or query? are they the same?) , instead of execute. if we do that, we'll get a DuckDBPyRelation

Then, in run.py we can decide which method to run, depending on the configuration (regular raw_execute vs the new one).

I'm unsure if autoduckdb is the best name, but let's get this working first and we can change the name later.

feel free to ask any questions if this explanation isn't clear! happy to review a draft PR.

edublancas avatar Oct 03 '23 00:10 edublancas