jupysql icon indicating copy to clipboard operation
jupysql copied to clipboard

SqlMagic.autolimit doesn't add a limit clause to the query

Open takikadiri opened this issue 1 year ago • 3 comments

What happens?

We're using jupysql to query a trino cluster through jupyter notebook. I want to autolimit the query results that we query to the cluster, as it's intended to be used for an interactive analytics workload.

I expect the SqlMagic.autolimit to add a LIMIT clause to the query, but it doesn't, i can see the original query in the Trino Admin UI.

To Reproduce

  • Create a trino sqlalchemy connection. trino_connection = create_engine("trino://trino_cluster....")
  • set an autolimit. %config SqlMagic.autolimit=500
  • Use the connection: %sql trino_connection
  • Run a query: %sql SELECT * FROM BIG_TABLE
  • Get the result sets or See the query in the trino UI. You'll get the SELECT * FROM BIG_TABLE without the LIMIT Clause

OS:

Linux

JupySQL Version:

0.10.10

Full Name:

Takieddine KADIRI

Affiliation:

Société Générale Assurances

takikadiri avatar Mar 29 '24 13:03 takikadiri

yeah, the effect of autolimit is not to add a LIMIT to the statement but to send the query and only fetch the top X results, are you seeing performance issues?

you can see the relevant code here.

the main reason not to add LIMIT, is that some DBs don't support it so we'd have to figure out a way to make it portable. I guess it wouldn't be too hard using sqlglot, but given that we support many databases, adding stuff like this ends up breaking features for some users.

perhaps a more reasonable solution would be to add LIMIT only to the dbs that we're sure they support it, but even detecting which db the user is connected to is challenging and we've encountered edge cases.

bottom line is that we've decided not to modify user-submitted queries to prevent these issues. if you have time, feel free to open a PR, perhaps there is another approach that I'm overlooking

edublancas avatar Mar 29 '24 14:03 edublancas

Ah i see, The doc mislead me. Thank you for the quick feedback.

For the interactive analytics workload, we're also using Apache Superset, that manage to systematically LIMIT the query. That's why i was looking for this in jupysql

Maybe i could use jupysql for a slighly different use case, where the desired behavior is to collect the entire query results for further analysis with pandas or for doing joins with another data sources. I'll tests the user bahaviors and the cluster performance with jupysql and autopandas=True and let you know if the LIMIT clause is really needed here.

takikadiri avatar Mar 30 '24 14:03 takikadiri

sounds good, thanks for your feedback!

edublancas avatar Mar 30 '24 14:03 edublancas