ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat: Add ability to convert SQL to Ibis expressions

Open prmoore77 opened this issue 3 years ago • 13 comments

Ibis is a great tool for representing SQL logic in Python expressions. Currently - the translation goes one way: from Ibis (Python) to SQL expressions that get compiled and executed in the backend of choice.

It would be amazing if Ibis offered the ability to translate SQL code into Ibis expressions. It would help teams that want to change Database back-ends (from Snowflake to BigQuery for example) - namely by allowing them to translate their SQL to Ibis - and just change the back-end database. They would then "future-proof" their applications by leveraging Ibis to be database-agnostic.

Such a feature would make it easy for teams to adopt Ibis - and provide a turn-key experience for those wishing to translate large SQL workflows.

The input for this function would simply be a SQL string and the target back-end (for dialect purposes). The output would be a string (or file) with a python module containing all of the Ibis code necessary to replicate the SQL provided as input.

prmoore77 avatar Aug 08 '22 13:08 prmoore77

Thanks for raising this @prmoore77 -- we've been thinking about this a little bit.

There are some language traps we should be careful around. To me, an "ibis expression" is the result of executing some Python code -- it is the object that we actually then compile to SQL and execute against a given backend. And we do have a project to convert SQL to Ibis expressions, which is ibis-substrait.

I'm not sure what the right name is for what you're suggesting, but I think you want to be able to go from SQL written out in a file, to Python written out in a file, e.g.

SELECT * FROM lineitem LIMIT 10;

would translate to

expr = con.table("lineitem").limit(10)

I think the best way for us to do this is probably to add the ability for Ibis itself to generate a quine-like representation of itself, so that the expr above as a Python object could spit out the Python code that created it.

If we have that, then SQL string -> Ibis string would be doable as SQL String -> Substrait (via duckdb or isthmus) -> Ibis-Substrait -> Ibis Expression -> Ibis String

gforsyth avatar Aug 08 '22 13:08 gforsyth

There might be an easy win for a subset of use cases here.

@prmoore77 What if .sql could automatically convert the SQL string to the appropriate backend dialect? This is possible with sqlglot already, we'd just need to bake that into the various places that the .sql API hits.

That would at least allow people to take whatever their existing SQL is and use it with a possibly-different backend.

cpcloud avatar Aug 08 '22 13:08 cpcloud

It would be pretty cool to do something like this:

import ibis

sql = """
SELECT
    *
FROM
    lineitem
LIMIT
    5

result = ibis.postgres.convert_sql(sql)

print(type(result))
print(result)
"""

out:

str
"conn.table('lineitem').limit(5)"

I don't know where conn would come from though. Maybe just a default convention ?

I'd imagine this would be useful for one-time conversions so people aren't relying too much on .sql. A job could be run once to convert workflows to ibis via with open(): / write.

edit: this might also just be a side/related project and not something that actually goes into ibis

p-a-a-a-trick avatar Aug 08 '22 13:08 p-a-a-a-trick

https://github.com/zbrookle/sql_to_ibis

this already exists and supports most of the syntax

jreback avatar Aug 08 '22 23:08 jreback

https://github.com/zbrookle/sql_to_ibis

this already exists and supports most of the syntax

Looking at that repo - it appears that it does convert SQL to an Ibis expression - but I don't believe it generates Ibis code. I may be mis-understanding, however. I'll pull the repo and try to see if it does indeed allow you to generate an Ibis expression...

In the MWE in the Readme - it appears they are just using a single table. It is not clear that they support Joins, etc. (at least at my cursory glance)...

prmoore77 avatar Aug 09 '22 12:08 prmoore77

@prmoore77 Does #4335 cover you desired use case? If so, would you mind closing out this issue and we can continue the discussion there?

cpcloud avatar Aug 15 '22 13:08 cpcloud

@cpcloud - Not exactly. #4335 does cover the 2nd half of the request (converting Ibis expressions to Python code which would reproduce the expression) - but it doesn't cover converting SQL to Ibis expressions in the first place.

My request would allow for automated conversion of SQL to Ibis code. Ideally it would automate tasks - like the work that @gforsyth did to convert the TPC-H SQL queries to Ibis code.

So I do think this is linked to #4335 - but it isn't 100% encompassed by it... Are you ok if we leave this open?

prmoore77 avatar Aug 15 '22 15:08 prmoore77

Ex. spark uses ANTLR to parse the sql string and convert it to the scala tree:

  • https://github.com/apache/spark/blob/efc1e8ac8bc61872601ac2244629a9d54f8889fb/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
  • https://github.com/apache/spark/blob/35d00df9bba7238ad4f409999617fae4d04ddbfd/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4

and thus allowing the internal tree to be defined both via API AND sql string, BUT spark supports ONLY this single "spark sql" dialect (which they say it's an adaption of Presto's grammar).

One could theoretically take the above .g4 files (with some changes as they have a bit of scala code hardcoded there) and use ANTLR to generate a python lexer/parser (see ANTLR python-target) that gives us an ibis tree, but all this effort will add ONLY "spark sql" dialect support in ibis.

Supporting all sql dialects would be... a HUGE effort.

My 2 cents: ibis could have, a) like spark, it's own "ibis sql" dialect (or use sparks') with a python lexer/parser that generates the ibis tree from a sql string AND b) something else (like sqlglot, jooq) could convert the sql string from other dialects to the single-supported "ibis sql" (with some conversion losses).

P.S.: few years ago I found there is this wonderful niche called "language engineering" - there are these super-skilled "language engineers" that make a living by creating Domain Specific Languages (like SQL), together with lexers, parsers and tooling (think autocomplete and throwing nice errors) - ex. @ftomassetti - and these people bring a lot of value in converstions like this -> should ibis decide to have its own sql dialect

ams1 avatar Sep 20 '22 07:09 ams1

@ams1 Yeah! We've been thinking about using sqlglot for this, since we're already using it and it's a full featured no-dependency SQL parser for Python.

cpcloud avatar Sep 20 '22 09:09 cpcloud

if sqlglot will directly return the ibis tree, it will become a dependency for ibis

image

ams1 avatar Sep 20 '22 11:09 ams1

hi @ams1 , thank you for linking me to the conversation, so I got a chance to learn about ibis :)

ftomassetti avatar Sep 20 '22 14:09 ftomassetti

@ams1 Nice graph :)

Yep, sqlglot is already a dependency of ibis.

cpcloud avatar Sep 20 '22 15:09 cpcloud

I was wondering how I missed the elephant in the house, but I now see sqlglot was "in a corner of the backyard" 😄 -> used only in:

ibis/backends/base/sql/registry/timestamp.py
  90, 12:     import sqlglot as sg

ibis/common/pretty.py
  70, 12:     import sqlglot
  102, 17:     (pretty,) = sqlglot.transpile(

The diagram was created "here" inside the github comment using the mermaid markdown - since recently, github renders mermaid automatically (!) - but as I think it's rendered only inside a browser github and not inside the github mobile app, I usually take a screenshot and paste the picture

ams1 avatar Sep 20 '22 19:09 ams1

This will be an interesting "code episode" to watch as sqlglot has its own "forest" of trees (parsed, optimized ...), its own rules...

from sqlglot.planner import Plan
from sqlglot import parse_one
from sqlglot.optimizer import optimize, RULES
SCHEMA = {
    "df": {
        "a": "uint64",
        "b": "uint64",
    }
}
sqlglot_expr_tree = parse_one("SELECT a FROM df where b < 5")
# RULES = (
#     qualify_tables,
#     isolate_table_selects,
#     qualify_columns,
#     pushdown_projections,
#     normalize,
#     unnest_subqueries,
#     expand_multi_table_selects,
#     pushdown_predicates,
#     optimize_joins,
#     eliminate_subqueries,
#     merge_derived_tables,
#     quote_identities,
# )
sqlglot_optimized_expr_tree = optimize(sqlglot_expr_tree, SCHEMA, rules=RULES)
sqlglot_plan = Plan(sqlglot_optimized_expr_tree)

Looking forward to see how the .transform(to_ibis_tree) will be done.

ams1 avatar Sep 22 '22 07:09 ams1

We've made meaningful progress on this, closing.

cpcloud avatar Jan 30 '23 15:01 cpcloud

xref: https://github.com/ibis-project/ibis/pull/4853

gforsyth avatar Jan 30 '23 15:01 gforsyth