ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat: standardize APIs for creating tables from expressions

Open mariano-recurve opened this issue 5 years ago • 5 comments

A common pattern for me is to build up a SQL expression, execute it against the database backend and save the results as a database table. I think it's not currently possible to do this entirely in Ibis? What I generally do is compile the expression to SQL and then execute that using either the BigQuery web interface or the Python client, both of which have options to save the query as a table. It would be really nice if I could do all of this without leaving Ibis.

Some other things I often do:

  • List datasets (Ibis does this)
  • List tables (Ibis does this)
  • Create and delete tables and datasets
  • Execute queries and save the results directly as files on GCS or local disk
  • Copy tables to local disk or Google Cloud Storage, e.g. as CSV, Avro, etc
  • Upload CSV files or dataframes to database tables

I realize this stuff is potentially out of scope, but just thought I'd ask. So far I end up writing a wrapper library around Ibis to do all of this but it strikes me that it could fit really nicely in the main Ibis project. I think some of the Blaze project was focused on use cases like this and I think maybe there's a gap here.

mariano-recurve avatar Aug 18 '20 20:08 mariano-recurve

this is an interesting request - I'd asked for something similar where any expr.execute could support an as parameter, which would leverage the underlying engine's support create table (or even the select ...INTO variant supported by sql server and the like)

so something like

expr.execute(as='newTable', temp=True)

niviksha avatar Aug 27 '20 00:08 niviksha

Bumping this up - It would be great to add the ability to do this esp for OmniSciDB but any backend that supports temporary tables( cc @xmnlab ) - one common pattern I have to run an ibis expression and use that materialized result for the next set of Ibis expressions.

The big advantage of this, is that it allows potentially large intermediate result sets not to have to return client-side, esp if temp tables are supported by the backend - i.e the materialization is controllable by the user via this option

e.g. something like

expr1 = t.groupby(t.a).aggregate(t.count().name('ct'))
result = expr1.execute(as='newTable', temp=True) #this would return a TableExpr wrapping 'newTable'
result.filter(ct > 500) #...etc

#we could even chain this
result = expr1.execute(as='newTable', temp=True).filter(ct>500)

niviksha avatar Dec 06 '20 02:12 niviksha

this is the checkpointing concept - and is not limited to sql

would take a bit of effort to add this but is pretty useful generally

PRs and design docs welcome

jreback avatar Dec 06 '20 02:12 jreback

Currently, there are some ibis backends that implement load_data or insert method, that is used to write into a table.

Ibis OmniSci backend already has an option for temporary table creation using conn.create_table('table_name', schema=schema, is_temporary=True). it was not implemented the creation of a temporary table from an ibis expression, but to achieve that would require small changes

Some idea for this implementation would be:

  • create a method called persist (borrowing the name from dask, but I am open for any name) that will receive the parameters: table_name, is_temporary (that specify if the table is temporary or not) and append (that specify if the data will be stored in an existent table or if a new table should be created);
  • persist compiles and runs the SQL statement for the given expression and returns a TableExpr.

so it will allow a construction like expr2 = expr1.persist('myNewTable', is_temporary=True) and expr2 is a TableExpr that poinst to myNewTable table.

@jreback do you have already any thought for this implementation?

xmnlab avatar Dec 08 '20 02:12 xmnlab

I have an implementation that we will push to open source at some point. however this is actually pretty easy, you just add a .checkpoint() api (and Checkpoint node), which for example could have temp table, file or memory backing stores. then an execute on the expression should just materialize at that point in the graph.

some semantics are TBD, e.g. if you have a checkpoint node in a graph do you always just check for saved state? or is this a special action.

jreback avatar Dec 08 '20 14:12 jreback