superset icon indicating copy to clipboard operation
superset copied to clipboard

feat: a native SQLAlchemy dialect for Superset

Open betodealmeida opened this issue 3 years ago • 12 comments

SUMMARY

This PR introduces a new SQLAlchemy dialect, superset://, together with a corresponding DB engine spec. The SQLAlchemy dialect is built on top of shillelagh, a framework for building SQLAlchemy dialects and mapping resources to virtual tables so they can be queried via SQL.

With this, users can create a new database using the superset:// SQLAlchemy URI, and use it to write queries like this:

SELECT * FROM "superset.examples.birth_names"

Queries can also join data from multiple databases. Eg, if we have a database called external_database, with a table called events, we can do:

SELECT * FROM "superset.examples.birth_names" 
WHERE num_boys > 
    SELECT 5*MAX(CAST(value AS FLOAT)) FROM "superset.external_database.events"
)

A few comments:

  1. The dialect is somewhat efficient, pushing predicates to the corresponding databases whenever possible.
  2. The dialect uses Superset's security manager to prevent users from accessing unauthorized databases.
  3. DML is supported, as long as DML is enabled in the superset:// database and all the related databases.

HOW DOES IT WORK?

Shillelagh simplifies the process of querying data sources using SQL. Each data source is handled by an adapter that implements the following methods (a simple example):

def get_data(self, bounds: Dict[str, Filter], order: List[Tuple[str, RequestedOrder]]) -> Iterator[Row]:
    """
    Handle `SELECT`ing rows.

    `bounds` is a dictionary of filters that should be applied to columns
    `order` is a list of column names and the order the should be sorted
    `Row` is just an alias for `Dict[str, Any]`
    """
    pass

def insert_row(self, row: Row) -> int:
    """
    Handle `INSERT`ing rows.

    Returns the ID of the newly inserted row.
    """
    pass

def delete_row(self, row_id: int) -> None:
    """
    Handle `DELETE`ing rows.
    """
    pass

def update_row(self, row_id: int, row: Row) -> None:
    """
    Handle `UPDATE`ing rows.

    Optional, if not defined `delete_row` and `insert_row` are called.
    """
    pass

For the Superset adapter, the methods interact with the corresponding database using SQLAlchemy. Everything else (SQL parsing, joins, etc.) is handled by SQLite.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Screenshot_2021-04-16 Superset(1)

TEST PLAN

Added unit tests.

ADDITIONAL INFORMATION

  • [ ] Has associated issue:
  • [ ] Changes UI
  • [ ] Includes DB Migration (follow approval process in SIP-59)
    • [ ] Migration is atomic, supports rollback & is backwards-compatible
    • [ ] Confirm DB migration upgrade and downgrade tested
    • [ ] Runtime estimates and downtime expectations provided
  • [ ] Introduces new feature or API
  • [ ] Removes existing feature or API

betodealmeida avatar Apr 18 '21 23:04 betodealmeida

Sounds interesting! Could you provide some possible Use Cases? Is this a step towards paving the path to a LookML-like modeling layer?

rumbin avatar Apr 19 '21 21:04 rumbin

Sounds interesting! Could you provide some possible Use Cases? Is this a step towards paving the path to a LookML-like modeling layer?

Yeah, that's my vision. Improving the semantic layer in Superset so we can do things like:

  1. Define dimension tables, declare the relationship ("foreign keys") in datasets, and have auto-joins in the Explore view, regardless of where the tables live. There's some performance concerns here, so we'd probably want to limit the cardinality of the dimensions.
  2. Move data between DBs. This dialect supports DML, so you can SELECT from a table in one DB and INSERT into a table in another. This could also help materializing datasets/dimensions in different databases to improve JOINs.
  3. Build connectors to APIs. This dialect is built on top of shillelagh, which helps building SQL connectors to APIs. We can selectively enable in Superset some of the adapters supported by shillelagh. Imagine using a Google Calendar to annotate events on a time series, for example.

betodealmeida avatar Apr 20 '21 02:04 betodealmeida

Thanks for the explanation. I wonder, if all tables of a query built by explore are accessible by the same DB connection -- e.g. different DBs, but the same Snowflake account -- will this method be smart enough to push down the whole query? Our vision is to integrate all relevant sources into Snowflake, in order to have them available for arbitrary client softwares. So, having a modeling layer would be highly welcome, but we would not want to query and merge from different connections.

BTW, is this the right place for such a discussion, or would Slack or the mailing list be more appropriate?

rumbin avatar Apr 20 '21 05:04 rumbin

I wonder, if all tables of a query built by explore are accessible by the same DB connection -- e.g. different DBs, but the same Snowflake account -- will this method be smart enough to push down the whole query?

It would not, but that's an interesting use case. Presto has a similar use case, because we don't support multiple catalogs in a single DB, you need to create a DB per catalog.

betodealmeida avatar Apr 20 '21 05:04 betodealmeida

Codecov Report

Merging #14225 (2dbd811) into master (9c54280) will increase coverage by 0.07%. Report is 6 commits behind head on master. The diff coverage is 88.20%.

:exclamation: Current head 2dbd811 differs from pull request most recent head 0aef8eb. Consider uploading reports for the commit 0aef8eb to get more accurate results

@@            Coverage Diff             @@
##           master   #14225      +/-   ##
==========================================
+ Coverage   69.00%   69.08%   +0.07%     
==========================================
  Files        1906     1907       +1     
  Lines       74149    74303     +154     
  Branches     8211     8208       -3     
==========================================
+ Hits        51169    51331     +162     
+ Misses      20856    20850       -6     
+ Partials     2124     2122       -2     
Flag Coverage Δ
hive 54.11% <40.00%> (-0.06%) :arrow_down:
mysql 79.02% <41.14%> (-0.19%) :arrow_down:
postgres 79.13% <41.14%> (-0.19%) :arrow_down:
presto 54.01% <40.00%> (-0.06%) :arrow_down:
python 83.39% <88.00%> (+0.01%) :arrow_up:
sqlite 77.71% <41.14%> (-0.19%) :arrow_down:
unit 55.22% <88.00%> (+0.17%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
superset-frontend/src/SqlLab/App.jsx 0.00% <ø> (ø)
superset-frontend/src/SqlLab/actions/sqlLab.js 69.62% <ø> (ø)
...d/src/SqlLab/components/ShareSqlLabQuery/index.tsx 81.48% <ø> (ø)
...frontend/src/SqlLab/components/SouthPane/index.tsx 79.54% <ø> (ø)
...frontend/src/SqlLab/components/SqlEditor/index.jsx 52.73% <ø> (ø)
...d/src/SqlLab/components/TabbedSqlEditors/index.jsx 50.00% <ø> (ø)
superset-frontend/src/components/Chart/Chart.jsx 55.35% <ø> (ø)
...erset-frontend/src/components/Chart/chartAction.js 55.55% <ø> (ø)
...end/src/components/Datasource/DatasourceEditor.jsx 65.63% <ø> (ø)
...tend/src/components/Datasource/DatasourceModal.tsx 75.00% <ø> (ø)
... and 49 more

... and 3 files with indirect coverage changes

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov[bot] avatar Apr 23 '21 23:04 codecov[bot]

  • implement row limit?

betodealmeida avatar Sep 08 '21 23:09 betodealmeida

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

stale[bot] avatar Apr 16 '22 15:04 stale[bot]

Any update on this?

I think this shillelagh could be the basis of creating connectors to non-relational databases. This is what I need at the moment. I wanted to plot data from a non-relational source, but Superset only understand SQLAlchemy dialects.

danilomo avatar May 17 '22 06:05 danilomo

Any update on this?

No updates yet. I'll address @dpgaspar's issue similarly to how I did https://github.com/apache/superset/pull/19999 and try to get this for review again in the next week.

I think this shillelagh could be the basis of creating connectors to non-relational databases. This is what I need at the moment. I wanted to plot data from a non-relational source, but Superset only understand SQLAlchemy dialects.

That has been the goal, what source are you interested in? I wrote shillelagh mostly for the Gsheets connectivity, but people have used it to connect Superset to graphql and airtable.

betodealmeida avatar May 17 '22 15:05 betodealmeida

Rebased and revisited the PR, improving the unit tests and addressing concerns about g.user.

betodealmeida avatar May 17 '22 17:05 betodealmeida

That has been the goal, what source are you interested in?

Mostly Asam ODS (standard but not open source stuff - https://www.asam.net/standards/detail/ods/) and Elastic Search. I even started to write a library exactly like shillelagh by my own: wrote a sql parser with antlr, then started to write a SQLAlchemy dialect implementation afterward.

danilomo avatar May 18 '22 06:05 danilomo

Mostly Asam ODS (standard but not open source stuff - https://www.asam.net/standards/detail/ods/) and Elastic Search. I even started to write a library exactly like shillelagh by my own: wrote a sql parser with antlr, then started to write a SQLAlchemy dialect implementation afterward.

Take a look at shillelagh then, there's a tutorial on creating a new adapter. shillelagh takes care of all the SQL (parsing, joining, etc.) and you only need to worry about reading data.

betodealmeida avatar May 19 '22 15:05 betodealmeida

This is amazing feature, I can help on test + review + fix conflicts and get this merged.

mayurnewase avatar Mar 23 '23 05:03 mayurnewase

@mayurnewase: This is amazing feature, I can help on test + review + fix conflicts and get this merged.

I'm going to rebase this and give it another try. Maybe we should add some safeguards to prevent huge joins, like a configurable auto-LIMIT on the queries. I think that would address the main concern I've seen about this feature.

betodealmeida avatar Aug 10 '23 00:08 betodealmeida

OK, I think this is ready for it's annual review! 😆

betodealmeida avatar Aug 11 '23 04:08 betodealmeida

@danilomo I think this shillelagh could be the basis of creating connectors to non-relational databases. This is what I need at the moment. I wanted to plot data from a non-relational source, but Superset only understand SQLAlchemy dialects.

This might interest you: https://preset.io/blog/accessing-apis-with-superset/

betodealmeida avatar Aug 11 '23 15:08 betodealmeida

@michael-s-molina do we plan to include this in 3.0?

mdeshmu avatar Aug 14 '23 11:08 mdeshmu

@michael-s-molina do we plan to include this in 3.0?

Given the experimental nature of this feature, I think it's a good idea to give it some time to mature. It will likely be included in a minor release.

michael-s-molina avatar Aug 14 '23 11:08 michael-s-molina

betodealmeida avatar Aug 18 '23 20:08 betodealmeida

@danilomo I think this shillelagh could be the basis of creating connectors to non-relational databases. This is what I need at the moment. I wanted to plot data from a non-relational source, but Superset only understand SQLAlchemy dialects.

This might interest you: https://preset.io/blog/accessing-apis-with-superset/

Thanks a lot, man!

Edit

"merged"

Wait, what? This is fantastic, lol

Congratulations!

danilomo avatar Aug 20 '23 11:08 danilomo

It appears that the latest release, v3.0.1 (which was released two weeks ago), still doesn't include this feature.

However, this feature has been mentioned in https://superset.apache.org/docs/databases/meta-database/

image

When will we have access to this feature?

If I can't wait, how can I build the application with the latest code? Could you please provide instructions?

ozbillwang avatar Nov 03 '23 12:11 ozbillwang