superset
superset copied to clipboard
feat: a native SQLAlchemy dialect for Superset
SUMMARY
This PR introduces a new SQLAlchemy dialect, superset://
, together with a corresponding DB engine spec. The SQLAlchemy dialect is built on top of shillelagh, a framework for building SQLAlchemy dialects and mapping resources to virtual tables so they can be queried via SQL.
With this, users can create a new database using the superset://
SQLAlchemy URI, and use it to write queries like this:
SELECT * FROM "superset.examples.birth_names"
Queries can also join data from multiple databases. Eg, if we have a database called external_database
, with a table called events
, we can do:
SELECT * FROM "superset.examples.birth_names"
WHERE num_boys >
SELECT 5*MAX(CAST(value AS FLOAT)) FROM "superset.external_database.events"
)
A few comments:
- The dialect is somewhat efficient, pushing predicates to the corresponding databases whenever possible.
- The dialect uses Superset's security manager to prevent users from accessing unauthorized databases.
- DML is supported, as long as DML is enabled in the
superset://
database and all the related databases.
HOW DOES IT WORK?
Shillelagh simplifies the process of querying data sources using SQL. Each data source is handled by an adapter that implements the following methods (a simple example):
def get_data(self, bounds: Dict[str, Filter], order: List[Tuple[str, RequestedOrder]]) -> Iterator[Row]:
"""
Handle `SELECT`ing rows.
`bounds` is a dictionary of filters that should be applied to columns
`order` is a list of column names and the order the should be sorted
`Row` is just an alias for `Dict[str, Any]`
"""
pass
def insert_row(self, row: Row) -> int:
"""
Handle `INSERT`ing rows.
Returns the ID of the newly inserted row.
"""
pass
def delete_row(self, row_id: int) -> None:
"""
Handle `DELETE`ing rows.
"""
pass
def update_row(self, row_id: int, row: Row) -> None:
"""
Handle `UPDATE`ing rows.
Optional, if not defined `delete_row` and `insert_row` are called.
"""
pass
For the Superset adapter, the methods interact with the corresponding database using SQLAlchemy. Everything else (SQL parsing, joins, etc.) is handled by SQLite.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TEST PLAN
Added unit tests.
ADDITIONAL INFORMATION
- [ ] Has associated issue:
- [ ] Changes UI
- [ ] Includes DB Migration (follow approval process in SIP-59)
- [ ] Migration is atomic, supports rollback & is backwards-compatible
- [ ] Confirm DB migration upgrade and downgrade tested
- [ ] Runtime estimates and downtime expectations provided
- [ ] Introduces new feature or API
- [ ] Removes existing feature or API
Sounds interesting! Could you provide some possible Use Cases? Is this a step towards paving the path to a LookML-like modeling layer?
Sounds interesting! Could you provide some possible Use Cases? Is this a step towards paving the path to a LookML-like modeling layer?
Yeah, that's my vision. Improving the semantic layer in Superset so we can do things like:
- Define dimension tables, declare the relationship ("foreign keys") in datasets, and have auto-joins in the Explore view, regardless of where the tables live. There's some performance concerns here, so we'd probably want to limit the cardinality of the dimensions.
- Move data between DBs. This dialect supports DML, so you can
SELECT
from a table in one DB andINSERT
into a table in another. This could also help materializing datasets/dimensions in different databases to improve JOINs. - Build connectors to APIs. This dialect is built on top of shillelagh, which helps building SQL connectors to APIs. We can selectively enable in Superset some of the adapters supported by shillelagh. Imagine using a Google Calendar to annotate events on a time series, for example.
Thanks for the explanation. I wonder, if all tables of a query built by explore are accessible by the same DB connection -- e.g. different DBs, but the same Snowflake account -- will this method be smart enough to push down the whole query? Our vision is to integrate all relevant sources into Snowflake, in order to have them available for arbitrary client softwares. So, having a modeling layer would be highly welcome, but we would not want to query and merge from different connections.
BTW, is this the right place for such a discussion, or would Slack or the mailing list be more appropriate?
I wonder, if all tables of a query built by explore are accessible by the same DB connection -- e.g. different DBs, but the same Snowflake account -- will this method be smart enough to push down the whole query?
It would not, but that's an interesting use case. Presto has a similar use case, because we don't support multiple catalogs in a single DB, you need to create a DB per catalog.
Codecov Report
Merging #14225 (2dbd811) into master (9c54280) will increase coverage by
0.07%
. Report is 6 commits behind head on master. The diff coverage is88.20%
.
:exclamation: Current head 2dbd811 differs from pull request most recent head 0aef8eb. Consider uploading reports for the commit 0aef8eb to get more accurate results
@@ Coverage Diff @@
## master #14225 +/- ##
==========================================
+ Coverage 69.00% 69.08% +0.07%
==========================================
Files 1906 1907 +1
Lines 74149 74303 +154
Branches 8211 8208 -3
==========================================
+ Hits 51169 51331 +162
+ Misses 20856 20850 -6
+ Partials 2124 2122 -2
Flag | Coverage Δ | |
---|---|---|
hive | 54.11% <40.00%> (-0.06%) |
:arrow_down: |
mysql | 79.02% <41.14%> (-0.19%) |
:arrow_down: |
postgres | 79.13% <41.14%> (-0.19%) |
:arrow_down: |
presto | 54.01% <40.00%> (-0.06%) |
:arrow_down: |
python | 83.39% <88.00%> (+0.01%) |
:arrow_up: |
sqlite | 77.71% <41.14%> (-0.19%) |
:arrow_down: |
unit | 55.22% <88.00%> (+0.17%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
Files Changed | Coverage Δ | |
---|---|---|
superset-frontend/src/SqlLab/App.jsx | 0.00% <ø> (ø) |
|
superset-frontend/src/SqlLab/actions/sqlLab.js | 69.62% <ø> (ø) |
|
...d/src/SqlLab/components/ShareSqlLabQuery/index.tsx | 81.48% <ø> (ø) |
|
...frontend/src/SqlLab/components/SouthPane/index.tsx | 79.54% <ø> (ø) |
|
...frontend/src/SqlLab/components/SqlEditor/index.jsx | 52.73% <ø> (ø) |
|
...d/src/SqlLab/components/TabbedSqlEditors/index.jsx | 50.00% <ø> (ø) |
|
superset-frontend/src/components/Chart/Chart.jsx | 55.35% <ø> (ø) |
|
...erset-frontend/src/components/Chart/chartAction.js | 55.55% <ø> (ø) |
|
...end/src/components/Datasource/DatasourceEditor.jsx | 65.63% <ø> (ø) |
|
...tend/src/components/Datasource/DatasourceModal.tsx | 75.00% <ø> (ø) |
|
... and 49 more |
... and 3 files with indirect coverage changes
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
- implement row limit?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned
to prevent stale bot from closing the issue.
Any update on this?
I think this shillelagh could be the basis of creating connectors to non-relational databases. This is what I need at the moment. I wanted to plot data from a non-relational source, but Superset only understand SQLAlchemy dialects.
Any update on this?
No updates yet. I'll address @dpgaspar's issue similarly to how I did https://github.com/apache/superset/pull/19999 and try to get this for review again in the next week.
I think this shillelagh could be the basis of creating connectors to non-relational databases. This is what I need at the moment. I wanted to plot data from a non-relational source, but Superset only understand SQLAlchemy dialects.
That has been the goal, what source are you interested in? I wrote shillelagh
mostly for the Gsheets connectivity, but people have used it to connect Superset to graphql and airtable.
Rebased and revisited the PR, improving the unit tests and addressing concerns about g.user
.
That has been the goal, what source are you interested in?
Mostly Asam ODS (standard but not open source stuff - https://www.asam.net/standards/detail/ods/) and Elastic Search. I even started to write a library exactly like shillelagh by my own: wrote a sql parser with antlr, then started to write a SQLAlchemy dialect implementation afterward.
Mostly Asam ODS (standard but not open source stuff - https://www.asam.net/standards/detail/ods/) and Elastic Search. I even started to write a library exactly like shillelagh by my own: wrote a sql parser with antlr, then started to write a SQLAlchemy dialect implementation afterward.
Take a look at shillelagh
then, there's a tutorial on creating a new adapter. shillelagh
takes care of all the SQL (parsing, joining, etc.) and you only need to worry about reading data.
This is amazing feature, I can help on test + review + fix conflicts and get this merged.
@mayurnewase: This is amazing feature, I can help on test + review + fix conflicts and get this merged.
I'm going to rebase this and give it another try. Maybe we should add some safeguards to prevent huge joins, like a configurable auto-LIMIT
on the queries. I think that would address the main concern I've seen about this feature.
OK, I think this is ready for it's annual review! 😆
@danilomo I think this shillelagh could be the basis of creating connectors to non-relational databases. This is what I need at the moment. I wanted to plot data from a non-relational source, but Superset only understand SQLAlchemy dialects.
This might interest you: https://preset.io/blog/accessing-apis-with-superset/
@michael-s-molina do we plan to include this in 3.0?
@michael-s-molina do we plan to include this in 3.0?
Given the experimental nature of this feature, I think it's a good idea to give it some time to mature. It will likely be included in a minor release.
data:image/s3,"s3://crabby-images/6c246/6c24626cae5b897df67d520fb294fb55a1424470" alt=""
@danilomo I think this shillelagh could be the basis of creating connectors to non-relational databases. This is what I need at the moment. I wanted to plot data from a non-relational source, but Superset only understand SQLAlchemy dialects.
This might interest you: https://preset.io/blog/accessing-apis-with-superset/
Thanks a lot, man!
Edit
"merged"
Wait, what? This is fantastic, lol
Congratulations!
It appears that the latest release, v3.0.1 (which was released two weeks ago), still doesn't include this feature.
However, this feature has been mentioned in https://superset.apache.org/docs/databases/meta-database/
When will we have access to this feature?
If I can't wait, how can I build the application with the latest code? Could you please provide instructions?