fix: added parameter to optionally rename aliases of aggregated columns used in GROUP BY statments
SUMMARY
Fixed issue #28443
Renamed db_engine_specs parameter allows_alias_to_source_column to order_by_allows_alias_to_source_column and added the new parameter group_by_allows_alias_to_source_column.
The previous parameter is used to tell the SQLA generator to rename aliases used in ORDER BY statements with aggregations, to ensure that the source column is referenced. Some engines (e.g. Drill) needs to be able to do the same thing for aliases in GROUP BY statements.
The new parameter is used to tell the SQLA generator to rename any alias of a source column that is used in an aggregation in a GROUP BY statement, to ensure that the source column is referenced.
Added documentation of the new group_by_allows_alias_to_source_column parameter, and fixed errors in the documentation of order_by_allows_alias_to_source_column/(previously)allows_alias_to_source_column
For example this query
SELECT length(n_name) AS n_name
...
GROUP BY length(n_name)
becomes
SELECT length(n_name) AS n_name__
...
GROUP BY length(n_name)
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
- Start Superset
- Navigate to the Superset web application and login
- Connect to a Drill database
- Create a dataset of the Drill database
- Create a chart from that dataset
- Select visualization type Table with query mode Aggregate
- Add two columns in the dimensions
- Aggregate the data of one of the columns (e.g. length(column_a))
- Press "view query"
- The query should now contain a GROUP BY statement of an aggregation, and the alias of that aggregation should have "__" at the end of the name. Example:
SELECT length(n_name) AS n_name__
...
GROUP BY length(n_name)
ADDITIONAL INFORMATION
- [X] Has associated issue: Fixes #28443
- [ ] Required feature flags:
- [ ] Changes UI
- [ ] Includes DB Migration (follow approval process in SIP-59)
- [ ] Migration is atomic, supports rollback & is backwards-compatible
- [ ] Confirm DB migration upgrade and downgrade tested
- [ ] Runtime estimates and downtime expectations provided
- [ ] Introduces new feature or API
- [ ] Removes existing feature or API
@fhyy thanks for the PR! I believe some other engines have tackled this with the _mutate_label function. For instance, you can check how the ClickHouse spec does this (the method being private is slightly confusing, and this should probably be fixed). Can you check if adding similar logic to the Drill spec would solve this issue?
Codecov Report
Attention: Patch coverage is 62.22222% with 17 lines in your changes missing coverage. Please review.
Project coverage is 70.15%. Comparing base (
4720b4f) to head (851f5e5). Report is 567 commits behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| superset/models/helpers.py | 6.25% | 15 Missing :warning: |
| superset/connectors/sqla/models.py | 83.33% | 2 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #28444 +/- ##
==========================================
+ Coverage 69.76% 70.15% +0.39%
==========================================
Files 1911 1921 +10
Lines 74994 76157 +1163
Branches 8353 8353
==========================================
+ Hits 52316 53427 +1111
- Misses 20629 20681 +52
Partials 2049 2049
| Flag | Coverage Δ | |
|---|---|---|
| hive | 49.01% <42.22%> (?) |
|
| mysql | ? |
|
| postgres | 77.14% <48.88%> (-0.88%) |
:arrow_down: |
| presto | 53.59% <60.00%> (?) |
|
| python | 83.30% <62.22%> (+0.39%) |
:arrow_up: |
| sqlite | 76.59% <48.88%> (-0.87%) |
:arrow_down: |
| unit | 58.74% <46.66%> (+1.96%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Thanks @fhyy for the PR. My thinking was the underlying SQLAlchemy dialect should handle the various nuances in terms of how/where aliases are defined.
CC @cgivre
Thanks for the quick feedback!
@villebro I tried the _mutate_label function and it also solves the issue, but I'm not sure if I agree that this is the solution to the actual problem. If I understand that function correctly, it is used to modify labels so that they are compatible with the different engines. An engine could require lowercase letters, or starting with an underscore, etc.
Drill supports the aliases as they are, it just behaves unexpectedly with GROUP BY statements as Drill added support for the use of aliases here.
I agree with @john-bodley that the dialects should handle it instead. A boolean parameter isn't as flexible as it may need to be. In this case I would also propose updating the previous allows_alias_to_source_column parameter as well.
Thoughts on this?
I know I'm a little late to this conversation, and I'm happy to help out where I can, but I'm very unclear as to what the actual issue is. Is the fact that Drill supports aliases in GROUP BY breaking things?
Yes. The fact that Drill supports aliases in GROUP BY makes this example query ambiguous:
SELECT length(n_name) AS n_name
FROM
(select * from cp.`tpch/nation.parquet`)
GROUP BY length(n_name)
LIMIT 10;
Drill will use the alias n_name in GROUP BY length(n_name) which results in the actual statement GROUP BY length(length(n_name)).
Renaming such aliases solves the problem, but not all aliases needs to be renamed.
You can find more information in issue #28443
@cgivre : See my comment on issue 20349 for what the problem is with GROUP BY in Apache Drill.
Closing this for now. I'm working on fixing the issues with the tests and also updating my fork to the latest version. I managed to break this PR in the process...
I created a new pull request with a slightly different implementation and based of the newer version: #29455