superset fix: added parameter to optionally rename aliases of aggregated columns used in GROUP BY statments

SUMMARY

Fixed issue #28443 Renamed db_engine_specs parameter allows_alias_to_source_column to order_by_allows_alias_to_source_column and added the new parameter group_by_allows_alias_to_source_column.

The previous parameter is used to tell the SQLA generator to rename aliases used in ORDER BY statements with aggregations, to ensure that the source column is referenced. Some engines (e.g. Drill) needs to be able to do the same thing for aliases in GROUP BY statements.

The new parameter is used to tell the SQLA generator to rename any alias of a source column that is used in an aggregation in a GROUP BY statement, to ensure that the source column is referenced.

Added documentation of the new group_by_allows_alias_to_source_column parameter, and fixed errors in the documentation of order_by_allows_alias_to_source_column/(previously)allows_alias_to_source_column

For example this query

SELECT length(n_name) AS n_name
...
GROUP BY length(n_name)

becomes

SELECT length(n_name) AS n_name__
...
GROUP BY length(n_name)

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Screenshot from 2024-05-12 19-37-24

TESTING INSTRUCTIONS

Start Superset
Navigate to the Superset web application and login
Connect to a Drill database
Create a dataset of the Drill database
Create a chart from that dataset
Select visualization type Table with query mode Aggregate
Add two columns in the dimensions
Aggregate the data of one of the columns (e.g. length(column_a))
Press "view query"
The query should now contain a GROUP BY statement of an aggregation, and the alias of that aggregation should have "__" at the end of the name. Example:

SELECT length(n_name) AS n_name__
...
GROUP BY length(n_name)

ADDITIONAL INFORMATION

[X] Has associated issue: Fixes #28443
[ ] Required feature flags:
[ ] Changes UI
[ ] Includes DB Migration (follow approval process in SIP-59)
- [ ] Migration is atomic, supports rollback & is backwards-compatible
- [ ] Confirm DB migration upgrade and downgrade tested
- [ ] Runtime estimates and downtime expectations provided
[ ] Introduces new feature or API
[ ] Removes existing feature or API

May 12 '24 18:05 fhyy

@fhyy thanks for the PR! I believe some other engines have tackled this with the _mutate_label function. For instance, you can check how the ClickHouse spec does this (the method being private is slightly confusing, and this should probably be fixed). Can you check if adding similar logic to the Drill spec would solve this issue?

May 12 '24 18:05 villebro

Codecov Report

Attention: Patch coverage is 62.22222% with 17 lines in your changes missing coverage. Please review.

Project coverage is 70.15%. Comparing base (4720b4f) to head (851f5e5). Report is 567 commits behind head on master.

Files	Patch %	Lines
superset/models/helpers.py	6.25%	15 Missing :warning:
superset/connectors/sqla/models.py	83.33%	2 Missing :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #28444      +/-   ##
==========================================
+ Coverage   69.76%   70.15%   +0.39%     
==========================================
  Files        1911     1921      +10     
  Lines       74994    76157    +1163     
  Branches     8353     8353              
==========================================
+ Hits        52316    53427    +1111     
- Misses      20629    20681      +52     
  Partials     2049     2049

Flag	Coverage Δ
hive	`49.01% <42.22%> (?)`
mysql	`?`
postgres	`77.14% <48.88%> (-0.88%)`	:arrow_down:
presto	`53.59% <60.00%> (?)`
python	`83.30% <62.22%> (+0.39%)`	:arrow_up:
sqlite	`76.59% <48.88%> (-0.87%)`	:arrow_down:
unit	`58.74% <46.66%> (+1.96%)`	:arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

May 12 '24 18:05 codecov[bot]

Thanks @fhyy for the PR. My thinking was the underlying SQLAlchemy dialect should handle the various nuances in terms of how/where aliases are defined.

May 13 '24 16:05 john-bodley

CC @cgivre

May 13 '24 21:05 rusackas

Thanks for the quick feedback!

@villebro I tried the _mutate_label function and it also solves the issue, but I'm not sure if I agree that this is the solution to the actual problem. If I understand that function correctly, it is used to modify labels so that they are compatible with the different engines. An engine could require lowercase letters, or starting with an underscore, etc.

Drill supports the aliases as they are, it just behaves unexpectedly with GROUP BY statements as Drill added support for the use of aliases here.

I agree with @john-bodley that the dialects should handle it instead. A boolean parameter isn't as flexible as it may need to be. In this case I would also propose updating the previous allows_alias_to_source_column parameter as well.

Thoughts on this?

May 15 '24 12:05 fhyy

I know I'm a little late to this conversation, and I'm happy to help out where I can, but I'm very unclear as to what the actual issue is. Is the fact that Drill supports aliases in GROUP BY breaking things?

May 15 '24 13:05 cgivre

Yes. The fact that Drill supports aliases in GROUP BY makes this example query ambiguous:

SELECT length(n_name) AS n_name
FROM
  (select * from cp.`tpch/nation.parquet`)
GROUP BY length(n_name)
LIMIT 10;

Drill will use the alias n_name in GROUP BY length(n_name) which results in the actual statement GROUP BY length(length(n_name)).

Renaming such aliases solves the problem, but not all aliases needs to be renamed.

You can find more information in issue #28443

May 15 '24 14:05 fhyy

@cgivre : See my comment on issue 20349 for what the problem is with GROUP BY in Apache Drill.

Jun 03 '24 05:06 mbrannstrom

Closing this for now. I'm working on fixing the issues with the tests and also updating my fork to the latest version. I managed to break this PR in the process...

Jun 30 '24 14:06 fhyy

I created a new pull request with a slightly different implementation and based of the newer version: #29455

Jul 03 '24 08:07 fhyy