druid icon indicating copy to clipboard operation
druid copied to clipboard

useApproximateCountDistinct implicitly makes all aggregates using distinct unplannable

Open kgyrtkirk opened this issue 1 year ago • 8 comments

  • useApproximateCountDistinct supposed to enable a special mode to handle COUNT(DISTINCT x) with skethces
  • the rules are configured to remove the generic distinct handler rules here
  • so queries like select sum(distinct added) from wikipedia will fail to plan if useApproximateCountDistinct is enabled

quidem test:

!set useApproximateCountDistinct false
!use druidtest://?numMergeBuffers=3
!set outputformat mysql

select sum(distinct added) from wikipedia;
+---------+
| EXPR$0  |
+---------+
| 6455074 |
+---------+
(1 row)

!ok

!set useApproximateCountDistinct true
!use druidtest://?numMergeBuffers=3
select sum(distinct added) from wikipedia;
[...]
Missing conversion is LogicalAggregate[convention: NONE -> DRUID]
There is 1 empty subset: rel#105:RelSubset#2.DRUID.[], the relevant part of the original plan is as follows
103:LogicalAggregate(group=[{}], EXPR$0=[SUM(DISTINCT $0)])
  101:LogicalProject(subset=[rel#102:RelSubset#1.NONE.[]], added=[$18])
    74:LogicalTableScan(subset=[rel#100:RelSubset#0.NONE.[]], table=[[druid, wikipedia]])
[...]
QueryInterruptedException{msg=Query could not be planned. A possible reason is [Aggregation [SUM(DISTINCT $18)] is not supported2], code=Unknown exception, class=org.apache.druid.error.DruidException, host=null}
	at org.apache.druid.query.QueryInterruptedException.wrapIfNeeded(QueryInterruptedException.java:113)
[...]

kgyrtkirk avatar Jul 10 '24 08:07 kgyrtkirk

ideally we should first put a patch on this hole to explain the situation to the user; possibly by: adding a check to DruidSqlValidtor or something like that ; to ensure that when approx is enabled only count aggregates could have isDistinct enabled; and describe to disable approx

kgyrtkirk avatar Jul 10 '24 09:07 kgyrtkirk

@kgyrtkirk does you mean if there is DISTINCT inside COUNT and useApproximateCountDistinct = true,then throw a CalciteContextException in SqlValidatorImpl, and describe the message like:" we should disable approx" the in the CalciteContextException.

is that your expected result?

AlbericByte avatar Jul 11 '24 01:07 AlbericByte

something like that; if we have useApproximateCountDistinct only COUNT(DISTINCT x) will work as expecteded other AGG(DISTINCT ...) will produce the above misleading exception about not able to convert the plan...which will kinda leave the user without much clue :D

I think fix could be to rewrite the count(distinct) to the sketch functions explicitly at compilation time and leave the distinct conversion rules enabled at all time

but giving a better error with a hint could give a chance to the users bumping into this to use the feature which is already present :)

kgyrtkirk avatar Jul 11 '24 07:07 kgyrtkirk

@kgyrtkirk next time you can assign the task me, but this is good to learn as well.

AlbericByte avatar Jul 19 '24 06:07 AlbericByte

@AlbericByte: sorry, I haven't seen a PR for it / further comments....next time I'll ask you about it.

right now that PR only restrict it for window functions

However it turned out that the situation is a bit more complicated than I was expecting when I've filed this issue - we can't just ban distinct usage for non-count(distinct) cases - as there are aggregators like string_agg which could do the distinct aggregation by themselves while translating to the native later...

We were talking about introducing some annotation markers to enable the compiler to verify this and restrict it that way.

But I'll reach out to you next time to avoid such a misunderstanding(s) - thank you for letting me know that you was not expecting the above!

kgyrtkirk avatar Jul 19 '24 08:07 kgyrtkirk

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Apr 26 '25 00:04 github-actions[bot]

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

github-actions[bot] avatar May 24 '25 00:05 github-actions[bot]

Still seems relevant, reopening.

gianm avatar May 30 '25 23:05 gianm