druid icon indicating copy to clipboard operation
druid copied to clipboard

Aggregate functions on string columns

Open GWphua opened this issue 1 year ago • 1 comments

Fixes #16956.

Description

This PR introduces min() and max() aggregators for string columns: StringMinAggregator and StringMaxAggregator, and their respective buffer and vector aggregators. These aggregators compute the minimum and maximum string values respectively in a column during query execution.

I will start with a disclaimer before I go on listing what I have implemented. During my implementation of the aggregation functions, I have trouble understanding what how the VectorAggregator functions in Druid. For now, I have tried my best to piece the VectorAggregator classes together by referencing the StringAnyVectorAggregator class. Nevertheless, I would appreciate some pointers in how SingleValueDimensionValueSelector + MultiValueDimensionVectorSelector + aggregateMultipleValues fields are used in StringAnyVectorAggregator. Furthermore, I have not worked with expression enough to know whether we should include its use in the string aggregations (I see that StringAny uses aggregateMultipleValues but stringLast does not, and both string aggregators do not use expression, so let me know if that is not intended in stringMin as well). I would also appreciate any extra attention in the reviewing process of the VectorAggregator classes.

On the documentation side, I also note the mention of longMin / longMax under moving-average-query.md. Weirdly enough, I do not see any mention of float aggregators, so it led me to think that its only specific to double and long. Do let me know if my string aggregators should be added there.

Regarding the issue #11659 that is linked to #16956, I did some testing, and while my string aggregators can run on datasources:

SELECT min("user")
FROM "wikipedia"

Results:

*feridiák

Running the following will still not work.

SELECT
  min("start"), max("end")
FROM sys.segments

Results:

Error: RUNTIME_FAILURE (OPERATOR)
java.lang.reflect.InvocationTargetException
java.lang.RuntimeException

With that out of the way, let's go into the description of my implementations.


Add StringMin and StringMax Aggregators

  • Implemented StringMinAggregator to find the minimum string value in a column.
  • Implemented StringMaxAggregator to find the maximum string value in a column.
  • Implemented StringMinBufferAggregator and StringMaxBufferAggregator for efficient in-memory aggregation.
  • Implemented StringMinVectorAggregator and StringMaxVectorAggregator for vectorized query execution.

Setting up Aggregator Factory classes

  • Implemented StringMinAggregatorFactory and StringMaxAggregatorFactory to create the respective aggregators.
  • Allow creation of StringMinAggregatorFactory and StringMaxAggregatorFactory under createMinAggregatorFactory and createMaxAggregatorFactory respectively.
  • Change logic of Aggregations#getArgumentsForSimpleAggregator to accommodate the new String aggregation functions.
  • Add new cache key IDs under AggregatorUtils
  • Add the new aggregators as modules to AggregatorsModule.

Unit Tests

  • Added unit tests for all above classes, except VectorAggregators.
  • Added the new aggregator factory classes in AggregatorFactoryTest.
  • Remove AGGREGATION_NOT_SUPPORT_TYPE annotations under DrillWindowQueryTest to support testing of new aggregators.
  • Add new tests under DrillWindowQueryTest

Release note

New: You can now query string columns with min() and max() functions.


Key changed/added classes in this PR
  • StringMinAggregator
  • StringMaxAggregator
  • StringMinBufferAggregator
  • StringMaxBufferAggregator
  • StringMinVectorAggregator
  • StringMaxVectorAggregator
  • StringMinAggregatorFactory
  • StringMaxAggregatorFactory
  • StringMinAggregatorTest
  • StringMaxAggregatorTest
  • StringMinBufferAggregatorTest
  • StringMaxBufferAggregatorTest
  • StringMinVectorAggregatorTest
  • StringMaxVectorAggregatorTest
  • StringMinAggregatorFactoryTest
  • StringMaxAggregatorFactoryTest
  • aggregations.md

This PR has:

  • [x] been self-reviewed.
  • [x] added documentation for new or modified features or behaviors.
  • [x] a release note entry in the PR description.
  • [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [x] added or updated version, license, or notice information in licenses.yaml
  • [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [x] added integration tests.
  • [x] been tested in a test Druid cluster.

GWphua avatar Sep 30 '24 10:09 GWphua

Thanks for making the PR! I'll try to review this soon.

adarshsanjeev avatar Oct 08 '24 05:10 adarshsanjeev

Hello @adarshsanjeev, would like to ping and see how's it going 😄

GWphua avatar Nov 04 '24 01:11 GWphua

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Jan 04 '25 00:01 github-actions[bot]

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Mar 24 '25 00:03 github-actions[bot]

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Apr 21 '25 00:04 github-actions[bot]