zed icon indicating copy to clipboard operation
zed copied to clipboard

min/max aggregate functions should support non-numeric types

Open philrz opened this issue 1 year ago • 4 comments

The SQL spec indicates the MIN and MAX aggregate functions should work with many data types, but these functions in SuperDB currently only support numeric types.

Details

Repro is with super commit ec4a26f. This issue was spotted because ClickBench queries 21 & 22 happen to invoke MIN(URL) where URL is a string type.

These aggregate functions currently work fine with numbers.

$ super -version
Version: v1.18.0-230-gec4a26f8

$ echo '{"a":1} {"a":2} {"a":3}' | super -c "select max(a) from '/dev/stdin'"
{max:3}

However, with a non-numeric type such as strings, we return null.

$ echo '{"a":"foo"} {"a":"bar"} {"a":"baz"}' | super -c "select max(a) from '/dev/stdin'"
{max:null}

Meanwhile a SQL solution like DuckDB does return a result.

$ duckdb --version
v1.1.3 19864453f7

$ echo '{"a":"foo"} {"a":"bar"} {"a":"baz"}' | duckdb -c "select max(a) from read_json('/dev/stdin')"
┌─────────┐
│ max(a)  │
│ varchar │
├─────────┤
│ foo     │
└─────────┘

A reading of the 1992 SQL spec shows in section 6.5 that the results of MIN and MAX should be determined using the logic from section 8.2 on "comparison predicates", which in turn lists out the treatments for the many different data types including strings and many others.

philrz avatar Jan 17 '25 20:01 philrz

In a group discussion it was pointed out that the challenge for us vs. the incumbent SQL solutions is that we don't have the luxury of always assuming values of a single type are being aggregated. Therefore solving this correctly requires defining the "total order" among types. See #4545 for another use case affected by this.

philrz avatar Jan 21 '25 19:01 philrz

In another review of this topic, short of the comprehensive "total order" fix, @mccanne acknowledged we might be able to fit in some targeted enhancements before the first GA SuperDB release, e.g., come up with a way for min/max to support string types.

philrz avatar May 05 '25 16:05 philrz

Linked PR #5886 contains an enhancement that somewhat improves the situation here by adding support for the string type. This is verified below in super commit d349e41 by repeating the query shown above and showing how it now returns the "max string" based on alphabetic order rather than the null we saw before.

$ super -version
Version: d349e41bc

$ echo '{"a":"foo"} {"a":"bar"} {"a":"baz"}' | super -c "select max(a) from '/dev/stdin'"
{"max(a)":"foo"}

The comprehensive solution based on "total order" is expected to come later, so this issue is being held open pending that effort.

philrz avatar May 12 '25 20:05 philrz

In bug #5886, I mentioned how I originally attempted to work around this problem with an approach that used lateral subqueries, but that panic blocked me. Now that the bug has been fixed, I can revisit that workaround, although now that we've since added min/max support for the string type some of the original urgency is gone.

To extract the relevant essence of ClickBench query q21, here's a result that uses its pure SQL approach using the MIN aggregate function to find the alphabetically lowest URLs per search phrase that mention "google" and grab the first 10 of those when sorted by search phrase.

$ super -version
Version: f016b909b

$ SUPER_VAM=1 super -s -c "
SELECT SearchPhrase,
       MIN(URL) AS min_URL
FROM 'hits.csup'
WHERE URL LIKE '%google%'
  AND SearchPhrase <> ''
GROUP BY SearchPhrase
ORDER BY SearchPhrase
LIMIT 10;"

{SearchPhrase:"1n540v обзор картиру в ижевское",min_URL:"http://smeshariki.ru/index.ua/b_u_kolonkini/kenwoodki-bez-tebel_tatus=1&ru=1&encoding=google.ru/kategory&op_categoriya/zhira/phones"}
{SearchPhrase:"2 модельфе новостове-на-дону родинозавродробнаружна",min_URL:"http://smeshariki.ru/index.ua/auto_id=0&matched_country=-1&tv=-1&planet.ru%26bn%3D0%26ref%3D%26CompPath%3Dhttp://komandirovarenok.ru/googlead%26versiade-butto_uaz_s_35055&back.aspx?sort/stahion"}
{SearchPhrase:"3dnewsru.com/book attamentialized пермы правито много домашних бутылок должность",min_URL:"http://video.yandex.php?com=google.ru/arts/searchAutoSearch"}
{SearchPhrase:"7 психику без",min_URL:"http://gidroabrados/m234/page.google&kind=комая таблица:Рецепты_коктебе"}
{SearchPhrase:"abkvs",min_URL:"http://auto.ru/catalog/household_apps.google"}
{SearchPhrase:"actoria.ua/prodigy - night sony запрыгунки",min_URL:"http://smeshariki.ru/index.ru/real-estate.google.ru/stellect[170695,905764"}
{SearchPhrase:"beyonceptions смотреть",min_URL:"http://smeshariki.ru/index.ru/recipes/category_name=1&state/out-of-town/household_apps.google.ru/orb/event=little&category"}
{SearchPhrase:"c#+квести на андрей мобиля шурик",min_URL:"http://smeshariki.ru/index.ru/jobs-education=google%2F22.0) AppleWebKit%2F4.0 (Windows NT 5.1"}
{SearchPhrase:"dave kino 2013 года в ростопримеча",min_URL:"http:%2F%2Fsapozhki-advertime-2/#page.google/dodge"}
{SearchPhrase:"dave kino 2013 году парфюм в",min_URL:"http:%2F%2Fsapozhki-advertime-2/#page.google/dodge"}

And here's the workaround that doesn't use MIN. Instead all the filtering and grouping are performed as a first step, then a pipe-based lateral subquery performs the functional equivalent of MIN by sorting arrays of each URL grouping and picking off the first value of each post-sort, producing the same output.

$ SUPER_VAM=1 super -s -c "
SELECT SearchPhrase,
       URL
FROM 'hits.csup'
WHERE URL LIKE '%google%'
  AND SearchPhrase <> ''
GROUP BY SearchPhrase,
         URL
| URLs := collect(URL) by SearchPhrase
| unnest {SearchPhrase, URLs} into (
  sort URLs
  | head 1
  | values {SearchPhrase, min_URL:URLs}
)
| sort SearchPhrase
| head 10"

{SearchPhrase:"1n540v обзор картиру в ижевское",min_URL:"http://smeshariki.ru/index.ua/b_u_kolonkini/kenwoodki-bez-tebel_tatus=1&ru=1&encoding=google.ru/kategory&op_categoriya/zhira/phones"}
{SearchPhrase:"2 модельфе новостове-на-дону родинозавродробнаружна",min_URL:"http://smeshariki.ru/index.ua/auto_id=0&matched_country=-1&tv=-1&planet.ru%26bn%3D0%26ref%3D%26CompPath%3Dhttp://komandirovarenok.ru/googlead%26versiade-butto_uaz_s_35055&back.aspx?sort/stahion"}
{SearchPhrase:"3dnewsru.com/book attamentialized пермы правито много домашних бутылок должность",min_URL:"http://video.yandex.php?com=google.ru/arts/searchAutoSearch"}
{SearchPhrase:"7 психику без",min_URL:"http://gidroabrados/m234/page.google&kind=комая таблица:Рецепты_коктебе"}
{SearchPhrase:"abkvs",min_URL:"http://auto.ru/catalog/household_apps.google"}
{SearchPhrase:"actoria.ua/prodigy - night sony запрыгунки",min_URL:"http://smeshariki.ru/index.ru/real-estate.google.ru/stellect[170695,905764"}
{SearchPhrase:"beyonceptions смотреть",min_URL:"http://smeshariki.ru/index.ru/recipes/category_name=1&state/out-of-town/household_apps.google.ru/orb/event=little&category"}
{SearchPhrase:"c#+квести на андрей мобиля шурик",min_URL:"http://smeshariki.ru/index.ru/jobs-education=google%2F22.0) AppleWebKit%2F4.0 (Windows NT 5.1"}
{SearchPhrase:"dave kino 2013 года в ростопримеча",min_URL:"http:%2F%2Fsapozhki-advertime-2/#page.google/dodge"}
{SearchPhrase:"dave kino 2013 году парфюм в",min_URL:"http:%2F%2Fsapozhki-advertime-2/#page.google/dodge"}

Some may find this approach useful while waiting on the comprehensive "total order" approach that would provide MIN/MAX support for all types. It may also be useful for use cases where picking out a single value is not as simple as finding the MIN, MAX, or the result of some other "stock" aggregate function, i.e., the user would be free to do arbitrarily complex sorting or other processing of the groups inside the lateral subquery.

philrz avatar Jul 18 '25 21:07 philrz