rfcs User defined aggregate functions

User defined aggregate functions

Open haikyuu opened this issue 4 years ago • 12 comments

Taken from Spectrum.

This is how i first imagined the syntax:

FOR accumulator, variable IN "{" iterator-set [, ...]  "}"
REDUCE output-expr ;

Note that this doesn't allow defining arbitrary aggregate functions, but rather executing them, just like FOR.

Example 1

FOR sentence = "", word IN { "hello", "world" }
REDUCE (
  SELECT result ++ " " ++ word
)

This would give a result of hello world.

We can also access the current index by using _index and the original set using_source

FOR sentence = "", word IN { "hello", "world" }
REDUCE (
  SELECT (
    (<str>(SELECT count(_source)) ++ " ") IF _index = 1 ELSE "")
    ++ result ++ " " ++ word ++ " "
)

Result: 2 hello world

Jun 01 '20 09:06 haikyuu

We can also access the current index by using _index and the original set using _source

Generally, I'd avoid introducing magic variables. If you need the original set, alias it, and index can be obtained with enumerate, e.g:

WITH words := { "hello", "world" }
FOR sentence = "", word IN enumerate(words)
REDUCE (
  SELECT (
    (<str>(SELECT count(words)) ++ " ") IF word.0 = 1 ELSE "")
    ++ sentence ++ " " ++ word.1 ++ " "
)

With that, I think there are two orthogonal problems here: recurrence relations and aggregation.

Recurrence relations are set-returning functions where each subsequent element is defined as a function of preceding elements. The Fibonacci sequence is a classic example of this. These are represented in SQL as recursive CTEs, but we can do better in EdgeQL and actually define them as recursive functions (possibly allowing ad-hoc definitions of lambdas in a query).
Custom aggregation. The standard set of aggregate functions actually covers most common use cases. For example your sentence generation example can already be accomplished by a combination of array_agg and array_join. For cases where custom aggregation is truly needed, I think PostgreSQL's CREATE AGGREGATE would be a good starting point.

Jun 03 '20 19:06 elprans

Recurrence relations are set-returning functions where each subsequent element is defined as a function of preceding elements. The Fibonacci sequence is a classic example of this. These are represented in SQL as recursive CTEs, but we can do better in EdgeQL and actually define them as recursive functions (possibly allowing ad-hoc definitions of lambdas in a query).

And when we have lambda functions https://github.com/edgedb/edgedb/issues/301 we might use them as a mechanism/syntax to write recursive queries.

Jun 03 '20 19:06 1st1

For cases where custom aggregation is truly needed Here is my use case: Intersection

type Library{
  multi link books -> Book;
}

I want to query books of multiple libraries: getBooks($libraryIds).

Jun 03 '20 23:06 haikyuu

I want to query books of multiple libraries

Here's one way of doing it:

SELECT Book
FILTER count(.<books[IS Library) > 1

Jun 04 '20 00:06 elprans

To select all books from a set of libraries given a set of library ids:

SELECT Book
FILTER .<books[IS Library].id IN array_unpack($libraryIds)

The idea here is that when you select a set of Book objects like that, it is a proper set already, there would be no duplicates in it.

Jun 04 '20 00:06 elprans

Thanks for the answer. I will describe my use case more clearly, i think it's not clear yet.

I run a complex query to calculate a set of available TimeSpots for a Class. And i created a class for that function SELECT getClassAvailability(<uuid>$classId, <int64>$maxIn). This function return a SET of TimeSpots.

Now i want to get the availability of multiple Classes. It's actually the TimeSpots that are shared between Classes. That is their Intersection.

Using the REDUCE operator, this would look like this:

FOR availabilities = {}, classId in array_unpack($classIds)
REDUCE (
  WITH classAvailabilities = getClassAvailability(classId, $maxIn)
  SELECT (
    (classAvailabilities FILTER classAvailabilities IN availabilities) 
      IF count(availabilities) > 0 
      ELSE classAvailabilities
)

Jun 04 '20 02:06 haikyuu

As far as i understand, this is not a recurrence relation since each element depends on the whole set and not only the previous element.

Can i express a query like this in the current state of EdgeQL?
Custom aggregate functions are powerful but a bit verbose in Postgres. But it's exactly what i used in my example:

State transition function: WITH classAvailabilities = getClassAvailability(classId, $maxIn)
Aggregate final function:

SELECT (
    (classAvailabilities FILTER classAvailabilities IN availabilities) 
      IF count(availabilities) > 0 
      ELSE classAvailabilities
)

Aggregate state type wasn't specified. Can it be inferred? so that we can use custom aggregate functions internally to implement REDUCE. Or should we specify it?

It's also possible to implement REDUCE before implementing custom aggregate functions: similarly to FOR statement. That would give us a starting point, feedback on the syntax (i personally think Postgres custom aggregate functions syntax is rather verbose, maybe lambda functions would help?) and performance.

Jun 04 '20 09:06 haikyuu

Interesting. We do have a plan to implement a more general partition-and-reduce statement: https://github.com/edgedb/edgedb/issues/104#issuecomment-344307260. It should be possible to express what you're looking for using that syntax:

WITH classId := array_unpack($classIds)
GROUP
    slotTable := (classId, getClassAvailability(classId, $maxIn))
USING
    slotId := slotTable.1
BY
    slotId
INTO
    partitionedSlotTable
UNION
    slotSet := (
        slotId := slotId, 
        classAvailabilityCount := count(partitionedSlotTable.0)
    )
FILTER
    # make sure the slot appears in the set of slots for every class,
    # i.e. an intersection of availability sets.
    slotSet.classAvailabilityCount = len($classIds)

This syntax is admittedly a tad verbose for this particular purpose due to it being very generic (we will likely iterate on it before implementing and/or add some syntax sugar around it).

Basically. because EdgeQL queries have to map 1:1 to SQL queries we are constrained by the type of semantics we can implement efficiently. Specifically, there's no real notion of iteration state in SQL, aside from the aforementioned recursive CTEs and aggregate functions. Another quirk is that those SQL mechanisms work on scalar expressions, not sets, which means your accumulator state must be scalar also.

Jun 04 '20 17:06 elprans

EdgeQL queries have to map 1:1 to SQL queries

Interesting, i didn't know that. Why is this? How do you make sure it's performant?

This syntax is admittedly a tad verbose for this particular purpose

Yes, it is. I spent a dozen of minutes trying to understand it ... but i couldn't. Also couldn't find it in the docs.

USING slotId := slotTable.1

Please note that getClassAvailability returns a set of spots. Does that make my use case not feasible using Group and reduce?

Another quirk is that those SQL mechanisms work on scalar expressions, not sets

Oh! So even if there were a way to define custom aggregate functions, it wouldn't work for my use case? 🤔

That leaves me with EdgePL:

write a function using EdgePL when it's out.
Implement my FOR ... REDUCE operator using EdgePL or a C defined function in postgres (for performance) under the hood

Jun 05 '20 16:06 haikyuu

Interesting, i didn't know that. Why is this?

Preserving a 1:1 query correspondence has a number of important advantages:

Performance. Running multiple SQL queries and reconstructing the result on the client is almost always worse than running an equivalent single query in Postgres. A good example of this are the benchmarks we published for Alpha 1.
Atomicity. If EdgeDB generated multiple SQL queries for an EdgeQL query, those queries would have to be put in an explicit transaction block to be self-consistent (especially important for mutation statements). This transaction would have to span multiple server roundtrips and can potentially be left open for a prolonged amount of time if an EdgeDB thread or task is preempted for a while, and when you have lots of transactions like this you increase the risk of lock contention, serializability conflicts etc.
Simplicity. 1:1 mapping is far easier to manage and reason about on the protocol level as well as observability.

How do you make sure it's performant?

Postgres is actually really good at handling complex queries. It's not the size of the query text that matters, it's the inherent runtime complexity of the operations in it.

Jun 06 '20 17:06 elprans

Yes, it is. I spent a dozen of minutes trying to understand it ... but i couldn't. Also couldn't find it in the docs.

That's because it isn't implemented yet (aside from being able to parse the syntax).

Please note that getClassAvailability returns a set of spots. Does that make my use case not feasible using Group and reduce?

The hypothetical query I posted accounts for that. Remember that in EdgeQL everything is a function over a Cartesian product of the arguments, including the tuple constructor, so

slotTable := (classId, getClassAvailability(classId, $maxIn))

is a set of (classId, timeSlot) tuples. We then partition the set by the slot and count the number of classIds (that would be SELECT count(classId) GROUP BY slot in SQL). We then check that the count of classId equals the number of elements in $classIds to only return the slots available for all classes requested (i.e. the intersection of slot sets).

Jun 06 '20 17:06 elprans

Thank you for your answers @elprans

This partition and reduce statement is really powerful.

we will likely iterate on it before implementing and/or add some syntax sugar around it

I think it's the closest thing to my original idea. Very interesting

Jun 06 '20 23:06 haikyuu

rfcs rfcs copied to clipboard

User defined aggregate functions

Example 1

rfcs
rfcs copied to clipboard