`@fold` support for SQL
Here's the initial plan for how to handle @fold in SQL. May need additional considerations for interactions with other directives.
https://www.db-fiddle.com/f/cD9aczVx5B8gRiAXavWBhB/3
Let me know what you think @obi1kenobi @bojanserafimov @pmantica1
I think this is a step in the right direction, thank you for putting it together. A few observations about opportunities for further iteration:
-
FULL JOINseems like it might not be the right thing there. I think all of those joins should be outer only on one side -- the side that isn't being folded on. -
I'm also confused about the
DISTINCTin thearray_aggcalls. The comment says something about eliminating duplicates, but I'm not sure what duplicates there are, and why they are being eliminated. Essentially, I'm unable to match up thisDISTINCTto any part of the compiler spec for@fold. -
Also worth considering is how filters get applied, and how
_x_countoutputting and filtering would work.
In the end DISTINCT was unnecessary (query 5) since each array aggregation will occur in a sub-select statement. It was join an intermediate thing I tried since the first query produces dupes.
Will look into what type of JOIN to use and filtering.
Here's the plan for how to handle filters and _x_count:
https://www.db-fiddle.com/f/4EXkttLVoWTdbMZQBsNDDB/1
The basic idea is to have @fold and _x_count be a part of a sub-SELECT statement that is joined with a LEFT JOIN. These fields are then COALESCE-ed to [] or 0, respectively, if a null was returned. Any @filter operations will part of a HAVING clause, which will require that the query have a GROUP BY clause on all fields not used in a @filter.
Next I will be looking at traversals within @fold.
I'm a bit worried I might have misunderstood the last bit around the HAVING clause and which fields go in the GROUP BY – not sure why the fields in the GROUP BY change on the basis of what we're filtering on. Imagine we define a filter that is always satisfied – the result should be the same whether or not that filter was there, but I am concerned that by changing which fields go in the GROUP BY we might get different results.
Also, does the above cause any issues if we try to both filter and output a given field?
Yeah, that's my bad in the explanation; this part
which will require that the query have a GROUP BY clause on all fields not used in a @filter.
is wrong. It should be that we GROUP BY all fields in the top most SELECT statement, including those used by a @filter directive. Does that make sense?
Good job covering all the edge cases :)
Instead of using subqueries to eliminate NULLs, can't we use array_remove(array_agg(Neighborhoods.name), NULL)? This doesn't work in mssql though, and probably most other dialects.
Do we have a plan on how to fold in mssql, or are we targetting postgres first?
If array_agg and subqueries both work across MSSQL and Postgres, any reason to move to array_remove? Seems to me like we should aim to maximize compatibility for our generated SQL, and array_remove doesn't seem that tempting to me here.
Actually, I see that the subquery is also needed for the interaction of @filter inside the fold with _x_count. Let's keep it.
Yeah, that's my bad in the explanation; this part
which will require that the query have a GROUP BY clause on all fields not used in a @filter.
is wrong. It should be that we
GROUP BYall fields in the top mostSELECTstatement, including those used by a@filterdirective. Does that make sense?
I'm guessing we have to GROUP BY all fields in the top most select because those fields are not aggregated, but used together with fields that are aggregated.
My concern is that this GROUP BY can do some unintended deduplication. Consider a situation where there's two results that have the exact same results but come from two different vertices, differing only in fields not selected by the query. In that case we want to see duplicate results in the output. If it's not clear what I'm talking about, I can try to make a dbfiddle with this scenario.
One solution to this problem would be to also group by the primary key of every vertex outside the fold scope, even if it is not used.
Yeah @chewselene and I were just looking at this as well, and I think we ended up removing the outer GROUP BY since it didn't seem to be necessary with the subqueries approach. I updated the fiddle to add a new SQL query with the GROUP BY removed.
Here's a new fiddle for how traversals inside a fold would work:
https://www.db-fiddle.com/f/d5fdweFyvHQKCdVDq8BeWe/2
I ended up using array_remove. Not sure if it is import to not use that since we already can't use ARRAY_AGG in older versions of MSSQL. What do you think?
I think the array_remove is problematic for a different reason: it will remove legitimate null entries, and break the "all the output lists from the same fold are the same length" invariant. Consider the situation where you have two outputs from a given @fold, one of which is on a nullable field. I think this would make it possible to construct situations where I get an unexpected result for the two outputs, something like:
{
"nullable": [1, 2, 3],
"non_nullable": ["a", "b", "c", "d", "e"]
}
Good point, @obi1kenobi ! Updated version here: https://www.db-fiddle.com/f/d5fdweFyvHQKCdVDq8BeWe/3 Let me know what you think.
Yeah, that seems to work! One thing I'm curious about: why do an INNER JOIN with a nested selection, instead of joining to the Streets table directly and turning the WHERE inside it into a HAVING on the outside?
I don't think HAVING is what we want because we're not using an aggregate to filter on, but you are right that it was needlessly complex - see example 5 on https://www.db-fiddle.com/f/d5fdweFyvHQKCdVDq8BeWe/4 for the simplified version. I think the only aggregate we want to filter on is COUNT, which we've already taken care of in the final WHERE clause so I don't think we'll end up having to use HAVING anywhere.
Yup, you're totally right – I mixed up which keyword did what in aggregations :) If only there were a compiler that could help me use the right thing... 😝
This looks great to me! @bojanserafimov what do you think?
https://www.db-fiddle.com/f/4EXkttLVoWTdbMZQBsNDDB/3
I've added a Neighborhood with name = NULL. In the first query, it appears as null inside the array (which is correct), but it's not counted in the _x_count.
Shouldn't we implement _x_count using count(*) to solve this? It's also simpler to emit count(*) from the LocalField than it is to emit count(some_column).
I was assuming the name was the primary key of Neighborhood and therefore non-nullable. count(*) seems like a good way to avoid the problem entirely though, even in multi-column pkey situations.
PostgreSQL and MSSQL support is merged into master. Leaving this open for the sake of the other flavors of SQL for now, though I may open a separate issue specifically for those in the future.