graphql-compiler icon indicating copy to clipboard operation
graphql-compiler copied to clipboard

`@fold` support for SQL

Open chewselene opened this issue 6 years ago • 19 comments

Here's the initial plan for how to handle @fold in SQL. May need additional considerations for interactions with other directives.

https://www.db-fiddle.com/f/cD9aczVx5B8gRiAXavWBhB/3

Let me know what you think @obi1kenobi @bojanserafimov @pmantica1

chewselene avatar Sep 03 '19 16:09 chewselene

I think this is a step in the right direction, thank you for putting it together. A few observations about opportunities for further iteration:

  • FULL JOIN seems like it might not be the right thing there. I think all of those joins should be outer only on one side -- the side that isn't being folded on.

  • I'm also confused about the DISTINCT in the array_agg calls. The comment says something about eliminating duplicates, but I'm not sure what duplicates there are, and why they are being eliminated. Essentially, I'm unable to match up this DISTINCT to any part of the compiler spec for @fold.

  • Also worth considering is how filters get applied, and how _x_count outputting and filtering would work.

obi1kenobi avatar Sep 03 '19 17:09 obi1kenobi

In the end DISTINCT was unnecessary (query 5) since each array aggregation will occur in a sub-select statement. It was join an intermediate thing I tried since the first query produces dupes.

Will look into what type of JOIN to use and filtering.

chewselene avatar Sep 03 '19 18:09 chewselene

Here's the plan for how to handle filters and _x_count:

https://www.db-fiddle.com/f/4EXkttLVoWTdbMZQBsNDDB/1

The basic idea is to have @fold and _x_count be a part of a sub-SELECT statement that is joined with a LEFT JOIN. These fields are then COALESCE-ed to [] or 0, respectively, if a null was returned. Any @filter operations will part of a HAVING clause, which will require that the query have a GROUP BY clause on all fields not used in a @filter.

Next I will be looking at traversals within @fold.

chewselene avatar Sep 04 '19 15:09 chewselene

I'm a bit worried I might have misunderstood the last bit around the HAVING clause and which fields go in the GROUP BY – not sure why the fields in the GROUP BY change on the basis of what we're filtering on. Imagine we define a filter that is always satisfied – the result should be the same whether or not that filter was there, but I am concerned that by changing which fields go in the GROUP BY we might get different results.

Also, does the above cause any issues if we try to both filter and output a given field?

obi1kenobi avatar Sep 04 '19 15:09 obi1kenobi

Yeah, that's my bad in the explanation; this part

which will require that the query have a GROUP BY clause on all fields not used in a @filter.

is wrong. It should be that we GROUP BY all fields in the top most SELECT statement, including those used by a @filter directive. Does that make sense?

chewselene avatar Sep 04 '19 17:09 chewselene

Good job covering all the edge cases :)

Instead of using subqueries to eliminate NULLs, can't we use array_remove(array_agg(Neighborhoods.name), NULL)? This doesn't work in mssql though, and probably most other dialects.

Do we have a plan on how to fold in mssql, or are we targetting postgres first?

bojanserafimov avatar Sep 04 '19 21:09 bojanserafimov

If array_agg and subqueries both work across MSSQL and Postgres, any reason to move to array_remove? Seems to me like we should aim to maximize compatibility for our generated SQL, and array_remove doesn't seem that tempting to me here.

obi1kenobi avatar Sep 04 '19 22:09 obi1kenobi

Actually, I see that the subquery is also needed for the interaction of @filter inside the fold with _x_count. Let's keep it.

bojanserafimov avatar Sep 05 '19 15:09 bojanserafimov

Yeah, that's my bad in the explanation; this part

which will require that the query have a GROUP BY clause on all fields not used in a @filter.

is wrong. It should be that we GROUP BY all fields in the top most SELECT statement, including those used by a @filter directive. Does that make sense?

I'm guessing we have to GROUP BY all fields in the top most select because those fields are not aggregated, but used together with fields that are aggregated.

My concern is that this GROUP BY can do some unintended deduplication. Consider a situation where there's two results that have the exact same results but come from two different vertices, differing only in fields not selected by the query. In that case we want to see duplicate results in the output. If it's not clear what I'm talking about, I can try to make a dbfiddle with this scenario.

One solution to this problem would be to also group by the primary key of every vertex outside the fold scope, even if it is not used.

bojanserafimov avatar Sep 05 '19 17:09 bojanserafimov

Yeah @chewselene and I were just looking at this as well, and I think we ended up removing the outer GROUP BY since it didn't seem to be necessary with the subqueries approach. I updated the fiddle to add a new SQL query with the GROUP BY removed.

obi1kenobi avatar Sep 05 '19 17:09 obi1kenobi

Here's a new fiddle for how traversals inside a fold would work:

https://www.db-fiddle.com/f/d5fdweFyvHQKCdVDq8BeWe/2

I ended up using array_remove. Not sure if it is import to not use that since we already can't use ARRAY_AGG in older versions of MSSQL. What do you think?

chewselene avatar Sep 10 '19 20:09 chewselene

I think the array_remove is problematic for a different reason: it will remove legitimate null entries, and break the "all the output lists from the same fold are the same length" invariant. Consider the situation where you have two outputs from a given @fold, one of which is on a nullable field. I think this would make it possible to construct situations where I get an unexpected result for the two outputs, something like:

{
  "nullable": [1, 2, 3],
  "non_nullable": ["a", "b", "c", "d", "e"]
}

obi1kenobi avatar Sep 16 '19 14:09 obi1kenobi

Good point, @obi1kenobi ! Updated version here: https://www.db-fiddle.com/f/d5fdweFyvHQKCdVDq8BeWe/3 Let me know what you think.

chewselene avatar Sep 16 '19 17:09 chewselene

Yeah, that seems to work! One thing I'm curious about: why do an INNER JOIN with a nested selection, instead of joining to the Streets table directly and turning the WHERE inside it into a HAVING on the outside?

obi1kenobi avatar Sep 16 '19 17:09 obi1kenobi

I don't think HAVING is what we want because we're not using an aggregate to filter on, but you are right that it was needlessly complex - see example 5 on https://www.db-fiddle.com/f/d5fdweFyvHQKCdVDq8BeWe/4 for the simplified version. I think the only aggregate we want to filter on is COUNT, which we've already taken care of in the final WHERE clause so I don't think we'll end up having to use HAVING anywhere.

chewselene avatar Sep 16 '19 17:09 chewselene

Yup, you're totally right – I mixed up which keyword did what in aggregations :) If only there were a compiler that could help me use the right thing... 😝

This looks great to me! @bojanserafimov what do you think?

obi1kenobi avatar Sep 16 '19 20:09 obi1kenobi

https://www.db-fiddle.com/f/4EXkttLVoWTdbMZQBsNDDB/3 I've added a Neighborhood with name = NULL. In the first query, it appears as null inside the array (which is correct), but it's not counted in the _x_count.

Shouldn't we implement _x_count using count(*) to solve this? It's also simpler to emit count(*) from the LocalField than it is to emit count(some_column).

bojanserafimov avatar Sep 17 '19 14:09 bojanserafimov

I was assuming the name was the primary key of Neighborhood and therefore non-nullable. count(*) seems like a good way to avoid the problem entirely though, even in multi-column pkey situations.

obi1kenobi avatar Sep 17 '19 14:09 obi1kenobi

PostgreSQL and MSSQL support is merged into master. Leaving this open for the sake of the other flavors of SQL for now, though I may open a separate issue specifically for those in the future.

obi1kenobi avatar Nov 19 '19 22:11 obi1kenobi