Computed relationships - First-class computed/virtual columns - a generalization of foreign-key-based resource embedding
Problem
This builds on some ideas we had earlier:
@steve-chavez understood resource embedding as functions:
So, currently we support embedding - or joining - with the embed(col1,col2) syntax. A join is an operator(according to relational algebra) and thus: a function.
And I gave a practical example, showing the similarities:
create function clients (projects) returns clients stable language sql as $$ select * from clients where id = $1.client_id $$;Because this function has only 1 argument of type projects it can be selected as a virtual column from the projects table. The url would look like this:
GET /projects?select=*,clients. The result is the same as the embedding we provide:GET /projects?select=*,clients(*).
@steve-chavez then suggested this as the way forward to do custom embeddings.
However, the current support for virtual columns is only implicit through field notation. This has several drawbacks for embedding, i.e. the following can only be done through fk-based embedding right now:
- ability to select only a subset of columns
- nesting
- performance: inlining of o2m and m2m queries (doesn't work with a real function, because the aggregation would prevent it)
Additionally it's currently impossible to use a virtual column that has the same name as a real column of the same relation - field notation will always choose the latter.
Basic implementation
To solve these issues, I propose to introduce first-class support for virtual columns: When creating the schema cache, we should parse function definitions to detect virtual columns and then make them callable the same way as resource embedding is currently done.
The good thing: This is actually much simpler than it sounds. We need to add relationships based on function definitions in:
https://github.com/PostgREST/postgrest/blob/8934190d4359bdf3597339fd4da4b4b250c8dbb9/src/PostgREST/DbStructure.hs#L96
Those relationships would have a cardinality of M2O or O2M depending whether the function returns multiple rows or not. This would generate the right query for aggregation here:
https://github.com/PostgREST/postgrest/blob/8934190d4359bdf3597339fd4da4b4b250c8dbb9/src/PostgREST/Query/QueryBuilder.hs#L52-L81
We would then extend the Cardinality type to hold either the FKConstraint or a QualifiedIdentifier pointing to the virtual column's function. This would allow addRels to add the type of embedding (virtual vs fk-based) to the ReadRequest in:
https://github.com/PostgREST/postgrest/blob/8934190d4359bdf3597339fd4da4b4b250c8dbb9/src/PostgREST/Request/DbRequestBuilder.hs#L133-L137
And based on this, we can then make sure that the actual SELECT from the embedded table is changed to call the function instead:
https://github.com/PostgREST/postgrest/blob/8934190d4359bdf3597339fd4da4b4b250c8dbb9/src/PostgREST/Query/QueryBuilder.hs#L36-L42
This currently creates a subquery like the following:
SELECT "test"."clients".* FROM "test"."clients" WHERE "test"."projects"."client_id" = "test"."clients"."id"
and we'd need to make this:
SELECT "virtual_column".* FROM "test"."virtual_column"("projects")
The result would be embedding based on virtual columns, with the full support of features we currently have: Nesting, selecting subfields, inlining for performance - there is basically no difference, afaict. It should even be possible to use hints, if we fill the Relationship fields in a smart way.
Consequences for Embedding in general
This will solve #591, #1179 and #1230 immediately and would provide a workaround solution for #1643, #1733, #1907 and #1984 - basically any issue with disambiguation, because you can always roll your own in a more explicit way.
Furthermore, we could remove some of the auto-detected fk-based relationships easily, that have recently posed problems with disambiguation in general. We could restrict M2M relationships to simple junction tables with only FK columns, we might decide to remove self-references completely, because those create a lot of challenges on their own. Those can all be replaced with explicit virtual column embedding.
Taking this one step further, we could add a config option db-fk-embedding, which defaults to true. Setting this to false would disable fk-based embedding entirely. This would remove our dependency on constraint names, at least partially, getting us one step closer to true schema isolation. We'd still expose those in error messages for failed constraints, I think, but at least our query API wouldn't depend on it. This would also allow users to be very explicit in which embeddings they want to allow - and would also guarantee, that further improvements in our view-parsing will not break existing queries when updating PostgREST due to new embedding possibilities resulting in a "can't disambiguate" response. This would solve #2070.
Further generalization for function calls
At this stage, ?select=*,func(*) is essentially a function call, which takes the current row as the only argument and allows selecting partial output.
Scalars
The next step would be to allow ?select=*,func() to be used to call functions which return a scalar instead of a composite type. This would solve the problem mentioned above, regarding real and virtual columns with the same name on a given table.
Aggregates
Given that we have a lot of similarities with those virtual columns and the field notation call syntax, we can think of both of the following to be the same (pseudo-code):
GET /projects?select=*,clients(*)
GET /projects?select=*,projects.clients
We don't need the table prefix, because we're calling the function in a a better way, but we can take a similar looking approach to pass columns instead of rows as the only argument to a function:
GET /employees?select=salary.sum()
This would be the same as SELECT sum(employees.salary) FROM employees, essentially.
This is very similar to this suggestion on how to call aggregate functions:
GET /employees?select=total_salaray&total_salaray=sum.salary
However, the new proposal has less boilerplate, less indirection and is more consistent with how we treat functions as virtual columns. It also avoids potential name conflicts with column filters on the operator etc.
It would still be possible to implement the FILTER or HAVING clauses:
GET /users?select=males:count(),non_betas:count()&males?gender=eq.male&non_betas?beta=is.true&non_betas?active=is.false
Even more
It should be possible to extend these concepts to:
- window functions
- chaining multiple functions:
?select=*,col.func1().func2(), which looks like a syntax that could be used for #1233, too.
First-class computed/virtual columns are a sure win!
Taking this one step further, we could add a config option db-fk-embedding, which defaults to true. Setting this to false would disable fk-based embedding entirely
Likely better to discuss this on a different issue, but would thatdb-fk-embedding=false disable automatic resource embedding entirely? Would all embeds require a function?
Likely better to discuss this on a different issue, but would that
db-fk-embedding=falsedisable automatic resource embedding entirely? Would all embeds require a function?
Yes, yes and yes.
Another thing that's good about this one is that it could be used to override our detected relationships as well, so it provides a way forward from breaking changes in embedding. For example, if we start detecting one-to-one relationships users could go back to redefining as a one-to-many.
It's much better than doing some postgREST specific stuff(like on this proposal) to override relationships because it reduces user cognitive load.
Those relationships would have a cardinality of M2O or O2M depending whether the function returns multiple rows or not
So a function that RETURNS SETOF tbl vs just RETURNS tbl. Cool, I'll give this a try.
It should even be possible to use hints, if we fill the Relationship fields in a smart way.
IIUC, this would not need hints because the virtual column would be unambiguous as it's explicitly defining the join with the table and columns.
We would then extend the Cardinality type to hold either the FKConstraint or a QualifiedIdentifier pointing to the virtual column's function.
data Cardinality
= O2M {relConnection :: Either FKConstraint ComputedColumn}
-- ^ one-to-many
| M2O {relConnection :: Either FKConstraint ComputedColumn}
-- ^ many-to-one
| M2M Junction
-- ^ many-to-many
deriving (Eq, Ord, Generic, JSON.ToJSON)
type ComputedColumn = QualifiedIdentifier
newtype FKConstraint = FKConstraint Text [(FieldName, FieldName)]
-- | Relationship between two tables.
data Relationship = Relationship
{ relTable :: QualifiedIdentifier
, relForeignTable :: QualifiedIdentifier
, relIsSelf :: Bool -- ^ Whether is a self relationship
, relCardinality :: Cardinality
}
deriving (Eq, Ord, Generic, JSON.ToJSON)
The target name, the relTable and relForeginTable can be inferred from the function as:
create function target_name(relTable) returns relForeignTable
Kinda makes me think that targetName should be an attribute of Relationship instead of computing it dinamically in findRel
Came up with this query:
-- Adding this on the fixtures:
-- CREATE FUNCTION manual_client_rel(test.projects) RETURNS SETOF test.clients AS $$
-- SELECT * FROM test.clients WHERE id = $1.client_id;
--$$ LANGUAGE sql STABLE ROWS 1;
with
all_relations as (
select reltype
from pg_class
where relkind in ('v','r','m','f','p')
)
select
p.pronamespace::regnamespace as schema,
p.proname as name,
p.proargtypes[0]::regtype as rel_table,
p.prorettype::regtype as rel_ftable,
p.prorows = 1 as single_row
FROM pg_proc p
WHERE
p.pronargs = 1
and p.proargtypes[0] in (select reltype from all_relations)
and p.prorettype in (select reltype from all_relations);
-[ RECORD 1 ]-+------------------
schema | test
name | manual_client_rel
rel_table | test.projects
rel_ftable | test.clients
single_row | t
The result would be embedding based on virtual columns, with the full support of features we currently have: Nesting, selecting subfields, inlining for performance
Perf is great as mentioned above, I get a cheaper plan when calling the computed relationship:
-- our generated query
explain analyze WITH pgrst_source AS (
SELECT "test"."projects".*, row_to_json("projects_clients".*) AS "clients"FROM "test"."projects"
LEFT JOIN LATERAL (
SELECT "test"."clients".*
FROM "test"."clients" WHERE "test"."projects"."client_id" = "test"."clients"."id" ) AS "projects_clients" ON TRUE)
SELECT
pg_catalog.count(_postgrest_t) AS page_total,
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM ( SELECT * FROM pgrst_source ) _postgrest_t;
Aggregate (cost=75.74..75.76 rows=1 width=112) (actual time=0.114..0.117 rows=1 loops=1)
-> Hash Left Join (cost=38.58..63.74 rows=1200 width=72) (actual time=0.045..0.054 rows=5 loops=1)
Hash Cond: (projects.client_id = clients.id)
-> Seq Scan on projects (cost=0.00..22.00 rows=1200 width=40) (actual time=0.013..0.015 rows=5 loops=1)
-> Hash (cost=22.70..22.70 rows=1270 width=36) (actual time=0.019..0.020 rows=2 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 17kB
-> Seq Scan on clients (cost=0.00..22.70 rows=1270 width=36) (actual time=0.010..0.013 rows=2 loops=1)
Planning Time: 0.416 ms
Execution Time: 0.198 ms
-- using a computed relationship
CREATE FUNCTION manual_client_rel(test.projects) RETURNS test.clients AS $$
SELECT * FROM test.clients WHERE id = $1.client_id;
$$ LANGUAGE sql STABLE;
explain analyze WITH pgrst_source AS (
SELECT x.*, row_to_json("projects_clients".*) AS "clients"FROM "test"."projects" x
LEFT JOIN LATERAL (
SELECT "manual_client_rel".*
FROM "manual_client_rel"(x)) AS "projects_clients" ON TRUE)
SELECT
pg_catalog.count(_postgrest_t) AS page_total,
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM ( SELECT * FROM pgrst_source ) _postgrest_t;
Aggregate (cost=58.25..58.28 rows=1 width=112) (actual time=0.359..0.361 rows=1 loops=1)
-> Nested Loop Left Join (cost=0.25..46.25 rows=1200 width=72) (actual time=0.209..0.290 rows=5 loops=1)
-> Seq Scan on projects x (cost=0.00..22.00 rows=1200 width=104) (actual time=0.021..0.029 rows=5 loops=1)
-> Function Scan on manual_client_rel (cost=0.25..0.26 rows=1 width=32) (actual time=0.050..0.050 rows=1 loops=5)
Planning Time: 0.426 ms
Execution Time: 0.437 ms
Of course the row samples are too small(and somehow the execution time is always higher for the computed relationships) but still shows the direction is good.
Perf is great as mentioned above, I get a cheaper plan when calling the computed relationship:
Wait a moment, you're mixing a few things up. What ultimately matters for performance is execution time. And that is clearly worse with the computed relationship. Costs are always only an estimation. Performance is worse, because the query does not inline the function body at the moment. You can tell this from the explain for the second query - it has a "function scan on manual_client_rel" in it. As long as you can see the function name in the explain output, it's not inlined, afaik. So it's not surprising that the cost is estimated lower here, because PG does not look inside the function at all and just assumes the default cost for a function.
According to the inlining requirements, the reason why it's currently not doing that is:
A table function call will be inlined if all of the following conditions are met (Source Code): ...
- the function is declared RETURNS SETOF or RETURNS TABLE
Currently your helper function only RETURNS test.clients. But it needs to RETURNS SETOF test.clients.
How do the query plans look like with that change?
the function is declared RETURNS SETOF or RETURNS TABLE
Ah, didn't know that was an inlining requirement. Hm, so that means we cannot use SETOF to denote a many-to-one relationship. And RETURNS TABLE arleady implies a SETOF.
Those relationships would have a cardinality of M2O or O2M depending whether the function returns multiple rows or not
Did you thought of using SETOF for multiple rows? Or is there another way?
How do the query plans look like with that change?
explain analyze WITH pgrst_source AS (
SELECT x.*, row_to_json("projects_clients".*) AS "clients"FROM "test"."projects" x
LEFT JOIN LATERAL (
SELECT "manual_client_rel".*
FROM "manual_client_rel"(x)) AS "projects_clients" ON TRUE)
SELECT
pg_catalog.count(_postgrest_t) AS page_total,
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM ( SELECT * FROM pgrst_source ) _postgrest_t;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=75.74..75.75 rows=1 width=40) (actual time=0.105..0.108 rows=1 loops=1)
-> Hash Left Join (cost=38.58..63.74 rows=1200 width=72) (actual time=0.043..0.052 rows=5 loops=1)
Hash Cond: (x.client_id = clients.id)
-> Seq Scan on projects x (cost=0.00..22.00 rows=1200 width=40) (actual time=0.012..0.014 rows=5 loops=1)
-> Hash (cost=22.70..22.70 rows=1270 width=36) (actual time=0.018..0.019 rows=2 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 17kB
-> Seq Scan on clients (cost=0.00..22.70 rows=1270 width=36) (actual time=0.010..0.013 rows=2 loops=1)
Planning Time: 0.480 ms
Execution Time: 0.182 ms
Execution time now looks about the same as the query without the computed relationship. "Function scan" is also gone.
Did you thought of using SETOF for multiple rows? Or is there another way?
There's the ROWS parameter. Maybe we can use a ROWS 1 for this case.
ROWS result_rows A positive number giving the estimated number of rows that the planner should expect the function to return. This is only allowed when the function is declared to return a set. The default assumption is 1000 rows.
Edit: That's it. It works fine with inlining too. I should correct the pg_proc query above to use prorows instead of proretset.
So far this has been working great. I mean it even works for rpc:
CREATE FUNCTION test.clients(test.projects) RETURNS SETOF test.clients AS $$
SELECT * FROM test.clients WHERE id = $1.client_id;
$$ LANGUAGE sql STABLE ROWS 1;
curl 'localhost:3000/rpc/getallprojects?select=*,clients(id,name)'
[{"id":1,"name":"Windows 7","client_id":1,"clients":{"id":1,"name":"Microsoft"}},
{"id":2,"name":"Windows 10","client_id":1,"clients":{"id":1,"name":"Microsoft"}},
{"id":3,"name":"IOS","client_id":2,"clients":{"id":2,"name":"Apple"}},
{"id":4,"name":"OSX","client_id":2,"clients":{"id":2,"name":"Apple"}},
generated query
WITH pgrst_source AS (
SELECT "getallprojects".*
FROM "test"."getallprojects"())
SELECT
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM (
SELECT
"projects".*,
row_to_json("projects_clients".*) AS "clients"
FROM "pgrst_source" AS "projects"
LEFT JOIN LATERAL (
SELECT "clients"."id", "clients"."name" FROM "test"."clients"("projects")
) AS "projects_clients" ON TRUE
) _postgrest_t;
Other thing I noticed is that thanks to overloaded functions, we can use the same function name to define relationships for many tables:
CREATE FUNCTION test.clients(test.projects) RETURNS SETOF test.clients
-- curl 'localhost:3000/projects?select=*,clients(id,name)'
CREATE FUNCTION test.clients(test.companies) RETURNS SETOF test.clients
-- curl 'localhost:3000/companies?select=*,clients(id,name)'
CREATE FUNCTION test.clients(test.others) RETURNS SETOF test.clients
-- curl 'localhost:3000/others?select=*,clients(id,name)'
There is one problem though. It's not working for mutations.
curl 'localhost:3000/projects?select=id,name,clients(id,name)' -H "Prefer: return=representation" -H "Content-Type: application/json" -d @- <<JSON
{"id":6,"name":"New Project","client_id":2}
JSON
{"code":"42846","details":"Input has too few columns.","hint":null,"message":"cannot cast type record to projects"}
WITH pgrst_source AS (
WITH
pgrst_payload AS (SELECT '{"id":6,"name":"New Project","client_id":2}'::json AS json_data),
pgrst_body AS ( SELECT CASE WHEN json_typeof(json_data) = 'array' THEN json_data ELSE json_build_array(json_data) END AS val FROM pgrst_payload)
INSERT INTO "test"."projects"("client_id", "id", "name")
SELECT "client_id", "id", "name" FROM json_populate_recordset (null::"test"."projects", (SELECT val FROM pgrst_body)) _
RETURNING "test"."projects"."id", "test"."projects"."name")
SELECT
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM (
SELECT
"projects"."id",
"projects"."name",
row_to_json("projects_clients".*) AS "clients"
FROM "pgrst_source" AS "projects"
LEFT JOIN LATERAL (
SELECT "clients"."id", "clients"."name"
FROM "test"."clients"("projects")
) AS "projects_clients" ON TRUE
) _postgrest_t;
ERROR: cannot cast type record to test.projects
LINE 18: FROM "test"."clients"("projects")
^
DETAIL: Input has too few columns.
At first I thought it was because of the CTE but the RPC query above proves it's not.
It's actually because the RETURNING "test"."projects"."id", "test"."projects"."name" is making the query lose its resulting type. Now if that's changed to RETURNING * it works. I'd need to check if it's possible to modify our queries to always use RETURNING * and select the columns on the next SELECT query.
Found this trick that allows to cast like FROM "test"."clients"("projects"::text::test.projects) but that doesn't help because of the row_to_json, it needs the exact columns.
I'd need to check if it's possible to modify our queries to always use RETURNING * and select the columns on the next SELECT query.
You already have that in the outer SELECT "projects"."id", "projects"."name", right?
So it's just a matter of making it RETURNING * in that case?
I suggest to only make it that way if any computed relationship is used... otherwise, we'd lose the ability to do mutations with select privileges on only a subset of columns, right?
I suggest to only make it that way if any computed relationship is used... otherwise, we'd lose the ability to do mutations with select privileges on only a subset of columns, right?
Yeah, though we'd lose the same ability for computed relationships. Say if you want to insert a row of "projects" but don't have privilege on a new column "cost" and you'd do POST /projects?select=id,name,clients(id,name), since RETURNING * is done then you should get an error for "cost".
I found another way to fix the type, but it would require the schema cache; basically we need to fill the missing columns in the RETURNING with null:<type>.
WITH pgrst_source AS (
WITH
pgrst_payload AS (SELECT '{"id":6,"name":"New Project","client_id":2}'::json AS json_data),
pgrst_body AS ( SELECT CASE WHEN json_typeof(json_data) = 'array' THEN json_data ELSE json_build_array(json_data) END AS val FROM pgrst_payload)
INSERT INTO "test"."projects"("client_id", "id", "name")
SELECT "client_id", "id", "name" FROM json_populate_recordset (null::"test"."projects", (SELECT val FROM pgrst_body)) _
RETURNING "test"."projects"."id", "test"."projects"."name", null::int)
SELECT
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM (
SELECT
"projects"."id",
"projects"."name",
row_to_json("projects_clients".*) AS "clients"
FROM "pgrst_source" AS "projects"
LEFT JOIN LATERAL (
SELECT "clients"."id", "clients"."name"
FROM "test"."clients"("projects")
) AS "projects_clients" ON TRUE
) _postgrest_t;
An alternative that doesn't require the schema cache could also be:
WITH pgrst_source AS (
WITH
pgrst_payload AS (SELECT '{"id":6,"name":"New Project","client_id":2}'::json AS json_data),
pgrst_body AS ( SELECT CASE WHEN json_typeof(json_data) = 'array' THEN json_data ELSE json_build_array(json_data) END AS val FROM pgrst_payload)
INSERT INTO "test"."projects"("client_id", "id", "name")
SELECT "client_id", "id", "name" FROM json_populate_recordset (null::"test"."projects", (SELECT val FROM pgrst_body)) _
RETURNING json_build_object('id', "test"."projects"."id", 'name', "test"."projects"."name"))
SELECT
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM (
SELECT
"projects"."id",
"projects"."name",
row_to_json("projects_clients".*) AS "clients"
FROM json_populate_record(null::test.projects, (SELECT * FROM "pgrst_source")) AS "projects"
LEFT JOIN LATERAL (
SELECT "clients"."id", "clients"."name"
FROM "test"."clients"("projects")
) AS "projects_clients" ON TRUE
) _postgrest_t;
But it doesn't feel right to use it for all returnings, the json casting could mess up some types. Using json functions on the body(defined by content-type) and the response(defined by accept) is fine but not for the generic returning.
Besides the type casting, there's another problem with mutations. For our detected relationships, we enhance the RETURNING with the foreign keys so an embedding can succeed:
https://github.com/PostgREST/postgrest/blob/3950b3621f24870594bcd8c9ea3822133a9f50d8/src/PostgREST/Request/DbRequestBuilder.hs#L350-L377
However with computed relationships the join conditions are basically a private detail, we cannot know for sure which conditions(FKs included) we need for the embedding to succeed.
That leaves the RETURNING * as the only way forward.
Should be noted that computed relationships do work on mutations when specifying * on select:
curl 'localhost:3000/projects?select=*,clients(id,name)' -H "Prefer: return=representation" -H "Content-Type: application/json" -d @- <<JSON
{"id":6,"name":"New Project","client_id":2}
JSON
[{"id":6,"name":"nom","client_id":2,"clients":{"id":2,"name":"Apple"}}]
Since STABLE/IMMUTABLE is a condition for inlining, I think we should enforce that computed relationships are marked as STABLE or otherwise they'll not be detected. (IMMUTABLE doesn't make sense here since that's for pure functions)
However with computed relationships the join conditions are basically a private detail, we cannot know for sure which conditions(FKs included) we need for the embedding to succeed.
That leaves the
RETURNING *as the only way forward.
Yes, came to the same conclusion.
Since STABLE/IMMUTABLE is a condition for inlining, I think we should enforce that computed relationships are marked as STABLE or otherwise they'll not be detected. (IMMUTABLE doesn't make sense here since that's for pure functions)
Please don't. I can easily see computed relationships that could be marked VOLATILE or IMMUTABLE on purpose. For example an "embed a random row" could be VOLATILE. Not being able to inline might not be a show-stopper in this case.
Note: computed relationships will not work for "relational inserts" later, but we have another way on https://github.com/PostgREST/postgrest/issues/818#issuecomment-1212432197.