Queries with multiple o2m joins are expensive
Joist's em.find API makes it easy to "filter by children", i.e. something like:
em.find(Author, { books: { title: "b1" } });
Will find Author have that "at least one" book with a title of b1.
Currently, Joist implements this by LEFT JOIN-ing into books and then DISTINCT-ing on the a.id.
This means we only return 1 author row over the wire (good), but only after asking Postgres to dedupe the author+books rows.
Usually this is fine, until we filter on two o2m collections:
em.find(Author, { books: { title: "b1" }, comments: { text: "c1" } });
This is saying find Authors that have 1 book with "title=b1" and 1 comment with "text=c1".
However, when we LEFT JOIN books and LEFT JOIN comments, postgres creates 1 tuple per author x book x comment combination.
When there many children books/comments, this can create a very large result set that postgres has to create in-memory -- only to DISTINCT it away to return "just the author".
Ideally we'd like evaluate this query by still joining into books and comments, but having those joins produce only a single row, effectively a single roll up/aggregate row that presents "all of this author's books" and "all of this author's comments", which will prevent the combinatorial explosion and mean we don't have to DISTINCT anymore.
An initial fix of this issue tried to use LATERAL JOINs that would:
CROSS JOIN LATERALinto child1- Use an aggregate query like
count(*) as _to ensure 1-and-only-1 row was returned - For any complex conditions, like "at least one child title is
b1", useBOOL_OR(...condition...), along side the count
The idea was that BOOL_OR would both "eval this condition against children" but also "only return the aggregate yes/no matched" instead of the N rows that cause our join explosion.
However, this approach had issues with complex conditions (i.e. not inline conditions) that wanted to do ANDs, i.e. this em.find:
const b = alias(Book);
const where = { books: b } satisfies AuthorFilter;
const authors = await em.find(Author, where, {
conditions: { and: [b.title.like("fo%"), b.title.like("%ar")] },
});
Would generate the SQL:
SELECT a.* FROM authors AS a
CROSS JOIN LATERAL (
SELECT count(*) as _,
BOOL_OR(b.title LIKE $1) as _b_title_0,
BOOL_OR(b.title LIKE $2) as _b_title_1
FROM books AS b
WHERE a.id = b.author_id AND b.deleted_at IS NULL
) AS b
WHERE a.deleted_at IS NULL AND b._b_title_0 AND b._b_title_1
ORDER BY a.id ASC
This "seems fine", but the semantics of this BOOL_OR query are:
- Did "any child" have a title like $1,
- Did "any child" have a title like $2,
Vs. the old (correct) semantics:
- Did "a specific child have both title like $1 and title like $2"
To fix this, we need these two conditions to be pushed down together, i.e. something like:
CROSS JOIN LATERAL (
SELECT count(*) as _,
-- we've got both `AND`s being evaled against "the same book"
BOOL_OR(b.title LIKE $1 AND b.title LIKE $2) as _combined
FROM books ...
) AS b
WHERE b._combined
However, it's not clear how this would work in more complex conditions like:
em.find(Author, { as: a, books: b }, conditions: {
and: [
b.title.like("fo%"),
-- now the "two title checks" are in different clauses
{ or: [b.title.like("%ar"), a.firstName.eq("a1") ] },
]
});
We also need to handle queries across levels, i.e.:
em.find(Author, { books: { as: b, reviews: r } }, conditions: {
or: [
-- does a specific book have title t1 + any of t1's children have a rating=3
{ and: [b.title.eq("t1"), br.rating.eq(3)] },
-- does a specific book have title t2 + any of t2's children have a rating=4
{ and: [b.title.eq("t2"), br.rating.eq(4) ] },
]
});
Where the tuples of:
[book1, [book1's ratings][book2, [book2's ratings][book3, [book3's ratings]
Need to be independently evaluated as a whole.
Ironically/unfortunately this materialization is exactly what the OUTER JOINs were doing.
Maybe the worst query (and one I think is on the table for "we won't support" :thinking: ) is an OR condition that looks at separate children collections, i.e.:
em.find(Author, { books: b, comments: c } }, conditions: {
or: [
{ and: [b.title.eq("t1"), c.text.eq("good")] },
{ and: [b.title.eq("t2"), c.text.eq("bad") ] },
]
});
Where the query really is asking for "every book x every comment" combination to generated and eval'd through the complex condition (...or is it, maybe it's validly asking "do you have any child book with title t1 & any child comment with text "good"--it seems like that would be possible to eval w/o materializing the book x comment combinations).
I think we either need to decide whether:
-
The prior
OUTER JOINs +DISTINCTapproach was "fine" in terms of easily/naturally supporting any complex condition the user happened to write (even this last category of "cross child relations") and we should keep using it, with the know issue of "yes, you'll eventually have to write hand-written SQL for queries that get slow" -
We can introduce some restrictions (i.e. no "cross child complex conditions") to complex conditions + find a way to rewrite "allowed within child/nested children conditions" to a query structure that preserves the previous semantics
-
Some of our queries are just fundamentally past the point of "can be expressed in an ORM DSL and do the right / most performant thing".
What we want to ensure is that, within the query's complex condition, any child alias is "consistently evaluated", i.e. given rows like:
books id=b:1 title=b1 notes=b1books id=b:2 title=b2 notes=n2comments id=c:1 text=c1- `comments id=c:2 text=c2
That an em.find(Author, {}, conditions: ...) would not "mistakenly pass" by having some of it's condition evaluated against book 1 / title=b1 but another part evaluated against book 2 / notes=n2, when fundamentally there isn't "a singular book" that has both of those conditions true.
Note that within a specific condition, like and: [b.title.eq(b1), b.notes.eq(n2), we can push both of those conditions down together.
The wrinkle would be more esoteric conditions, that involved nested AND/ORs, albeit if they involve "ORs", then it might be fine to use "the other book"? :thinking:
Something like:
em.find(
Author,
{ books: b, comments: c },
conditions: {
or: [
{ and: [b.title.eq("b1"), { or: [b.notes.eq("n2"), c.text.eq("c3")] } },
{ and: [b.title.eq("b2"), { or: [b.notes.eq("n1"), c.text.eq("c3")] } },
]
},
);
Would, given the above rows, not pass because there isn't a singular b that has both title=b1 and notes=n2 or title=b2 and notes=n1, but because the top-level query would look like:
select * from authors
lateral join books b
lateral join comments c
where
(b.condition_1 AND (b.condition_2 OR c.condition_3))
OR
(b.condition_4 AND (b.condition_5 OR c.condition_6))
Because we can't "keep together" both b.condition_1 and b.condition_2 (because we have to stay at the top-level, to keep access to c.condition_3), then two conditions could drift and eval against different children.
Also an example of why CTEs might perform better than CROSS JOIN LATERALs:
-- 360ms (execution 84 ms, fetching 273 ms)
select a. *
from approvals as a
cross join lateral (
select count ( * ) as _
from approvers as a1
where a.id = a1.approval_id AND ( a1.user_internal_user_id = 12 )
) as a1
where a1._ > 0
order by a.due_on ASC, a.id ASC
-- 155 ms (execution 5 ms, fetching 150 ms)
WITH approval_counts AS (
SELECT a.id, COUNT(*) as approver_count
FROM approvals a JOIN approvers a1 ON a.id = a1.approval_id
WHERE a1.user_internal_user_id = 12
GROUP BY a.id
)
SELECT a.*
FROM approvals a JOIN approval_counts ac ON a.id = ac.id
WHERE ac.approver_count > 0
ORDER BY a.due_on ASC, a.id ASC
This is problematic query from our production app:
select
b.*
from bills as b
inner join project_stages as ps on b.project_stage_id = ps.id
inner join projects as p on ps.project_id = p.id
cross join lateral (
select
count(*) as _,
BOOL_OR(ptm.internal_user_id = ANY($1)) as _ptm_internal_user_id_0,
BOOL_OR(ptm.project_role_id = $2) as _ptm_project_role_id_1,
BOOL_OR(ptm.project_role_id = ANY($3)) as _ptm_project_role_id_2
from project_team_members as ptm
where p.id = ptm.project_id
) as ptm
where (
ptm._ptm_internal_user_id_0
AND b.pending_review = $4 -- true
AND (
-- P1 and P2 should see bills w/reasons that are not AutoParsed
-- ptm.projectRole.in([ProjectRole.PurchasingOne, ProjectRole.PurchasingTwo])
(ptm._ptm_project_role_id_2 AND b.review_reason != $5 AND b.review_reason IS NOT NULL)
-- Accounts see AutoParsed
-- ptm.projectRole.eq(ProjectRole.ApAccountant), b.reviewReason.eq(BillReviewReason.AutoParsed)
OR (ptm._ptm_project_role_id_1 AND b.review_reason = $6)
)
) order by b.id ASC
-- What was happening
-- * Show me bills where I have any PTM
-- * There is some *other* PTM that is P1/P2 and the bill RR is != AutoParsed
-- * There is some *other* PTM that is AP and the bill RR is AutoParsed
At a minimum we want:
select
b.*
from bills as b
inner join project_stages as ps on b.project_stage_id = ps.id
inner join projects as p on ps.project_id = p.id
cross join lateral (
select
count(*) as _,
BOOL_OR(ptm.project_role_id = $2 AND b.review_reason = $6) as _ptm_condition_1,
BOOL_OR(ptm.project_role_id = ANY($3) AND b.review_reason != $5 AND b.review_reason IS NOT NULL) as as_ptm_condition_2
from project_team_members as ptm
where p.id = ptm.project_id
AND ptm.internal_user_id = me -- <== biggest win/fix
) as ptm
where (
b.pending_review = $4 -- true
AND (ptm._ptm_condition_2 OR ptm._ptm_condition_1)
) order by b.id ASC
Or at most:
with _ptms as (
select
ptm.project_id,
count(*) as _,
BOOL_OR(ptm.project_role_id = $2) as _ptm_role_1,
BOOL_OR(ptm.project_role_id = ANY($3)) as _ptm_role_2
from project_team_members as ptm
where
ptm.internal_user_id = me -- <== biggest win/fix
group by ptm.project_id
)
select
b.*
from bills as b
inner join project_stages as ps on b.project_stage_id = ps.id
inner join projects as p on ps.project_id = p.id
inner join _ptms as ptm on ptm.project_id = p.id
where
b.pending_review = $4 -- true
AND (
(_ptms._ptm_role_1 AND b.review_reason = $6)
OR (_ptms._ptm_role_2 AND b.review_reason != $5 AND b.review_reason IS NOT NULL)
)
order by b.id ASC
Fundamentally, the BOOL_OR approach represents an "any child" / "fuzzy child" match.
And if we end up with WHERE clauses like:
(any child) OR (any child)==> this is fine b/c the clauses are allowed to match on "different children, but(any child) AND (any child)==> this is not fine, b/c the clauses might have matched on "different children"
...also, I'm realizing that b/c the CTE approach cannot access data from the "parent" (i.e. they are not correlated queries), we're restricted from pushing down as many queries.
I.e. when em.find(Author)-ing:
- With
CROSS JOIN LATERALs, we could pusha.firstName=a1 AND b.title=b2into thebooksjoin, but - With CTEs, the
a.firstNamemust stay, and only theb.title=b2can get pushed down.
Mikro's version of multi-level loading:
await authorRepository.find(
{},
{
limit: size,
populate: ["books", "books.reviews", "books.tags"],
orderBy: { id: "ASC" },
},
);
select
"a0".*,
"b1"."id" as "b1__id", "b1"."title" as "b1__title", "b1"."author_id" as "b1__author_id", "b1"."published" as "b1__published", "b1"."pages" as "b1__pages", "b1"."created_at" as "b1__created_at", "b1"."updated_at" as "b1__updated_at",
"r2"."id" as "r2__id", "r2"."book_id" as "r2__book_id", "r2"."rating" as "r2__rating", "r2"."text" as "r2__text", "r2"."created_at" as "r2__created_at", "r2"."updated_at" as "r2__updated_at",
"t3"."id" as "t3__id", "t3"."name" as "t3__name", "t3"."created_at" as "t3__created_at", "t3"."updated_at" as "t3__updated_at"
from "author" as "a0"
left join "book" as "b1" on "a0"."id" = "b1"."author_id"
left join "book_review" as "r2" on "b1"."id" = "r2"."book_id"
left join "book_tag" as "b4" on "b1"."id" = "b4"."book_id"
left join "tag" as "t3" on "b4"."tag_id" = "t3"."id"
where "a0"."id" in (
select "a0"."id" from (
select "a0"."id" from "author" as "a0"
left join "book" as "b1" on "a0"."id" = "b1"."author_id"
left join "book_review" as "r2" on "b1"."id" = "r2"."book_id"
left join "book_tag" as "b4" on "b1"."id" = "b4"."book_id"
left join "tag" as "t3" on "b4"."tag_id" = "t3"."id"
group by "a0"."id"
order by min("a0"."id")
asc limit 1000
) as "a0"
)
order by "a0"."id" asc
Another example of "multiple left joins" tripping engineers up:
https://gist.github.com/rxliuli/be31cbded41ef7eac6ae0da9070c8ef8#avoiding-multi-table-left-joins
This will be another tough production test case:
export const costCodes: Pick<QueryResolvers, "costCodes"> = {
costCodes(root, { filter }, { em }) {
const { tradePartnerIds, version = [1], ...others } = filter ?? {};
const [c, cco, bc] = aliases(Commitment, Commitment, BidContract, BidItem);
return em.findGql(
CostCode,
{
...others,
version,
items: {
projectItems: {
commitmentLineItems: {
commitment: { as: c },
changeOrder: { commitment: { as: cco } },
},
},
bidItems: {
bidContractLineItems: {
revision: {
bidContract: { as: bc },
},
},
},
},
},
{
orderBy: { number: "ASC" },
conditions: {
// If we have any trade partner ids, we need to filter by them
or: tradePartnerIds?.nonEmpty
? [
c.tradePartner.in(tradePartnerIds),
cco.tradePartner.in(tradePartnerIds),
bc.tradePartner.in(tradePartnerIds),
]
: [],
},
},
);
},
};
Because that or reaches across two separate o2m paths.
Prisma:
await prisma.author.findMany({
where: {
OR: [
{ books: { some: { title: "b2", reviews: { some: { rating: 2 } } } } },
{ books: { some: { title: "b3", reviews: { some: { rating: 3 } } } } },
],
},
});
SELECT
"public"."author"."id",
"public"."author"."first_name",
"public"."author"."last_name",
"public"."author"."email",
"public"."author"."created_at",
"public"."author"."updated_at"
FROM "public"."author" WHERE (
EXISTS(
SELECT "t0"."author_id" FROM "public"."book" AS "t0"
WHERE ("t0"."title" = $1 AND
EXISTS(
SELECT "t1"."book_id" FROM "public"."book_review" AS "t1"
WHERE ("t1"."rating" = $2 AND ("t0"."id") = ("t1"."book_id") AND "t1"."book_id" IS NOT NULL)
)
AND ("public"."author"."id") = ("t0"."author_id")
AND "t0"."author_id" IS NOT NULL
)
)
OR
EXISTS(SELECT "t2"."author_id" FROM "public"."book" AS "t2" WHERE ("t2"."title" = $3 AND EXISTS(SELECT "t3"."book_id" FROM "public"."book_review" AS "t3" WHERE ("t3"."rating" = $4 AND ("t2"."id") = ("t3"."book_id") AND "t3"."book_id" IS NOT NULL)) AND ("public"."author"."id") = ("t2"."author_id") AND "t2"."author_id" IS NOT NULL))) OFFSET $5