joist-orm icon indicating copy to clipboard operation
joist-orm copied to clipboard

Queries with multiple o2m joins are expensive

Open stephenh opened this issue 1 year ago • 9 comments

Joist's em.find API makes it easy to "filter by children", i.e. something like:

em.find(Author, { books: { title: "b1" } });

Will find Author have that "at least one" book with a title of b1.

Currently, Joist implements this by LEFT JOIN-ing into books and then DISTINCT-ing on the a.id.

This means we only return 1 author row over the wire (good), but only after asking Postgres to dedupe the author+books rows.

Usually this is fine, until we filter on two o2m collections:

em.find(Author, { books: { title: "b1" }, comments: { text: "c1" } });

This is saying find Authors that have 1 book with "title=b1" and 1 comment with "text=c1".

However, when we LEFT JOIN books and LEFT JOIN comments, postgres creates 1 tuple per author x book x comment combination.

When there many children books/comments, this can create a very large result set that postgres has to create in-memory -- only to DISTINCT it away to return "just the author".

Ideally we'd like evaluate this query by still joining into books and comments, but having those joins produce only a single row, effectively a single roll up/aggregate row that presents "all of this author's books" and "all of this author's comments", which will prevent the combinatorial explosion and mean we don't have to DISTINCT anymore.

stephenh avatar Jan 08 '25 15:01 stephenh

An initial fix of this issue tried to use LATERAL JOINs that would:

  • CROSS JOIN LATERAL into child1
  • Use an aggregate query like count(*) as _ to ensure 1-and-only-1 row was returned
  • For any complex conditions, like "at least one child title is b1", use BOOL_OR(...condition...), along side the count

The idea was that BOOL_OR would both "eval this condition against children" but also "only return the aggregate yes/no matched" instead of the N rows that cause our join explosion.

However, this approach had issues with complex conditions (i.e. not inline conditions) that wanted to do ANDs, i.e. this em.find:

    const b = alias(Book);
    const where = { books: b } satisfies AuthorFilter;
    const authors = await em.find(Author, where, {
      conditions: { and: [b.title.like("fo%"), b.title.like("%ar")] },
    });

Would generate the SQL:

SELECT a.* FROM authors AS a
CROSS JOIN LATERAL (
  SELECT count(*) as _,
  BOOL_OR(b.title LIKE $1) as _b_title_0,
  BOOL_OR(b.title LIKE $2) as _b_title_1
  FROM books AS b
  WHERE a.id = b.author_id AND b.deleted_at IS NULL
) AS b
WHERE a.deleted_at IS NULL AND b._b_title_0 AND b._b_title_1
ORDER BY a.id ASC

This "seems fine", but the semantics of this BOOL_OR query are:

  • Did "any child" have a title like $1,
  • Did "any child" have a title like $2,

Vs. the old (correct) semantics:

  • Did "a specific child have both title like $1 and title like $2"

To fix this, we need these two conditions to be pushed down together, i.e. something like:

CROSS JOIN LATERAL (
  SELECT count(*) as _,
  -- we've got both `AND`s being evaled against "the same book" 
  BOOL_OR(b.title LIKE $1 AND b.title LIKE $2) as _combined
  FROM books ...
) AS b
WHERE b._combined

However, it's not clear how this would work in more complex conditions like:

em.find(Author, { as: a, books: b }, conditions: {
  and: [
    b.title.like("fo%"),
    -- now the "two title checks" are in different clauses
    { or: [b.title.like("%ar"), a.firstName.eq("a1") ] },
  ]
});

We also need to handle queries across levels, i.e.:

em.find(Author, { books: { as: b, reviews: r } }, conditions: {
  or: [
    -- does a specific book have title t1 + any of t1's children have a rating=3 
    { and: [b.title.eq("t1"), br.rating.eq(3)] },
    -- does a specific book have title t2 + any of t2's children have a rating=4
    { and: [b.title.eq("t2"), br.rating.eq(4) ] },
  ]
});

Where the tuples of:

  • [book1, [book1's ratings]
  • [book2, [book2's ratings]
  • [book3, [book3's ratings]

Need to be independently evaluated as a whole.

Ironically/unfortunately this materialization is exactly what the OUTER JOINs were doing.


Maybe the worst query (and one I think is on the table for "we won't support" :thinking: ) is an OR condition that looks at separate children collections, i.e.:

em.find(Author, { books: b, comments: c } }, conditions: {
  or: [
    { and: [b.title.eq("t1"), c.text.eq("good")] },
    { and: [b.title.eq("t2"), c.text.eq("bad") ] },
  ]
});

Where the query really is asking for "every book x every comment" combination to generated and eval'd through the complex condition (...or is it, maybe it's validly asking "do you have any child book with title t1 & any child comment with text "good"--it seems like that would be possible to eval w/o materializing the book x comment combinations).

I think we either need to decide whether:

  1. The prior OUTER JOINs + DISTINCT approach was "fine" in terms of easily/naturally supporting any complex condition the user happened to write (even this last category of "cross child relations") and we should keep using it, with the know issue of "yes, you'll eventually have to write hand-written SQL for queries that get slow"

  2. We can introduce some restrictions (i.e. no "cross child complex conditions") to complex conditions + find a way to rewrite "allowed within child/nested children conditions" to a query structure that preserves the previous semantics

  3. Some of our queries are just fundamentally past the point of "can be expressed in an ORM DSL and do the right / most performant thing".

stephenh avatar Jan 08 '25 16:01 stephenh

What we want to ensure is that, within the query's complex condition, any child alias is "consistently evaluated", i.e. given rows like:

  • books id=b:1 title=b1 notes=b1
  • books id=b:2 title=b2 notes=n2
  • comments id=c:1 text=c1
  • `comments id=c:2 text=c2

That an em.find(Author, {}, conditions: ...) would not "mistakenly pass" by having some of it's condition evaluated against book 1 / title=b1 but another part evaluated against book 2 / notes=n2, when fundamentally there isn't "a singular book" that has both of those conditions true.

Note that within a specific condition, like and: [b.title.eq(b1), b.notes.eq(n2), we can push both of those conditions down together.

The wrinkle would be more esoteric conditions, that involved nested AND/ORs, albeit if they involve "ORs", then it might be fine to use "the other book"? :thinking:

Something like:

em.find(
  Author,
  { books: b, comments: c },
  conditions: {
    or: [
      { and: [b.title.eq("b1"), { or: [b.notes.eq("n2"), c.text.eq("c3")] } },
      { and: [b.title.eq("b2"), { or: [b.notes.eq("n1"), c.text.eq("c3")] } },
    ]
  },
);

Would, given the above rows, not pass because there isn't a singular b that has both title=b1 and notes=n2 or title=b2 and notes=n1, but because the top-level query would look like:

select * from authors
  lateral join books b
  lateral join comments c 
where
  (b.condition_1 AND (b.condition_2 OR c.condition_3))
  OR
  (b.condition_4 AND (b.condition_5 OR c.condition_6))

Because we can't "keep together" both b.condition_1 and b.condition_2 (because we have to stay at the top-level, to keep access to c.condition_3), then two conditions could drift and eval against different children.

stephenh avatar Jan 10 '25 17:01 stephenh

Also an example of why CTEs might perform better than CROSS JOIN LATERALs:

-- 360ms (execution 84 ms, fetching 273 ms)
select a. *
from approvals as a
cross join lateral (
  select count ( * ) as _
  from approvers as a1
  where a.id = a1.approval_id AND ( a1.user_internal_user_id =  12 )
) as a1
where a1._ > 0
order by a.due_on ASC, a.id ASC

-- 155 ms (execution 5 ms, fetching 150 ms)
WITH approval_counts AS (
  SELECT a.id, COUNT(*) as approver_count
  FROM approvals a JOIN approvers a1 ON a.id = a1.approval_id
  WHERE a1.user_internal_user_id = 12
  GROUP BY a.id
)
SELECT a.*
FROM approvals a JOIN approval_counts ac ON a.id = ac.id
WHERE ac.approver_count > 0
ORDER BY a.due_on ASC, a.id ASC

stephenh avatar Jan 10 '25 19:01 stephenh

This is problematic query from our production app:

select
  b.*
from bills as b
inner join project_stages as ps on b.project_stage_id = ps.id
inner join projects as p on ps.project_id = p.id
cross join lateral (
  select
    count(*) as _,
    BOOL_OR(ptm.internal_user_id = ANY($1)) as _ptm_internal_user_id_0,
    BOOL_OR(ptm.project_role_id = $2) as _ptm_project_role_id_1,
    BOOL_OR(ptm.project_role_id = ANY($3)) as _ptm_project_role_id_2
  from project_team_members as ptm
  where p.id = ptm.project_id
) as ptm
where (
  ptm._ptm_internal_user_id_0
  AND b.pending_review = $4 -- true
  AND (
    -- P1 and P2 should see bills w/reasons that are not AutoParsed
    -- ptm.projectRole.in([ProjectRole.PurchasingOne, ProjectRole.PurchasingTwo])
    (ptm._ptm_project_role_id_2 AND b.review_reason != $5 AND b.review_reason IS NOT NULL)
    -- Accounts see AutoParsed
    -- ptm.projectRole.eq(ProjectRole.ApAccountant), b.reviewReason.eq(BillReviewReason.AutoParsed)
    OR (ptm._ptm_project_role_id_1 AND b.review_reason = $6)
  )
) order by b.id ASC
-- What was happening
-- * Show me bills where I have any PTM
-- * There is some *other* PTM that is P1/P2 and the bill RR is != AutoParsed
-- * There is some *other* PTM that is AP and the bill RR is AutoParsed

At a minimum we want:

select
  b.*
from bills as b
inner join project_stages as ps on b.project_stage_id = ps.id
inner join projects as p on ps.project_id = p.id
cross join lateral (
  select
    count(*) as _,
    BOOL_OR(ptm.project_role_id = $2 AND b.review_reason = $6) as _ptm_condition_1,
    BOOL_OR(ptm.project_role_id = ANY($3) AND b.review_reason != $5 AND b.review_reason IS NOT NULL) as as_ptm_condition_2
  from project_team_members as ptm
  where p.id = ptm.project_id
    AND ptm.internal_user_id = me -- <== biggest win/fix
) as ptm
where (
  b.pending_review = $4 -- true
  AND (ptm._ptm_condition_2 OR ptm._ptm_condition_1)
) order by b.id ASC

Or at most:

with _ptms as (
  select
    ptm.project_id,
    count(*) as _,
    BOOL_OR(ptm.project_role_id = $2) as _ptm_role_1,
    BOOL_OR(ptm.project_role_id = ANY($3)) as _ptm_role_2
  from project_team_members as ptm
  where 
    ptm.internal_user_id = me -- <== biggest win/fix
  group by ptm.project_id
)
select
  b.*
from bills as b
inner join project_stages as ps on b.project_stage_id = ps.id
inner join projects as p on ps.project_id = p.id
inner join _ptms as ptm on ptm.project_id = p.id
where
  b.pending_review = $4 -- true
  AND (
    (_ptms._ptm_role_1 AND b.review_reason = $6)
    OR (_ptms._ptm_role_2 AND b.review_reason != $5 AND b.review_reason IS NOT NULL)
  )
order by b.id ASC

stephenh avatar Jan 11 '25 16:01 stephenh

Fundamentally, the BOOL_OR approach represents an "any child" / "fuzzy child" match.

And if we end up with WHERE clauses like:

  • (any child) OR (any child) ==> this is fine b/c the clauses are allowed to match on "different children, but
  • (any child) AND (any child) ==> this is not fine, b/c the clauses might have matched on "different children"

...also, I'm realizing that b/c the CTE approach cannot access data from the "parent" (i.e. they are not correlated queries), we're restricted from pushing down as many queries.

I.e. when em.find(Author)-ing:

  • With CROSS JOIN LATERALs, we could push a.firstName=a1 AND b.title=b2 into the books join, but
  • With CTEs, the a.firstName must stay, and only the b.title=b2 can get pushed down.

stephenh avatar Jan 12 '25 19:01 stephenh

Mikro's version of multi-level loading:

    await authorRepository.find(
      {},
      {
        limit: size,
        populate: ["books", "books.reviews", "books.tags"],
        orderBy: { id: "ASC" },
      },
    );
select
  "a0".*,
  "b1"."id" as "b1__id", "b1"."title" as "b1__title", "b1"."author_id" as "b1__author_id", "b1"."published" as "b1__published", "b1"."pages" as "b1__pages", "b1"."created_at" as "b1__created_at", "b1"."updated_at" as "b1__updated_at",
  "r2"."id" as "r2__id", "r2"."book_id" as "r2__book_id", "r2"."rating" as "r2__rating", "r2"."text" as "r2__text", "r2"."created_at" as "r2__created_at", "r2"."updated_at" as "r2__updated_at",
  "t3"."id" as "t3__id", "t3"."name" as "t3__name", "t3"."created_at" as "t3__created_at", "t3"."updated_at" as "t3__updated_at"
from "author" as "a0"
left join "book" as "b1" on "a0"."id" = "b1"."author_id"
left join "book_review" as "r2" on "b1"."id" = "r2"."book_id"
left join "book_tag" as "b4" on "b1"."id" = "b4"."book_id"
left join "tag" as "t3" on "b4"."tag_id" = "t3"."id"
where "a0"."id" in (
  select "a0"."id" from (
    select "a0"."id" from "author" as "a0"
    left join "book" as "b1" on "a0"."id" = "b1"."author_id"
    left join "book_review" as "r2" on "b1"."id" = "r2"."book_id"
    left join "book_tag" as "b4" on "b1"."id" = "b4"."book_id"
    left join "tag" as "t3" on "b4"."tag_id" = "t3"."id"
    group by "a0"."id"
    order by min("a0"."id")
    asc limit 1000
  ) as "a0"
)
order by "a0"."id" asc

stephenh avatar Apr 01 '25 03:04 stephenh

Another example of "multiple left joins" tripping engineers up:

https://gist.github.com/rxliuli/be31cbded41ef7eac6ae0da9070c8ef8#avoiding-multi-table-left-joins

stephenh avatar Apr 08 '25 14:04 stephenh

This will be another tough production test case:

export const costCodes: Pick<QueryResolvers, "costCodes"> = {
  costCodes(root, { filter }, { em }) {
    const { tradePartnerIds, version = [1], ...others } = filter ?? {};
    const [c, cco, bc] = aliases(Commitment, Commitment, BidContract, BidItem);
    return em.findGql(
      CostCode,
      {
        ...others,
        version,
        items: {
          projectItems: {
            commitmentLineItems: {
              commitment: { as: c },
              changeOrder: { commitment: { as: cco } },
            },
          },
          bidItems: {
            bidContractLineItems: {
              revision: {
                bidContract: { as: bc },
              },
            },
          },
        },
      },
      {
        orderBy: { number: "ASC" },
        conditions: {
          // If we have any trade partner ids, we need to filter by them
          or: tradePartnerIds?.nonEmpty
            ? [
                c.tradePartner.in(tradePartnerIds),
                cco.tradePartner.in(tradePartnerIds),
                bc.tradePartner.in(tradePartnerIds),
              ]
            : [],
        },
      },
    );
  },
};

Because that or reaches across two separate o2m paths.

stephenh avatar May 02 '25 16:05 stephenh

Prisma:

    await prisma.author.findMany({
      where: {
        OR: [
          { books: { some: { title: "b2", reviews: { some: { rating: 2 } } } } },
          { books: { some: { title: "b3", reviews: { some: { rating: 3 } } } } },
        ],
      },
    });
SELECT
 "public"."author"."id",
 "public"."author"."first_name",
 "public"."author"."last_name",
 "public"."author"."email",
 "public"."author"."created_at",
 "public"."author"."updated_at"
FROM "public"."author" WHERE (
  EXISTS(
    SELECT "t0"."author_id" FROM "public"."book" AS "t0"
    WHERE ("t0"."title" = $1 AND
      EXISTS(
        SELECT "t1"."book_id" FROM "public"."book_review" AS "t1"
        WHERE ("t1"."rating" = $2 AND ("t0"."id") = ("t1"."book_id") AND "t1"."book_id" IS NOT NULL)
      )
      AND ("public"."author"."id") = ("t0"."author_id")
      AND "t0"."author_id" IS NOT NULL
    )
  )
 OR
   EXISTS(SELECT "t2"."author_id" FROM "public"."book" AS "t2" WHERE ("t2"."title" = $3 AND EXISTS(SELECT "t3"."book_id" FROM "public"."book_review" AS "t3" WHERE ("t3"."rating" = $4 AND ("t2"."id") = ("t3"."book_id") AND "t3"."book_id" IS NOT NULL)) AND ("public"."author"."id") = ("t2"."author_id") AND "t2"."author_id" IS NOT NULL))) OFFSET $5

stephenh avatar May 04 '25 03:05 stephenh