rails icon indicating copy to clipboard operation
rails copied to clipboard

Support distributed preloading calls

Open jonathanhefner opened this issue 1 year ago • 2 comments

This PR is split into two commits.

The 1st commit allows records from already-loaded relations to be used when performing post-hoc preloads of downstream associations.

The 2nd commit modifies the Preloader to track preloaded associations in buckets, and, when preloading a downstream association, to preload that association for all owner records from the same bucket.

The goal of this PR is to allow individual includes and preload calls to be placed alongside the code that depends on them. This makes it easier to add a preload when needed or remove a preload when it is no longer needed.

Consider the following code:

def render_posts(posts)
  posts.includes(:comments).each do |post|
    post.comments.each do |comment|
      # ...
    end
  end
end

def render_authors(authors)
  authors.includes(:posts).each do |author|
    render_posts(author.posts)
  end
end

render_authors(Author.all)

Before the 1st commit, with N authors each having at least 1 post, the code would execute 2N + 2 queries. For example, with 5 authors:

SELECT "authors".* FROM "authors"
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 1]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 11]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 2]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 12]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 3]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 13]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 4]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 14]]
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" = ?  [["author_id", 5]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 15]]

After the 1st commit, the code executes N + 2 queries:

SELECT "authors".* FROM "authors"
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 11]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 12]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 13]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 14]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" = ?  [["post_id", 15]]

After the 2nd commit, the code executes 3 queries:

SELECT "authors".* FROM "authors"
SELECT "posts".* FROM "posts" WHERE "posts"."author_id" IN (?, ?, ?, ?, ?)  [["author_id", 1], ["author_id", 2], ["author_id", 3], ["author_id", 4], ["author_id", 5]]
SELECT "comments".* FROM "comments" WHERE "comments"."post_id" IN (?, ?, ?, ?, ?)  [["post_id", 11], ["post_id", 12], ["post_id", 13], ["post_id", 14], ["post_id", 15]]

Related: #45161, #45413, #45231. The primary difference of this PR is that users must still opt in to preloading via includes or preload. The intent is to allow fine-grained control over when preloading happens, but to make cascading automatic.

jonathanhefner avatar Aug 01 '22 18:08 jonathanhefner

Very cool idea. I think the final SQL output is similar to what you'd get if using goldiloader, without any includes needed. I know goldiloader isn't part of Rails (and it's great to see this in Rails!) but it would be good to confirm if these changes are compatible with it.

ghiculescu avatar Aug 01 '22 19:08 ghiculescu

@ghiculescu Thanks! :smiley: If my understanding is correct, I think #45413 is closer to goldiloader.

In this PR, I wanted to take a more conservative approach, to try to avoid situations where deeply nested code must defensively disable the feature for fear of loading too much data. Granted, that can still happen with this PR, but my hope is that restricting the feature to explicit (distributed) chains of includes / preload calls will greatly reduce the risk.

jonathanhefner avatar Aug 01 '22 19:08 jonathanhefner

Yeah, glad to see another approach around this. I took a shot at this style of api before I submitted my proposed approach. The feedback I received I would like to share.

Many people thought it was not a great ergonomic to have to opt in to the include since that is more frequently the preferred mode and they preferred an opt out.

The tricky thing I think is balancing the shortest code to write that minimizes mistakes we care about most. Large data loads could be a problem but which is our api optimized to fix? Should it more eagerly load to prevent the n+1 or should it eagerly optimize for large data flows and make fixing n+1 queries an opt in?

gwincr11 avatar Aug 21 '22 21:08 gwincr11

I do wonder if perhaps a new api is needed that does not exist. The concern we are working with here is fairly well understood when you maybe loading a lot of records directly from the database, we use the batch api for this. What if the associations allowed for a batch like method to be called?

Something like

post.comments_in_batches.each

This would allow the user to limit the number of things brought back from the database in a familiar way and also allow dynamic includes to be the default. It also provides a familiar api for working with large data sets.

gwincr11 avatar Aug 22 '22 13:08 gwincr11