dataloader icon indicating copy to clipboard operation
dataloader copied to clipboard

Allow full ecto functionality

Open jfrolich opened this issue 6 years ago • 19 comments

Now we can just load by id or a named association. It would be nice if we can support the full ecto functionality by having the ability to load any queryable. We cannot do batching on queryables, so that benefit is gone, but we can execute the queries in parallel, reducing waterfalls and apply simple caching. Any thoughts?

jfrolich avatar Mar 19 '18 08:03 jfrolich

The built in loading function does a bit more than I think you realize. Suppose you have a Category schema that has a slug column like "news", "weather", "opinion" that sort of thing. You can look those up via:

|> Dataloader.load(Source, {:one, Category}, slug: "news")

For truly arbitrary stuff though you only need the run_batch option. To pick an example from the tests: You've got users who can create posts, and posts have a title. Suppose I want to lookup all users who have made a post with some set of titles.

source =
  Dataloader.Ecto.new(Repo,
    query: &query/2,
    run_batch: &run_batch/5
  )

def run_batch(User, query, :post_title, titles, repo_opts) do
      query =
      from(
        u in query,
        join: p in assoc(u, :posts),
        where: p.title in ^titles,
        select: {p.title, u}
      )

    results =
      query
      |> Repo.all(repo_opts)
      |> Enum.group_by(fn {title, _} -> title end, fn {_, user} -> user end)

    for title <- titles, do: Map.get(results, title, [])
end

Then you use this via:

|> Dataloader.load(Source, {:one, User}, post_title: "Foo")

The run_batch function lets you do basically anything you want. If all you want is parallel execution then you're probably just better off with Task.async_stream in the ordinary case or async with Absinthe.

Avoiding ecto queries outside the context module is a central design principle of Dataloader, which I can talk more about if you want.

benwilson512 avatar Mar 19 '18 11:03 benwilson512

Avoiding ecto queries outside the context module is a central design principle of Dataloader, which I can talk more about if you want.

Ok, that might be a difference. Ideally you'd like to (lazy) load data within the context. That's not really practical with the current implementation, but my PR is a step into that direction.

Interested to hear your thoughts, but I don't really agree with the premise.

IMO the dataloader should work at a Repo level, and the contexts should leverage dataloader as tool on top of Ecto or another source, to efficiently load data in terms of IO. So batching, caching and paralyzing. The GraphQL API layer should just call the context, and the context is responsible for leveraging dataloader (or not). What you gain is that every part of your app will load data efficiently. Additionally every dataloading that happens in the business logic layer that does not directly map to graphql fields (like authorization) can now also be efficiently loaded (these things can make the current dataloader pretty worthless in the current form).

I agree that queries should not happen outside of contexts.

It would be nice if we have the ability to load any queryable even though they might not batch (it's not only a batching library!). One benefit for instance, is that it potentially would be possible to cache direct id lookups if they already appeared in a list for instance (not yet implemented, but would be quite trivial). If we cannot run the list query (might be a untypical query) through dataloader, we cannot do that.

We can do Task.async_stream indeed, but dataloader will abstract this away. As it knows the queries that are pending. Building another abstraction just like dataloader but for non-batchable queries would just add more complexity, and in my opinion dataloader is a dataloader and not just a tool to batch.

jfrolich avatar Mar 19 '18 15:03 jfrolich

That's an interesting take, and I'll have to think about it more. My concern is that you'd either have an enormous proliferation of API functions from the context so you could grab stuff from GraphQL, or you'd simply be duplicating the dataloader API and then you're just using dataloader from the resolvers anyway.

Setting aside questions of how Dataloader should be used for a moment though, how do you feel that Dataloader itself prevents you from doing what you want? The built in Ecto source takes the stance I've articulated for sure, but you could simply write another source that works differently. The built in Ecto source can't and won't be going undergoing a conceptual rewrite any time soon, at a minimum the book depends on its current functionality.

The point of using a protocol however is to let other folks do what they want, and I think that there are a wide variety of options available.

benwilson512 avatar Mar 19 '18 16:03 benwilson512

So the issue above can be implemented in the Ecto source as is. Why not add more flexibility?

As for having multiple Ecto sources and these being contexts, that's a design decision but it doesn't really matter too much as you can just create a single Ectoloader module. So I have no problem with that.

As for the dataloader in the current form, yes it severely limits what I can do. I'm not married to the approach. Dataloader now is nice if your database maps exactly to your graphql schema and your context does not have logic that needs to reach out to the database like authorization logic.

It's not so much an interesting take to do the dataloading in the business logic (context), but it is the intended purpose of dataloader (the facebook version).

I don't get how this approach will result in a proliferation of API functions. The resolver will just call Posts.get(id, account) for each Post so the API surface area is very small. And the resolver's only responsibility is call the right context functions (no data loading).

If you use the dataloader in the get function of the context, it will be executed efficiently. See my PR how that is possible by using a deferrable so the get function is not executed eagerly.

Anyway to have the ability to use dataloader in contexts, it just needs an abstraction on top of the current form of Dataloader, like the Lazyloader implementation in the PR. So the API of dataloader won't change (it just adds another module). I encourage you to take a look at it.

jfrolich avatar Mar 20 '18 03:03 jfrolich

If you don't think the Lazyloader approach should be part of dataloader, I can publish it as a seperate package. However in my opinion this should be part of dataloader (as it's also part of the other instances of dataloader in other languages).

jfrolich avatar Mar 20 '18 04:03 jfrolich

That's an interesting take, and I'll have to think about it more. My concern is that you'd either have an enormous proliferation of API functions from the context so you could grab stuff from GraphQL, or you'd simply be duplicating the dataloader API and then you're just using dataloader from the resolvers anyway.

I don't understand how the context "grabs stuff" from graphql. I think you are referring to re-using the dataloader state across context functions.

jfrolich avatar Mar 20 '18 04:03 jfrolich

As for the dataloader in the current form, yes it severely limits what I can do.

I was hoping you would offer specifics about what you find lacking in the Dataloader.Source protocol. I know that you find the Dataloader.Ecto implementation lacking but I'm trying to figure out if you feel that the protocol functions found within Dataloader.Source make it impossible for you to achieve what you want with a different source.

it is the intended purpose of dataloader (the facebook version).

Correct me if I'm wrong but the Facebook dataloader exclusively does batched key value lookups, so we're already pretty far afield from its functionality.

Let me take a step back and articulate the problem I was trying to solve with Dataloader. Dataloader exists to solve not just a technical set of problems, but also some conceptual and structural problems that arise in projects.

The Problem

Phoenix Contexts forced Absinthe to re-think data access. The most important characteristic of a context in my view is that it has the responsibility of enforcing your business rules with respect to the data and functionality in its charge. This is I think most important for mutations to that data, but it applies to queries as well.

However, this poses a problem for both Absinthe and even regular controllers. For Absinthe, the assoc/2 helpers from absinthe_ecto aren't really compatible, because by directly walking all the ecto associations there's no easy to enforce data access rules from within the context (IE, don't return things with deleted: true to pick a simple example). With absinthe_ecto out, Absinthe was back to the N+1 problem. Controllers have all of these same problems because they used Repo.preload all the time which would also skip data access rules, and any cross-context lookups required obnoxious helper functions to avoid N+1.

Dataloader

Central to the Dataloader implementation then was the idea that all of the actual data retrieval and access would happen within the context code. The context would have both full access to all of the data in its domain, as well as the ability to apply restrictions and limit access patterns for any query that came through Dataloader. The exact manner by which this happens of course depends on the Dataloader source, and I very explicitly didn't want Dataloader to be Ecto specific.

If you have context A and context B, and A can call a function in B and pass in an Ecto query that relates to B's schemas, then we're violating that control. The whole point is to prevent arbitrary SQL access at arbitrary points in the application. If A wants to use Dataloader within itself to load things, and you want to do so with Ecto query inputs then that seems fine, but at this point we're just basically re-creating Ecto.Repo with a different API.

benwilson512 avatar Mar 20 '18 12:03 benwilson512

I agree with all of the above. But I don't think you get my point.

The above explains how you design your modules, agree 100%. However if you use a module you shouldn't be exposed to dataloader at all. Using dataloader within a context should just be an implementation detail inside of contexts, as a consumer of a context (for instance for a GraphQL API layer, I shouldn't have to know or care about it.

The point is that in the current implementation if I use the dataloader inside a function in a context, for instance Post.get(id, actor), it doesn't batch if this function is called multiple times (I can only batch within the function).

In the current implementation the dataloader is exposed in the graphql layer (should need to be IMO), the graphql layer should just translate a request for something into the right call to the internal API (context). Much like controllers (you call the context, you aren't exposed to Ecto).

How can we do that in Elixir when we execute eagerly and don't have a concept like ticks (the javascript dataloader implementation uses a hack where it waits 1 tick before executing queries so it can batch queries that happen at the same time)?

Well we need to have lazy evaluation. I did this by implementing a Deferrable protocol. A function can return a Deferrabla then when I want to evaluate this Deferrable I need to do that explicitly. The graphql layer can evaluate these deferables so that all queries are batched/cached etc efficiently.

With this solution the Dataloader just becomes an implementation detail in a context. The only thing we need is a simple middleware that knows how to handle deferred values.

As for the concrete example it is impossible with the current dataloader implementation to do authorization of fields (for each posts in a list of posts) efficiently whereby you need to query associations. Preloads can be a solution, but that only works for simple cases, and is not ideal. I think this is something that is needed in almost any non-trivial app, so not a very contrived example.

jfrolich avatar Mar 21 '18 10:03 jfrolich

I know that you find the Dataloader.Ecto implementation lacking but I'm trying to figure out if you feel that the protocol functions found within Dataloader.Source make it impossible for you to achieve what you want with a different source.

As per above, it's not really about Source. However it would be nice to make Dataloader.Ecto feature complete so we can run any query (small change).

Correct me if I'm wrong but the Facebook dataloader exclusively does batched key value lookups, so we're already pretty far afield from its functionality.

If you use Facebook's javascript dataloader in the context/model whatever the name where the business logic is it will batch repeat calls, like this:

Post.get(1)
Post.get(2)
Post.get(3)
Post.get(1)

That only results in a single query. Now if I use dataloader in my elixir get context function it will not batch at al. It will only batch within get.

So that is the functionality that is lacking now. And in my opinion it's the most important feature.

If the Post.get simply get's a record from the posts table, we can solve that with the elixir Dataloader. However the get probably queries the roles table as well to see if the current user has a role in this post, if it's not a public posts. If that is not the case the post might have a parent post so we'd like to query the parent post to see if the current user has a role in the parent posts (for instance). Another example we'd might want to check if the current user is a friend of one of the authors in the post.

These queries related to authorization cannot be batched currently, but they are the largest source for n+1 issues because it's usually multiple queries per resource.

jfrolich avatar Mar 21 '18 10:03 jfrolich

there's no easy to enforce data access rules from within the context (IE, don't return things with deleted: true to pick a simple example

Not showing deleted posts is really the job of the function that returns a list of posts in the context IMO, like if I run:

Posts.of_user(user_id)

But I might be interested in the deletions in some cases:

Posts.of_user_including_deleted(user_id)

The dataloader should query whatever the query is inside of those functions. If we have more complex rules like a deleted field in associations, IMO it's better to have these in a separate function.

Controllers have all of these same problems because they used Repo.preload all the time which would also skip data access rules, and any cross-context lookups required obnoxious helper functions to avoid N+1.

I don't think you should use Repo in a controller, so I don't know why a controller would use Repo.preload. But alas if you use Repo.preload in business logic, you are able to use a queryable within preload to control which association to preload.

I agree that the problem is not unique to graphql or controllers or however you'd like to talk to the business logic. So that shows that the solution to this should be in the business logic.

jfrolich avatar Mar 21 '18 10:03 jfrolich

These are a lot of great ideas and I think I better understand your position now. In essence, you want Dataloader to be an implementation detail of the context, instead of an explicit part of the API of the context. Once it's an implementation detail of the context then we should feel free to pass in any argument value (like queries) that we want.

I actually feel like we're closer to being on the same page than may be obvious. Dataloader started off as "how can I make an API where the actual data retrieval mechanism is an implementation detail". Deferrables are trying to say "how can I make even using dataloader an implementation detail". I'm not entirely convinced however that deferrables aren't exactly what dataloader is, just with guarantees around the API to try to actually get batching.

Here are my questions:

deferred_post_1 = Blog.get_post(1, actor)
deferred_post_2 = Blog.get_post(2, actor)
  1. what is actor?
  2. What does get_post look like?

To expand on #2, I'm unclear on how batching actually happens. Maybe I need to simply dig into your implementation of the Deferrables protocol.

As for the concrete example it is impossible with the current dataloader implementation to do authorization of fields (for each posts in a list of posts) efficiently whereby you need to query associations.

This seems seems entirely possible to me, so maybe working through this example will help further this discussion. Can you expand on this situation a bit so I can sketch a "concrete" solution?

benwilson512 avatar Mar 21 '18 12:03 benwilson512

Nice. I'll be speaking at a meetup about this. So I can work on explaining this in a bit more detail and work on the implementation.

Let me work out the example in code, so you'll have a better idea why it's not possible with the current dataloader (or there might be a solution for it, I'll doubt it would be a good solution but maybe I'll just don't see it😅).

I will also work out the example using the proposed solution of the PR.

  1. what is actor?

it's the acting user (a name convention we use in our codebase)

  1. What does get_post look like?

I'll write an example in code but basically:

  1. fetching post,
  2. checking if user can open post (with the data fetching necessary to assert that) and
  3. returning {:ok, post} or {:error, msg}

jfrolich avatar Mar 21 '18 13:03 jfrolich

I actually feel like we're closer to being on the same page than may be obvious. Dataloader started off as "how can I make an API where the actual data retrieval mechanism is an implementation detail". Deferrables are trying to say "how can I make even using dataloader an implementation detail". I'm not entirely convinced however that deferrables aren't exactly what dataloader is, just with guarantees around the API to try to actually get batching.

Dataloader is indeed 90% there. Deferrables solve the last 10%. But let me work it out in code so it's clearer to see.

jfrolich avatar Mar 21 '18 13:03 jfrolich

Any progress on this front ? Pretty interested in this topic and curious to see where it leads eventually. @benwilson512 did you have a chance to watch @jfrolich talk ?

tlvenn avatar May 28 '18 03:05 tlvenn

Hey @jfrolich @tlvenn I just wanted to provide an update. The last 6 months have been overwhelmingly busy for me both with Cargosense stuff as well as personally. Fortunately we've made a lot of progress on Absinthe 1.5 and hope to release it in the week after ElixirConf. At that point I look forward to focusing on this question.

FWIW we've run into this too. It is not possible with the current Dataloader functionality to use it within a context function, which makes it very difficult to have context functions that take on complex business logic while still loading efficiently. I'm very interested in your deferred execution approach. I'm also curious about what could be done with a multiprocess approach.

I apologize for letting this languish, but I am hopeful that I will be able to return to it soon.

benwilson512 avatar Sep 03 '18 15:09 benwilson512

@benwilson512: Same with my startup as well, so I also left it in an incomplete state. Looking forward working on a good solution/API for this!

Looking forward to your talk!

jfrolich avatar Sep 04 '18 03:09 jfrolich

@benwilson512 do you have thoughts on these kind of these kinds of loading issues in 2021? Is there a good answer to this with the current iterations of Absinthe / Dataloader?

eprothro avatar Mar 30 '21 21:03 eprothro

Hi @eprothro , this is a pretty old topic, what kind of issues are you trying to solve that the current Dataloader does not currently support ?

tlvenn avatar Mar 31 '21 01:03 tlvenn

Hi @tlvenn I don't have a concrete issue. I'm evaluating implementing dataloader on a project and was curious about usage from context functions and more complex authorization scenarios. Ben and Jaap mentioned those here, so was curious of their take today and what patterns they're using for the concepts raised in this issue.

eprothro avatar Apr 02 '21 16:04 eprothro