efcore icon indicating copy to clipboard operation
efcore copied to clipboard

Investigate single query related entity loading and orderings

Open roji opened this issue 1 year ago • 28 comments

When loading related entities in single query mode, our query pipeline currently injects orderings which make all related rows be grouped together. Our shaper relies on this ordering for assigning the related dependents to their correct principal. This issue is about investigating removing those orderings, and using client-side dictionary (or identity) lookups to find the principal instead.

  • Orderings generally impact query planning in a significant way, and require the database to do a lot of work. We've received quite a few user reports about these orderings; we still need to investigate this thoroughly, but it makes sense that the orderings would regress perf significantly in various scenarios.
  • In general, we should strive to remove as much load from the database, even at the cost of running slower at the client, since the database tier is far harder to scale than the application tier. The orderings effectively do the opposite, pushing more work down to the database.
  • One argument in favor of ordering is that it allows EF to stream the results, since all rows related to a principal are grouped together. If we remove the orderings, EF can't return a single principal before it consumes all rows, since there may be another dependent row at the end.
  • However, orderings prevent the database from streaming results back; we're basically pushing the buffering back to the server, increasing memory requirements there (as above, we should be doing the opposite and unloading the database).
  • This also negatively affects the latency of results, as the database starts sending rows back later. Removing the orderings would allow the database to return rows earlier, which is important.
  • We've raised the possibility of using identity resolution as a substitute for the orderings (but only if the queries are identified as buffering in some way). Depending on the perf impact, I think we should consider always doing this, either for AsTracking and NoTrackingWithIdentity, or possibly even for AsNoTracking (which makes it pretty much the same as NoTrackingWithIdentity). Note that we need to investigate why NoTrackingWithIdentity isn't efficient at the moment (#28579).
  • We may want to do the same for final GroupBy, which also injects orderings (#19929).

Thanks @NinoFloris for the conversation around this.

Issues on this:

  • #19571: previous issue where removing the orderings was discussed.
  • #20076: discussed distinguishing between buffering and streaming queries, and proposes removing tracking only for queries which are both buffering and tracking (via identity resolution).
  • #19828: issue for removing only the last ordering (done)

roji avatar Sep 20 '22 19:09 roji

We've raised the possibility of using identity resolution as a substitute for the orderings (but only if the queries are identified as buffering in some way). Depending on the perf impact, I think we should consider always doing this, in any case. This would effectively mean that NoTracking queries become NoTrackingWithIdentity.

One of the uses for AsNoTracking is to stream a result set that won't fit in memory. We shouldn't prevent this, even if it becomes opt-in.

ajcvickers avatar Sep 21 '22 09:09 ajcvickers

I'm wondering if it makes sense to allow users to stream huge results in EF, when doing so forces the database to buffer those same results (which is what the orderings do)... We're basically just pushing the buffering (=huge memory requirements) onto the database, no?

It's true that databases may have ways to better deal with this, e.g. use disk space as temporary storage while ordering the results. I certainly wouldn't want this to be anything close to a default behavior though, as perf there probably becomes truly disasterous.

I wonder if for the huge resultset scenario - which hopefully should be rare - it might not be better for users to deal with this by breaking their single LINQ query into two, separately processing principals and dependents. I'd certainly want to at least aggressively guide users towards considering this, rather than just slapping AsNoTracking to make things "just work", at the relatively hidden cost of increasing database memory usage (and CPU).

In any case, we definitely need to carefully consider what we do here, as changes here may be "breaking" (in the sense that client memory requirements may suddenly change in a very significant way).

(added needs-design)

roji avatar Sep 21 '22 09:09 roji

If a no-tracking query against a single table without ordering still streams efficiently, then that might be enough.

ajcvickers avatar Sep 21 '22 09:09 ajcvickers

Yeah, there's indeed no reason to buffer anywhere if only a single table is involved.

roji avatar Sep 21 '22 09:09 roji

One possible pattern to efficiently stream with relationships is the following (or some variation):

var posts = ctx.Posts
    .Select(p => new
    {
        Post = new Post { Id = p.Id, Title = p.Title },
        Blog = new Blog { Id = p.Blog.Id, Name = p.Blog.Name }
    })
    .ToList();

The point is to avoid the fixup; Assuming single query is used, each row contains all the information needed to materialize a result, and so should be able to stream. That could be acceptable as a user workaround for streaming huge result set scenario with relationships.

roji avatar Sep 21 '22 09:09 roji

A couple more notes on this...

First, when you have a single collection include, (Blogs.Include(b => b.Posts), an ORDER BY on the principal ID probably has little overhead, since there's an index (the primary). However, if you add another nested include (Blogs.Include(b => b.Posts).ThenInclude(p => p.Comments)), we have the second ordering on the Post ID. At this point an index cannot be used any more, and so removing the ordering becomes quite important.

Second, we think it's still important to support streaming (as opposed to buffering) for when results are huge. We should keep in mind that it's not great to achieve this at the expense of forcing the database to buffer, by adding orderings. Possibilities here include:

  • Reversing the query, starting from the dependent, and avoiding navigation fixup (see https://github.com/dotnet/efcore/issues/29171#issuecomment-1253463882).
  • Switch tracking queries to the new mode (without orderings), and leave no-tracking as the streaming mode.
    • This requires us to maintain both techniques (more complexity), and makes no-tracking even more problematic from a perf standpoint.
    • Since using the state manager imposes overhead, for cases where tracking isn't needed, AsNoTrackingWithIdentity becomes the new "best perf" option, as least for single query.
    • We'd have to make sure AsNoTrackingWithIdentity it performs well (see #28579), and educate people to use it. Many people just use AsNoTracking "to make stuff faster".

roji avatar Oct 17 '22 14:10 roji

Please consider non-entity queries (i.e. projecting to a DTO) which are inherently non-tracking and where I suspect identity resolution won't work as a method of buffering either. So buffering at some other level may be required to support this. We do a lot of projection with collections and have retries on so are buffering; dropping order by in this scenario would be great!

stevendarby avatar Nov 23 '22 03:11 stevendarby

Related question: included reference navigations get added to the order by - is this even needed? Could removing those be a simple win with few changes to streaming/buffering required? Can raise a separate issue if preferred.

stevendarby avatar Nov 23 '22 03:11 stevendarby

included reference navigations get added to the order by - is this even needed? Could removing those be a simple win with few changes to streaming/buffering required?

Interesting - yes, please open a separate issue; that sounds like a possibly cheap optimization.

roji avatar Nov 23 '22 07:11 roji

Hello @roji,

how is the topic going? Do you have any plans to close it in near future? It's really urgent for us because of the performance-optimizations topic and it's causing the main issue in our application.

For the workaround I created query interceptor which just throws out the last found ORDER BY clause, but I wonder if that's good solution after I read your's sentence:

Our shaper relies on this ordering for assigning the related dependents to their correct principal.

Please help us!

amaleszewski avatar May 15 '23 13:05 amaleszewski

@amaleszewski EF Core already refrains from generating the last ORDER BY clause (#19828, done in 6.0); if you're removing another ORDER BY, you're removing an important ordering that should lead to some bad loading bugs.

Other than that, this issue is in the backlog and it's unlikely we'll be able to do it for 8.0. However, it's on my list with a high priority, and it's likely I'll try to tackle this for 9.0.

roji avatar May 16 '23 07:05 roji

@roji Thanks for fast response! I throw out yesterday the query interceptor after thinking about it and just changed my one query from split-query to single-query. Thanks to that I reduced time of query from 9s to 40ms without removing the ORDER BY.

amaleszewski avatar May 16 '23 07:05 amaleszewski

I've been reading through the various issues on this and it seems like a complicated problem. So I appreciate a proper fix might be a way off.

However, are there any known workarounds at this time? We have a query that has 12 unnecessary order by clauses with no actual data being returned that increase query response from 0s to 47s. This is longer than the 30s timeout so the user gets an error.

I can't wait for this issue to be resolved - and this will presumably be in a future major version of EF anyway given the re-write you're talking about here. So guidance on a workaround would be greatly appreciated (perhaps added to the original post to save everyone scrolling?)

Thanks for all your hard work on this library.

hisuwh avatar Jun 30 '23 09:06 hisuwh

First off, in many cases where users thought that the orderings are the source of a performance problem, it turned out that it was actually using a single query to load many collect includes (either because of cartesian explosion, or because of duplicated data - see these docs). Switching to split query solved the performance problems completely, showing that ordering wasn't at all the source of the problem.

You write above that you have no actual data being returned; if that's the case, then performance is unlikely to be improve by switching to split query - though a) I'd give it a try, and b) it would be interesting (but definitely not impossible) that orderings would significantly affect such a query as well. So I'd advise trying to switch to split query, and then to confirm that the orderings are the culprit by comparing the raw SQL performance with and without the orderings in simple SQL test.

In any case, this is a high priority issue for me, and I intend to take a serious look at this for EF Core 9.

roji avatar Jun 30 '23 13:06 roji

confirm that the orderings are the culprit by comparing the raw SQL performance with and without the orderings in simple SQL test

@roji This is exactly what I did. Took the raw SQL generated by EF and put it into SSMS. Ran the query which took 47s. Removed the order statements at the end which resulted in it returning practically instantly.

hisuwh avatar Jul 07 '23 16:07 hisuwh

@hisuwh ok, thanks for confirming. This is definitely high up on my want list for 9.

roji avatar Jul 07 '23 16:07 roji

@hisuwh one thing to be mindful of is that without the order by, you might start to see results appearing immediately in SSMS as it streams the results back, giving the impression the query has finished, when it might not have. Be sure to note the actual query time. With an order by there can be more buffering before it starts to stream results.

stevendarby avatar Jul 07 '23 16:07 stevendarby

@roji ok great. Is the EF version tied to the .NET version? We've only just upgraded to .NET 6 so we won't be moving to 9 any time soon.

@stevendarby this query isn't returning any rows so I don't think that will matter

hisuwh avatar Jul 07 '23 18:07 hisuwh

one thing to be mindful of is that without the order by, you might start to see results appearing immediately in SSMS as it streams the results back [...]

It's worth mentioning that this in itself is a reason to remove those ORDER BYs. In other words, if EF forces the database to start delivering results much later (by asking for ordered results), then that in itself negatively impacts result latency (regardless of the overall time taken to process the query etc.).

roji avatar Jul 07 '23 19:07 roji

Hi, @roji ! I'm running .NET Core 7.0 and EF Core version 7.0.10. I've read all the topic about include and Order. this is my issue: the more Include and ThenInclude in my Queryable, the more unnecessary OrderBy. This is my code image

Let say it match the condition skip != -1 and count != -1, desc is false

This is query: image

InFact that I only want it orders: ORDER BY t.Created0

Please let me know If I'm wrong at any step ? Or this is some thing wrong with EF core ? Please give me more information about this ?

quan-nm-nexlesoft avatar Aug 10 '23 05:08 quan-nm-nexlesoft

@quan-nm-nexlesoft nothing is wrong with either EF or your code - the extra ORDER BYs are there by-design, and are necessary for the way EF currently loads related collections (i.e. every time you do Include/ThenInclude on a collection navigation). All this is detailed above, and this issue is about changing that to not require the orderings.

Regardless, I highly recommend reading this section of the docs, and possibly considering using split queries.

roji avatar Aug 10 '23 07:08 roji

Running exec sp_updatestats on our database has actually resolved the querying performance for us

hisuwh avatar Aug 10 '23 10:08 hisuwh

A year has passed, and the is still no way to remov the sorting? Why not add the NoOrderBy method? We use EF Core 7 + Oracle, and unnecessary sorting affects performance

dmitriy-shleht avatar Aug 10 '23 13:08 dmitriy-shleht

Why not add the NoOrderBy method?

Because that's not possible given the current way EF loads related entities - please read the above.

Note that in many cases where users think that the orderings are affecting performance, it is actually something else (frequently cartesian explosion). In any case, this is something I definitely hope to tackle for EF 9.

roji avatar Aug 10 '23 14:08 roji

Just adding my 2 cents since I had to deal with such problems in the past few weeks.

We have experienced multiple performance problems due to resource constrained SQL servers when deploying our application to on-premise (aka no resources and slow IO subsystem - nothing we can do about it as we don't manage the SQL servers in question), by loading the entities separately without includes we shaved off a significant chunk of processing away from the SQL server. We don't really care about the order until the data hits the application side or if we are doing pagination/selecting top 1.

Note: sharing only the order by part of the query plan due to nda reasons For example here we are loading sensor data with a single join to get the sensor, loading time with order by id took a whopping 0.677s image

After separating the query into two queries (load data + load sensor information separately): 0.022s, a 30x improvement and cpu time was significantly lower image

This might? be an extreme case, but it was one we had to deal with. Unless the tables in question have like 10 rows it might not be worth to do it by hand, but anything else it seems like we have to load them manually.

Good reading on the subject from a credible source: https://www.brentozar.com/archive/2019/10/how-to-think-like-the-sql-server-engine-adding-an-order-by/

eero-dev avatar Dec 18 '23 05:12 eero-dev

@eero-dev thanks. Your data unfortunately doesn't add much given that it's very partial and doesn't expose actual queries and plans being performed; we're definitely aware that ORDER BY can have quite a cost, and this issue is about removing that specifically when loading related entities e.g. with Include, where EF itself is the one adding the ORDER BY for its own purposes.

The link is definitely useful - Brent Ozar is always a great source of info!

roji avatar Dec 18 '23 09:12 roji

@eero-dev thanks. Your data unfortunately doesn't add much given that it's very partial and doesn't expose actual queries and plans being performed; we're definitely aware that ORDER BY can have quite a cost, and this issue is about removing that specifically when loading related entities e.g. with Include, where EF itself is the one adding the ORDER BY for its own purposes.

The link is definitely useful - Brent Ozar is always a great source of info!

Sorry for being unable to share any concrete data as it is confidential, I can provide a repro though if needed, which shows that even a single Include with the Order By is quite costly

eero-dev avatar Dec 18 '23 09:12 eero-dev

Note: consider removing the orderings for split query as well, not just for single query.

roji avatar Feb 20 '24 09:02 roji