EntityFramework.Docs
EntityFramework.Docs copied to clipboard
encourage WHERE even if splitting
- the code examples will bring back all the Blogs & Posts rows. Better that examples have a WHERE Rating>=4 or somesuch to show how that will also be present in the 2nd SELECT split.
- the "cartesian explosion" duplicates only applies when querying Posts with an INCLUDE("Blog") that will have dups of Blogs with multiple Posts, so I would suggest you revise the examples given.
- surely the EF code generation should impose a DISTINCT in the split statements to suppress such dups?
- I am surprised that the generated split code shown uses the b.BlogId and not the p.BlogId - yes I know they are the same per the ON clause but it rankles when you are really looking for all the p.* fields to match the EF entity definition [just requires more brain-cycles to validate the resulant match!]
Document details
⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.
- ID: 43baa9f3-c415-c328-6c10-53aabaaec652
- Version Independent ID: 4e789a1b-aa88-d194-dc2e-0a23251250d7
- Content: Single vs. Split Queries - EF Core
- Content Source: entity-framework/core/querying/single-split-queries.md
- Product: entity-framework
- Technology: entity-framework-core
- GitHub Login: @smitpatel
- Microsoft Alias: avickers
the code examples will bring back all the Blogs & Posts rows. Better that examples have a WHERE Rating>=4 or somesuch to show how that will also be present in the 2nd SELECT split.
Bringing back all Blogs & Posts or not seems very orthogonal to what's being explained here. My personal preference is to not burden the user with extra concepts (e.g. a WHERE clause, discussing of bringing back all or some rows), where a simpler sample will do; better to discuss one concept at a time. Besides, there certainly are valid scenarios where you really do want to bring back all the Blogs & Posts, so I'm not sure why we wouldn't show that here.
the "cartesian explosion" duplicates only applies when querying Posts with an INCLUDE("Blogs") that will have dups of Blogs with multiple Posts, so I would suggest you revise the examples given.
I don't follow. The sample shows a scenario where Blogs are queried, and their Posts are included. If the same Blog has two Posts, you get two rows back where the Blog properties are duplicated (that's "cartesian explosion"). What's missing or problematic?
surely the EF code generation should impose a DISTINCT in the split statements to suppress such dups?
Can you explain what you mean? The point is that if the same Blog has two Posts, you get two rows back where the Blog properties are duplicated. However, the Post portion of each row is different, so DISTINCT wouldn't do anything here (DISTINCT only works on entire rows).
the docs relate to an app doing var allblogs=ctx.Blogs.Include("Posts").OrderBy(b=>b.BlogId).ToList() and the Blogs columns would be repeated in the non-split case[you are correct]
I was thinking about the orthogonal [as you like that word!] case of var allposts=ctx.Posts.Include("Blog").OrderBy(p=>p.PostId).ToList() which likewise produces an NxM explosion for non-split. However in the split case I would expect gen code to be SELECT [p].[PostId], [p].[AuthorId], [p].[BlogId], [p].[Content], [p].[Rating], [p].[Title] FROM [Posts] AS [p] ORDER BY [p].[PostId]
SELECT DISTINCT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url] FROM [Blogs] AS [b] INNER JOIN [Posts] AS [p] ON [p].[BlogId] = [b].[BlogId] ORDER BY [b].[PostId]
Please note
- I omitted the 2nd BlogId in result columns (superfluous)
- note the Distinct so you don't get multiples of the Blog cols
Even if you choose to junk my other points (1-4) at the very least please change your example code from "Post" to "Posts" !
However in the split case I would expect gen code to be [...]
Just to be sure I understand, is this a proposal for an alternative translation for split querying, rather than any docs suggestion?
at the very least please change your example code from "Post" to "Posts"
Sure, do you want to submit a PR for that?
I still think sticking a WHERE clause is justified so devs see gen'ed code [imho it doesn't distract any more than the OrderBy clause] I haven't enabled logging [to see what current implementation code actually does] but I was concentrating on improving the docs rather than changing code implementation. I am about to go on short vacation so unfortunately can't explore code/docs further yet [ditto any PR for Posts] but will return in 1 week and will revisit if you haven't done it all by then [I will study #4044 with interest later!]. Thanks.
SELECT DISTINCT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url]
@DickBaker I'm not sure what the DISTINCT operator is supposed to do here: aren't these rows already guaranteed to be distinct? This 2nd query just selects out the matching rows from [Blogs], there are already are no "duplicates" there...
However, I've opened https://github.com/dotnet/efcore/issues/29174 for the idea of doing split queries when starting from Posts, thanks.
Doing the Post->Posts change as part of #4051.
what about that "Post"->"Posts" promise in #4051 that Uncle Arthur approved?
that GH page still shows "Post". Maybe its still in the pipeline and yet to reach the spotlight? https://github.com/dotnet/EntityFramework.Docs/blob/main/entity-framework/core/querying/single-split-queries.md#split-queries-1
Thanks for undertaking anyway as I didn't relish the responsibility to have raise a PR [no I don't fancy getting my name in lights as a past contributor!], especially as you are intimately involved in all the .NET MD syntax [that I am reticent about using as a novice] https://learn.microsoft.com/en-us/contribute/dotnet/dotnet-style-guide
regards, Dick
DISTINCT ...
in the case that I raised arising from var allposts=ctx.Posts.Include("Blog").OrderBy(p=>p.PostId).ToList()
where you have [say] 1 Blogs row and 2 child Posts the 2nd split would surely surface parent Blog data in duplicate (because of the innerjoin to the 2 child rows) ? hence my assertion that you need the DISTINCT operator. Did I miss something [sorry I still haven't cobbled up the LOG to answer the Q] ?
back on the subject of plural/singular table names [often argued by zealots to wee small hours, usually involving beer] we are talking about "Post" vs "Posts" [I have always favoured plural in db tables and "DbSet<Post> Posts;" with the actual class object being Post, and I agree with EF's singular for linktables like PostTag] ..
reading new issue #4053 and the related docs https://github.com/dotnet/EntityFramework.Docs/blob/main/entity-framework/core/modeling/relationships.md
I feel that this definitive source of truth should broach that subject of plural/singular [or at lease have an offpage link to such discussion/rationale].
Yes you would be right if you called me a zealot, but I would shun the term bigot (no, don't ask me for precise definitions of those 2 terms!] But of course I would also be pleased to share beers with you at an opportune moment on this/any subject.
Re DISTINCT, you're right, I somehow managed to misread your SQL - DISTINCT would indeed remove duplicate Blog rows when the same Blog has more than one Post. It's worth mentioning that DISTINCT adds quite a bit of extra work on the database - in many cases just getting the multiple rows may be more efficient (depending on the average cardinality and the size of the duplicated Blog propertieS).
Re Post/Posts, #4051 isn't yet merged (and even after it is merged, changes are only pushed to live every now and then). For such small changes, we generally recommend to just submit a PR - there's not even any actual markdown involved here.
Having agreed with the PostTag usage found here https://github.com/dotnet/EntityFramework.Docs/blob/main/entity-framework/core/modeling/relationships.md
I now find a rival "PostTags" usage here https://github.com/dotnet/EntityFramework.Docs/blob/main/samples/core/Modeling/Relationships/FluentAPI/ManyToManyShared.cs?name=ManyToManyShared
so I urge you to ensure consistent message. BTW I have found no explicit indication as to whether the PK of such a join table is (PostId,TagId) or (TagId,PostId) nor whether devs should consider enforcing an additional IX (TagId) or (PostId) respectively [nor how to specify this].
Likewise I am not satisfied that the docs adequately describe the link-table situations of (FKA,FKB) vs the additiional payload columns [and personally I have difficulty accepting Arthur's Dictionary<string,object> flattening as I suspect boxing and loose-typing [but this may be ignorance on my side!]
The changes made EF6->EFCore over complex types (e.g. Address) are also missing.
Re DISTINCT I strongly lean to making the db do the work rather than experience elongated network traffic plus client-side EF having to parse+apply each record. We are both experienced perf practitioners and professors of "it depends", but AFAIK the QP to add the Distinct may add another Sort step but judicial choice of IX (as prev para) should make this cheap.
Reminder that my comments largely to the docs rather than EF code (with possible exception of that DISTINCT wrinkle] BTW I thoroughly support your Issue #29174 to the EF codebase
Your opinions welcome as ever!
Re DISTINCT I strongly lean to making the db do the work rather than experience elongated network traffic plus client-side EF having to parse+apply each record.
I'd be very wary of such a sweeping statement; it's generally preferable to remove load from the database, since it's typically much easier to scale the app tier vs. the database tier (i.e. you usually have multiple app servers to a single database server). DISTINCT has quite an impact on how the database constructs the query plan, and generally requires lots of work (see https://github.com/dotnet/efcore/issues/29171 for some similar thoughts on EF's injection of orderings for single query). Outside of EF, I'd carefully take into account actual cardinality and column sizes before adding a DISTINCT; unfortunately, with EF we must make a decision to either do it or not.
Regardless of all that, posting the multiple unrelated doc suggestions in the same issue makes it harder for us to track work and discuss each one. I'd urge opening different issues for different problems, and considering submitting PRs fixing issues, especially when they're small.
Ping @roji
I've gone through this, and I don't think there's anything actionable left here... The Post->Posts change has been merged and applied, and I've just submitted #4201 to generally provide more information on cartesian explosion vs. data duplication.
I do find interesting the idea of adding DISTINCT to the split query SQL, but I'm still not convinced that should be the default behavior. Data duplication doesn't tend to be a big problem - unless huge columns exist on the principal; and adding DISTINCT creates lots of extra work to the database, which in many (and possibly most) cases doesn't justify the network traffic optimization.
So I'm going to go ahead and close this, but as always am happy to discuss further.