
Consider changing the default handling of parameterized collections (back to constantization, RECOMPILE, inline collection of parameters...)

roji opened this issue 1 year ago • 23 comments

In 8.0, we changed the default translation for parameterized lists from constantization to parameterization (e.g. with OPENJSON); this is better for plan/query caching (single SQL), but can produce bad query plans (#32394). For 9, we're adding extensive means to control the parameterization vs. constantization behavior, both at the global context options level (#34344) and at the per-collection level (EF.Constant, EF.Parameter (#34345)).
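
For illustration, a minimal sketch of the per-collection opt-ins mentioned above (EF 9 syntax, assuming a context with a Blogs set and a Blog entity with an int Id):

```csharp
using System.Collections.Generic;
using Microsoft.EntityFrameworkCore;

var ids = new List<int> { 1, 2, 3 };

// Force inline constants for this collection only (the pre-8.0 translation):
var withConstants = await context.Blogs
    .Where(b => EF.Constant(ids).Contains(b.Id))
    .ToListAsync();

// Force a single parameter for this collection only (e.g. OPENJSON on SQL Server):
var withParameter = await context.Blogs
    .Where(b => EF.Parameter(ids).Contains(b.Id))
    .ToListAsync();
```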

For 10, we plan to reexamine what the default should be. The problem with the current full parameterization is that the bad plans it occasionally causes can create a significant perf regression, with queries that previously executed near-instantaneously (with constants) now taking seconds or more. In contrast, the performance issues caused by constantization are much more consistent and limited (continuous re-planning, query cache bloat leading to possibly less buffer cache memory, and premature eviction of other query plans).

In addition, the main problem with constantization stems from SQL Server's behavior of caching query plans on the very first execution. Other databases do not typically do this (e.g. for PostgreSQL, Npgsql can be configured to prepare/cache only after X executions), and SQL Server itself has the RECOMPILE query hint for not caching, as well as an ad-hoc workload mode (see also this blog post), which makes SQL Server behave better around single-use queries (though this is a global server option). So the plan cache bloat seems solvable for SQL Server (and may not exist for other databases); the remaining issue is only the constant re-planning caused by different SQLs, but this is likely to be minor compared to the better plans produced thanks to the database's visibility into the cardinality.
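
As an aside, here is a sketch of how one can already opt into RECOMPILE today via an EF command interceptor (illustrative only; it naively appends the hint to every query and assumes no existing OPTION clause):

```csharp
using System.Data.Common;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore.Diagnostics;

public class RecompileInterceptor : DbCommandInterceptor
{
    // Appends OPTION (RECOMPILE) so SQL Server plans the query fresh each
    // time and doesn't cache the plan.
    public override InterceptionResult<DbDataReader> ReaderExecuting(
        DbCommand command, CommandEventData eventData, InterceptionResult<DbDataReader> result)
    {
        command.CommandText += " OPTION (RECOMPILE)";
        return result;
    }

    public override ValueTask<InterceptionResult<DbDataReader>> ReaderExecutingAsync(
        DbCommand command, CommandEventData eventData, InterceptionResult<DbDataReader> result,
        CancellationToken cancellationToken = default)
    {
        command.CommandText += " OPTION (RECOMPILE)";
        return new ValueTask<InterceptionResult<DbDataReader>>(result);
    }
}

// Registration: optionsBuilder.AddInterceptors(new RecompileInterceptor());
```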

So a comprehensive cross-database investigation into both plan caching/bloat and the impact of parameterization/constantization is needed here.

Note that instead of returning to full constantization by default (the pre-8.0 behavior), we could choose to switch to an inline collection of parameters.

To summarize... For translating the following:

var ids = ...; // List of ids from somewhere
var blogs = await context.Blogs.Where(b => ids.Contains(b.Id)).ToListAsync();

... here are our options:

-- 1) Full parameterization (current translation):
SELECT * FROM foo WHERE x IN (SELECT v FROM OPENJSON(@values))

-- 2) Simple constantization (pre-8.0 translation):
SELECT * FROM foo WHERE x IN (1, 2, 3);

-- 3) Simple constantization with RECOMPILE on SQL Server (no plan cache bloat):
SELECT * FROM foo WHERE x IN (1, 2, 3) OPTION (RECOMPILE);

-- 4) Inline collection with 3 parameters:
SELECT * FROM foo WHERE x IN (@p1, @p2, @p3);

-- 5) Inline collection with 3 parameters, with bucketization (optimization for Contains only):
SELECT * FROM foo WHERE x IN (@p1, @p2, @p3, @p4, @p5); -- @p4 and @p5 contain the same duplicated value as @p3
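
To illustrate option 5, here is a hypothetical sketch of the padding step (the helper name and bucket sizes are made up): pad the parameter list up to the next bucket size by repeating the last value, so that collections of similar sizes share a single SQL string and therefore a single cached plan.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// [1, 2, 3] with buckets [5, 10, 20] pads to [1, 2, 3, 3, 3], so 1- to
// 5-element lists all produce IN (@p1, @p2, @p3, @p4, @p5) and share a plan.
var padded = PadToBucket(new[] { 1, 2, 3 }, new[] { 5, 10, 20 });

// Hypothetical helper: duplicates are harmless for Contains/IN semantics.
static List<T> PadToBucket<T>(IReadOnlyList<T> values, int[] buckets)
{
    var size = buckets.First(b => b >= values.Count); // throws if no bucket is large enough
    var result = new List<T>(values);
    while (result.Count < size)
        result.Add(values[^1]);
    return result;
}
```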

roji avatar Aug 03 '24 07:08 roji

In addition, the main problem with constantization stems from SQL Server's behavior of caching query plans on the very first execution.

Is this not mitigated via Parameter Sensitive Plan optimization?

IanKemp avatar Aug 05 '24 11:08 IanKemp

@IanKemp how so? Constantization (which you're quoting) means you don't have parameters at all in the query, only constants; so different collection values mean different SQLs, and therefore different plans (so plan bloat).

roji avatar Aug 05 '24 12:08 roji

It would help me a lot to have a Roslyn analyzer in Visual Studio that triggers a warning when a call to Contains on a string array without EF.Collation or EF.Constant is detected in a Where clause.

kamiyn avatar Oct 10 '24 02:10 kamiyn

@kamiyn doing Contains on a string array without EF.Collation or EF.Constant is in many cases perfectly fine, so a warning isn't appropriate for that. In any case, if we do this issue in EF 10, EF.Constant would no longer be needed in this context.

roji avatar Oct 10 '24 10:10 roji

We recently had an issue with that parameterization approach on a large table (1m+ records). We found out that, in that select v from OPENJSON(@v), deserialization to an array was happening with each comparison. When trying with a CTE, it was way faster. For two collections (where X in col1 and Y in col2), a request that was taking around 25s went down to 1.2s.

So the question: why aren't you considering this kind of request? Does it have some drawbacks? WITH values AS (SELECT v FROM OPENJSON(@values)) SELECT * FROM foo INNER JOIN values ON values.v = x

Sshnyari avatar Feb 08 '25 11:02 Sshnyari

deserialization to an array was happening with each comparison

That doesn't make sense to me, how do you know that's the case?

roji avatar Feb 08 '25 13:02 roji

That's what we understood from the execution plan: most of the time was spent running that OPENJSON function. Maybe our understanding was wrong, but the fact is, changing it to a CTE improved performance by a lot. By the way, the array that was deserialized was quite big.

Sshnyari avatar Feb 09 '25 23:02 Sshnyari

Maybe our understanding was wrong, but the fact is, changing it to a CTE improved performance by a lot.

Can you put together a quick BenchmarkDotNet benchmark showing this?
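
(For reference, here is a minimal sketch of the kind of benchmark being asked for; the connection string and the two SQL texts are placeholders to fill in:)

```csharp
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.Data.SqlClient;

public class OpenJsonVsCteBenchmark
{
    // Placeholders - substitute the real connection string and exact SQL texts.
    private const string ConnectionString = "<connection string>";
    private const string OpenJsonSql = "<EF-generated OPENJSON query>";
    private const string CteSql = "<hand-written CTE query>";

    [Benchmark(Baseline = true)]
    public Task OpenJson() => Execute(OpenJsonSql);

    [Benchmark]
    public Task Cte() => Execute(CteSql);

    private static async Task Execute(string sql)
    {
        await using var connection = new SqlConnection(ConnectionString);
        await connection.OpenAsync();
        await using var command = new SqlCommand(sql, connection);
        await using var reader = await command.ExecuteReaderAsync();
        while (await reader.ReadAsync()) { } // drain all rows
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<OpenJsonVsCteBenchmark>();
}
```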

roji avatar Feb 09 '25 23:02 roji

I cannot copy/paste the results of the benchmark directly, since I can't contribute from my company's GitHub account. With a short run I got:

  • 25.352s with the request that is generated by EF core
  • 3.140s with the CTE request

Here are the parameters I tested with a few days ago:

  • EF core 8.0.11
  • SQL on azure (SQL server 12.0.2000.8)
  • a table with 1.2m records having the following columns: Id, ExtId1, ExtId2, Date, Data (JSON column)
  • we have indexes on all columns (except data) + an index on Date, ExtId1 and ExtId2 with INCLUDE(Id, Data)
  • first collection is a list with 700 items of IDs of ExtId1
  • second collection is a list with 14k items of IDs of ExtId2
  • the exact request that is generated by EF Core 8 is something like SELECT t.Id, t.Data FROM Table t WHERE t.Date <= @date1 AND t.Date > @date2 AND EXISTS (SELECT 1 FROM OPENJSON (@extids1) WITH (value nvarchar(max) '$') AS a WHERE a.value = t.ExtId1) AND EXISTS (SELECT 1 FROM OPENJSON (@extids2) WITH (value nvarchar(max) '$') AS b WHERE b.value = t.ExtId2)
  • the CTE request is WITH ext1 AS (SELECT value FROM OPENJSON (@extids1)), ext2 AS (SELECT value FROM OPENJSON (@extids2)) SELECT t.Id, t.Data FROM Table t INNER JOIN ext1 a ON a.value = t.ExtId1 INNER JOIN ext2 b ON b.value = t.ExtId2 WHERE t.Date <= @date1 AND t.Date > @date2
  • I ran the benchmark with EF core 9.0.1, but that doesn't really matter because I used raw queries
  • from my machine, transfer time is higher, and that was included in the benchmarking results

I also tried to play a bit with indexes. The CTE request is hugely affected by indexes, while the other one doesn't change regardless of the indexes we have.

Sshnyari avatar Feb 10 '25 17:02 Sshnyari

Those are very different queries, regardless of the CTEs: the first one uses OPENJSON(@extids1) WITH (value nvarchar(max) '$'), whereas the second one uses OPENJSON(@extids1) - without the WITH clause; that can be essential to how the query is planned, and I strongly suspect it is the reason for the performance difference. In other words, it's likely that this has nothing to do with the CTE.

To actually check whether the CTE is important, you need to compare SQLs that are otherwise identical, differing only by the use/non-use of the CTE.

roji avatar Feb 10 '25 23:02 roji

Sorry, I didn't think it made a difference. That was just something I omitted when writing the ticket from my phone. I actually ran both requests with OPENJSON(@extids1) WITH (value nvarchar(max) '$').

Sshnyari avatar Feb 11 '25 12:02 Sshnyari

@Sshnyari it's very hard to work with partial, approximate SQL snippets which may or may not be accurate... I'd be interested in investigating this further, but as I wrote above, if you want to help with this, we need a runnable benchmark. Otherwise you're basically asking me to recreate everything based on the approximate information you've posted above; and since performance work has to be extremely precise, there's a very good chance I'd get something a little bit different from your setup, and it wouldn't reproduce.

roji avatar Feb 11 '25 12:02 roji

Below is a table with some databases and how they handle implicit caching of execution plans. I also added a column for explicit prepare, for completeness.

Only MS SQL, Oracle, IBM DB2 and Firebird (starting with version 5) do implicit caching of execution plans, hence theoretically suffer from bloat when constants are used. Firebird does not have a RECOMPILE equivalent, and based on my research, neither do Oracle and IBM DB2.

| Database | Global Cache | Explicit Prepare Possible |
| --- | --- | --- |
| MS SQL | Yes | N/A |
| Postgres | No | Yes |
| MySQL | No | Yes |
| Firebird | Yes (FB 5) / No | Yes |
| MariaDB | No | Yes |
| Oracle | Yes | N/A |
| SQLite | No | Yes |
| IBM DB2 | Yes | N/A |

cincuranet avatar Apr 10 '25 08:04 cincuranet

Thanks @cincuranet. For databases which auto-cache plans ("global cache") and which don't have an opt-out like RECOMPILE, it may be good to confirm that their EF provider can control the default behavior (i.e. use something like JSON rather than inline constants).

roji avatar Apr 10 '25 11:04 roji

Both Oracle and IBM DB2 have a JSON_TABLE (JSON_QUERY, ...) function that does a similar thing to OPENJSON. Firebird 5 does not have any JSON functions; these are expected in Firebird 6.

cincuranet avatar Apr 10 '25 12:04 cincuranet

OK. The questions in my mind here are:

  1. Are Oracle and/or DB2 using JSON today to transfer scalar (primitive) collections, or are they just inlining (which is the default if they haven't done any work)?
  2. If we switch back to inline constants by default (as this issue tracks), do these providers still have the option of saying "no, I actually want to use JSON by default", because of the plan bloating problem? I still suspect the better decision would be to use inline constants (since the plan bloat problem is generally less bad than the bad plans we get with JSON collections, which the planner can't see into). But ideally it should still be an option.

roji avatar Apr 10 '25 13:04 roji

  1. Oracle (Oracle.EntityFrameworkCore) is using IN today; IBM DB2 (IBM.EntityFrameworkCore) is using IN today (still targeting EF 8).
  2. I'm not sure I follow. Can you elaborate? You mean like a knob exposed to providers to set what the default should be?

cincuranet avatar Apr 10 '25 16:04 cincuranet

I'm not sure I follow. Can you elaborate? You mean like a knob exposed to providers to set what the default should be?

Yeah - I'm basically saying that a provider should ideally be able to decide what its default here should be, because different databases may have different characteristics here. I don't think it would necessarily be yet another knob; it's more about ensuring that if the user doesn't specify a default (via the user-facing context option we already have), the provider can still pick the behavior it wants in code (which really should already be the case - it's more about verifying).
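
(For reference, the user-facing global default mentioned here shipped in EF 9 roughly as sketched below; this is only an illustration, and the exact shape may still change for 10:)

```csharp
using Microsoft.EntityFrameworkCore;

public class Blog
{
    public int Id { get; set; }
}

public class BloggingContext : DbContext
{
    public DbSet<Blog> Blogs => Set<Blog>();

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        => optionsBuilder.UseSqlServer(
            "<connection string>",
            // EF 9 relational option: translate parameterized collections to
            // inline constants; TranslateParameterizedCollectionsToParameters()
            // selects the opposite default.
            sqlOptions => sqlOptions.TranslateParameterizedCollectionsToConstants());
}
```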

roji avatar Apr 14 '25 07:04 roji

I think this is a perfect use for the IDbContextOptionsExtension.ApplyDefaults method. On RelationalOptionsExtension we already have the ParameterizedCollectionTranslationMode? ParameterizedCollectionTranslationMode property. Together these look like a great place for a provider to set the default behavior, if required.

cincuranet avatar Apr 14 '25 08:04 cincuranet

The PR so far looks promising.

It would be nice if we could change it per query, for example via a second parameter: EF.Parameter(var, ParameterizedCollectionTranslationMode.ParameterizeExpanded)

rgroenewoudt avatar Jun 16 '25 09:06 rgroenewoudt

@rgroenewoudt that's indeed something we're considering.

roji avatar Jun 16 '25 10:06 roji

For our internal ORM we recently switched from using Table-Valued Parameters to a combination of parameters with bucketization and OPENJSON. Our learnings thus far:

  • SQL Server generates bad query plans for IN (@p1, @p2, ...) OR IN (@p3, @p4, ...), so you might want to disable the parameter optimization when there is an OR and always use OPENJSON in that case.
  • Some other queries regressed with IN (@p1, @p2, ...) and we could fix it by enforcing OPENJSON (So +1 for providing an option for this in EF Core! We use EF Core too and if this is not supported, we will have to implement this ourselves)
  • Some queries regressed due to OPENJSON, which in most cases we fixed by adding OPTION (LOOP JOIN, MAXDOP 1). This however should not be relevant for EF Core since currently OPENJSON is used, so there can't be any regression.

m-gasser avatar Jun 16 '25 12:06 m-gasser

@m-gasser can you provide some specifics on the cases where you saw worse query plans for IN (@p1, @p2, ...) compared to OPENJSON?

roji avatar Jun 16 '25 13:06 roji

@m-gasser can you provide some specifics on the cases where you saw worse query plans for IN (@p1, @p2, ...) compared to OPENJSON?

Sorry @roji for not getting back to you earlier. We were still experimenting with various options back then, and I wanted to avoid communicating something that would soon become obsolete. However, we found no good alternative, and we now always use OPENJSON when multiple IN statements are on different OR branches.

One of our DBAs created the following repro to show the issue. With the separate parameters, SQL Server decides to use a scan, whilst with OPENJSON a seek is used, which performs better. We did not stumble over problematic queries when IN(...) AND IN(...) is used, but in the few places where IN(...) OR IN(...) is used, parameterization seems to be consistently worse than OPENJSON.

use tempdb;

set nocount on;

drop table if exists dbo.test;

create table dbo.test (
 id int identity not null,
 int_col1 int null,
 int_col2 int null,
 filler char(500) not null default 'a'
);

insert into dbo.test(filler) values('a');
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;
insert into dbo.test(filler) select filler from dbo.test;

update dbo.test
set int_col1 = id, int_col2 = id;

create unique clustered index ucix on dbo.test(id);
create nonclustered index ix_col1 on dbo.test(int_col1);
create nonclustered index ix_col2 on dbo.test(int_col2);

update statistics dbo.test;

declare
 @int_col1_0  int=1
  , @int_col1_1  int=11
  , @int_col1_2  int=111
  , @int_col1_3  int=1111
  , @int_col1_4  int=2
  , @int_col1_5  int=22
  , @int_col1_6  int=222
  , @int_col1_7  int=2222
  , @int_col1_8  int=3
  , @int_col1_9  int=33
  , @int_col1_10 int=333
  , @int_col1_11 int=3333
  , @int_col1_12 int=4
  , @int_col1_13 int=44
  , @int_col1_14 int=444
  , @int_col1_15 int=4444
  , @int_col1_16 int=5
  , @int_col1_17 int=55
  , @int_col1_18 int=555
  , @int_col1_19 int=5555

  , @int_col2_0  int=1
  , @int_col2_1  int=11
  , @int_col2_2  int=111
  , @int_col2_3  int=1111
  , @int_col2_4  int=2
  , @int_col2_5  int=22
  , @int_col2_6  int=222
  , @int_col2_7  int=2222
  , @int_col2_8  int=3
  , @int_col2_9  int=33
  , @int_col2_10 int=333
  , @int_col2_11 int=3333
  , @int_col2_12 int=4
  , @int_col2_13 int=44
  , @int_col2_14 int=444
  , @int_col2_15 int=4444
  , @int_col2_16 int=5
  , @int_col2_17 int=55
  , @int_col2_18 int=555
  , @int_col2_19 int=5555
  
  , @json1 nvarchar(4000) = N'[1,2,3,4,5,11,22,33,44,55,111,222,333,444,555,1111,2222,3333,4444,5555]'
  , @json2 nvarchar(4000) = N'[1,2,3,4,5,11,22,33,44,55,111,222,333,444,555,1111,2222,3333,4444,5555]'
;

select
 *
from
 dbo.test as t
where 
 (t.int_col1 in (@int_col1_0,@int_col1_1,@int_col1_2,@int_col1_3,@int_col1_4,@int_col1_5,@int_col1_6,@int_col1_7,@int_col1_8,@int_col1_9,@int_col1_10,@int_col1_11,@int_col1_12,@int_col1_13,@int_col1_14,@int_col1_15,@int_col1_16,@int_col1_17,@int_col1_18,@int_col1_19) and t.int_col1 is not null)
 or (t.int_col2 in (@int_col2_0,@int_col2_1,@int_col2_2,@int_col2_3,@int_col2_4,@int_col2_5,@int_col2_6,@int_col2_7,@int_col2_8,@int_col2_9,@int_col2_10,@int_col2_11,@int_col2_12,@int_col2_13,@int_col2_14,@int_col2_15,@int_col2_16,@int_col2_17,@int_col2_18,@int_col2_19) and t.int_col2 is not null)
;

select
 *
from
 dbo.test as t
where 
 (t.int_col1 in (select [value] from openjson(@json1) with(value int '$')) and int_col1 is not null)
 or (t.int_col2 in (select [value] from openjson(@json2) with(value int '$')) and int_col2 is not null)
;

drop table if exists dbo.test;

m-gasser avatar Jul 22 '25 17:07 m-gasser

@m-gasser thanks for the info - we're heads down at the moment finalizing the 10 release, but this is valuable. In any case, in 10 users will be able to select the collection parameterization mode both at the global default level and on a per-query level; this is unfortunately too complex a subject and EF simply lacks the appropriate information to make the right choice, so users will have to intervene sometimes for optimal performance.

roji avatar Jul 23 '25 07:07 roji

Thanks for providing the per-query option; it will be very useful for us to be able to fine-tune queries if our DBAs notice an issue!

this is unfortunately too complex a subject and EF simply lacks the appropriate information to make the right choice, so users will have to intervene sometimes for optimal performance.

Sounds reasonable. The right place to fix this would be the database anyway, not the ORM. I wish the SQL Server team would give fixing rough corners like this much higher priority.

m-gasser avatar Jul 23 '25 08:07 m-gasser