efcore icon indicating copy to clipboard operation
efcore copied to clipboard

Transform multiple ORs into a single SQL IN

Open roji opened this issue 1 year ago • 8 comments

We've recently been focusing a bit on reducing duplication of expressions in our generated SQL, especially around null semantics. For example, translating to "x IS NOT DISTINCT FROM y" instead of x = y OR (x IS NULL AND y IS NULL) would avoid duplicating x and y in the expression (https://github.com/dotnet/efcore/issues/29624); this is valuable especially when x/y aren't simple columns/parameters/scalars, but rather complex arbitrary expressions which may be expensive to evaluate. As another example, #12634 tracks transforming x >= y AND x <= z to x BETWEEN y AND z.

We could do the same by transforming multiple disjunctions into a single SQL IN (i.e. x = 3 OR x = 4 becomes x IN (3, 4); this would allow evaluating x only once, at least in some databases. This is essentially the same idea as #12634 for BETWEEN, and could probably even be implemented at the same time.

Note that the same caveats apply here as for BETWEEN - impure expressions should in theory not be collapsed together (though we don't currently handle such aspects in general), and identifying duplicated expression requires deep comparison, which can be expensive (#34149 would fix that).

/cc @ranma42

roji avatar Aug 22 '24 09:08 roji

Some other things to consider:

  • we should be careful about NULL behavior (the transformation is always safe if we do it after the nullability semantics transformation; otherwise it requires handling the NULL case)
  • as a first step we might want to restrict the cases we handle to comparisons against constants and/or parameters (aka avoid decisions for x = y OR x = z OR y = z)
  • it looks like in several cases we want to inspect several clauses in AND or OR together (this is more of a thought for a could-be intermediate representation); note that these operations are slightly different in C# and SQL (C# has guaranteed short-circuit semantics; SQL makes no such guarantee); in the case of SQL semantics we might be able to re-order and de-duplicate as we wish (which is basically the no-side-effects assumptions above)

ranma42 avatar Aug 22 '24 09:08 ranma42

we should be careful about NULL behavior (the transformation is always safe if we do it after the nullability semantics transformation; otherwise it requires handling the NULL case)

I'm not completely sure about this...

Thinking about it, this is actually something we may even want to do in preprocessing, as a universal optimization (not SQL-specific), where we'd transform comparisons in the pre-translated LINQ expression tree (a == b || a == c -> new[] { b, c }.Contains(a)). One problem with doing it like this in pre-processing, is that we can't cache the hashcode for optimizaing the necessary deep comparison.

But even if we do it in post-processing, what exact problem do you see with NULLs here? Isn't the semantics of multiple ORs identical to the semantics of SQL IN even before SqlNullabilityProcessor?

roji avatar Aug 22 '24 11:08 roji

we should be careful about NULL behavior (the transformation is always safe if we do it after the nullability semantics transformation; otherwise it requires handling the NULL case)

I'm not completely sure about this...

Thinking about it, this is actually something we may even want to do in preprocessing, as a universal optimization (not SQL-specific), where we'd transform comparisons in the pre-translated LINQ expression tree (a == b || a == c -> new[] { b, c }.Contains(a)). One problem with doing it like this in pre-processing, is that we can't cache the hashcode for optimizaing the necessary deep comparison.

Why can't we cache that? Is it only because of the SELECT mutability? Or are you thinking about issues with nullable parameters?

But even if we do it in post-processing, what exact problem do you see with NULLs here? Isn't the semantics of multiple ORs identical to the semantics of SQL IN even before SqlNullabilityProcessor?

I believe that the issue is the usual one: equality (and consequently inclusion) change meaning over time. Specifically, a == b || a == null succeeds in C# for a that is null, while a IN (b, NULL) is not satisfied (in SQL) if a is NULL. OTOH a == b || a == c and new[] { b, c }.Contains(a) are equivalent in C#, so it might not be a problem. Regardless, we should ensure that new[] { b, c, null }.Contains(a) is converted to a IN (b, c) OR a IS NULL in SQL (well... we could have some tricks to avoid duplicating a in this case: IIUC COALESCE(a IN (b, c), TRUE) would work as intended)

ranma42 avatar Aug 22 '24 12:08 ranma42

Why can't we cache that? Is it only because of the SELECT mutability? Or are you thinking about issues with nullable parameters?

I just mean that in preprocessing (which happens before we translate to SQL, there's no SELECT or anything yet), we're still dealing with the built-in LINQ expression node types, e.g. BinaryExpression as opposed to our SqlBinaryExpression; and we can't modify that expression in order to do the caching...

I believe that the issue is the usual one: equality (and consequently inclusion) change meaning over time. Specifically, a == b || a == null succeeds in C# for a that is null, while a IN (b, NULL) is not satisfied (in SQL) if a is NULL.

So again, assuming we do the transformation in post-processing - before SqlNullabilityProcessor - we'd be transforming WHERE a = 1 OR a = 2 (as nested SqlBinaryExpressions) to WHERE a IN (1, 2). SqlNullabilityProcessor would later run on this and perform the regular nullability expansion for InExpression (see answer below). So I think everything should be fine here.

In fact, the transformation would also pick up our IS NULL representation (which is represented via SqlUnaryExpression), and integrate that into the resulting InExpression as well, so from WHERE a = 1 OR a IS NULL you'd get WHERE as IN (1, NULL), which would then get expanded out in SqlNullabilityProcessor (unless UseRelationalNulls is on).

Regardless, we should ensure that new[] { b, c, null }.Contains(a) is converted to a IN (b, c) OR a IS NULL in SQL (well... we could have some tricks to avoid duplicating a in this case: IIUC COALESCE(a IN (b, c), TRUE) would work as intended)

This should already be the case - I worked on doing full nullability semantics for InExpression in EF Core 8.0, it's probably one of the more complex parts of SqlNullabilityProcessor. There may be bugs there (or opportunities for improvement) but that's orthogonal to this issue, I think.

roji avatar Aug 22 '24 12:08 roji

Note that we already have an inferior version of this proposal implemented today in SqlExpressionSimplifyingExpressionVisitor (except that it only works on ColumnExpressions, and only for equality comparisons directly within the same OR node). Note also that we have a bug where we incorrectly deduplicate values if they're equal in .NET; but equality in the database may evaluate differently (e.g. case-insensitive collation), so we shouldn't do that (see https://github.com/dotnet/efcore/issues/34862#issuecomment-2403632039).

roji avatar Oct 09 '24 23:10 roji

Following an internal conversation, the following shows that ~SQL Server~ PostgreSQL doesn't automatically

CREATE TABLE Data(Id INTEGER NOT NULL PRIMARY KEY, Foo int NOT NULL);
EXPLAIN SELECT * FROM Data WHERE Foo IN (3, 4);
EXPLAIN SELECT * FROM Data WHERE Foo = 3 OR Foo = 4;

This gives the following two plans:

+----------------------------------------------------+
|QUERY PLAN                                          |
+----------------------------------------------------+
|Seq Scan on data  (cost=0.00..38.25 rows=23 width=8)|
|  Filter: (foo = ANY ('{3,4}'::integer[]))          |
+----------------------------------------------------+

+----------------------------------------------------+
|QUERY PLAN                                          |
+----------------------------------------------------+
|Seq Scan on data  (cost=0.00..43.90 rows=23 width=8)|
|  Filter: ((foo = 3) OR (foo = 4))                  |
+----------------------------------------------------+

In other words, on PostgreSQL, IN isn't executed in the same way as the equivalent decomposed ORs, and is slightly more efficient.

One possible objection against doing collapsing multiplle ORs to a single IN in EF, is that the user has a means of expressing IN directly, via LINQ Contains, and we generally don't make an effort to optimize badly-written LINQ queries. This is in contrast to the proposed BETWEEN optimization (#12634), where there's no C# way to express it. A counter-argument would be that EF itself (rather than the user) could internally produce the multiple ORs, at which point this optimization is useful, though that might be a bit far-fetched.

roji avatar Oct 10 '24 08:10 roji

Following an internal conversation, the following shows that SQL Server doesn't automatically

[...]

In other words, on PostgreSQL, IN isn't executed in the same way as the equivalent decomposed ORs, and is slightly more efficient.

Is this PostgreSQL or SqlServer?

One possible objection against doing collapsing multiplle ORs to a single IN in EF, is that the user has a means of expressing IN directly, via LINQ Contains, and we generally don't make an effort to optimize badly-written LINQ queries. This is in contrast to the proposed BETWEEN optimization (#12634), where there's no C# way to express it. A counter-argument would be that EF itself (rather than the user) could internally produce the multiple ORs, at which point this optimization is useful, though that might be a bit far-fetched.

Another case that cannot easily be expressed right now is the "row value equality" (like in https://github.com/dotnet/efcore/issues/26822 with equals) but I am not even sure if that is supposed to be included in this issue 😅

ranma42 avatar Oct 14 '24 07:10 ranma42

Is this PostgreSQL or SqlServer?

Sorry, PostgreSQL - corrected the above.

Another case that cannot easily be expressed right now is the "row value equality"

I think that one should be possible (but isn't currently supported) via new ValueTuple(x,y).Equals(new ValueTuple(a,b). Here I really doubt there's any advantage in the row value comparison compared to the decomposed multiple equalities, though...

roji avatar Oct 14 '24 15:10 roji

we should be careful about NULL behavior (the transformation is always safe if we do it after the nullability semantics transformation; otherwise it requires handling the NULL case)

But even if we do it in post-processing, what exact problem do you see with NULLs here? Isn't the semantics of multiple ORs identical to the semantics of SQL IN even before SqlNullabilityProcessor?

I believe that the issue is the usual one: equality (and consequently inclusion) change meaning over time. Specifically, a == b || a == null succeeds in C# for a that is null, while a IN (b, NULL) is not satisfied (in SQL) if a is NULL. OTOH a == b || a == c and new[] { b, c }.Contains(a) are equivalent in C#, so it might not be a problem. Regardless, we should ensure that new[] { b, c, null }.Contains(a) is converted to a IN (b, c) OR a IS NULL in SQL (well... we could have some tricks to avoid duplicating a in this case: IIUC COALESCE(a IN (b, c), TRUE) would work as intended)

I think you have that backwards. a IN (b, NULL) works the same as C#, because it compiles out to a = b OR a = NULL which then becomes a = b OR UNKNOWN, which then simply becomes a = b.

It's the NOT case you need to worry about: in C#, !(a == b || a == null) works via de Morgan's Laws the same as a != b && a != null, but in SQL this is going to mess up: a NOT IN (b, NULL) compiles to NOT (a = b OR a = NULL), that's the same with de Morgan as a <> b AND a <> NULL, which then becomes a <> b AND UNKNOWN which then becomes just UNKNOWN always fails a WHERE predicate.

Further reading: NULL values inside NOT IN clause

Charlieface avatar Jun 10 '25 17:06 Charlieface

I think you have that backwards. a IN (b, NULL) works the same as C#, because it compiles out to a = b OR a = NULL which then becomes a = b OR UNKNOWN, which then simply becomes a = b.

@Charlieface I am not fully understanding your statement; I am afraid I am missing some context. When you say that the SQL expression a IN (b, NULL) works the same as C#, what is the C# expression you are matching it against?

Note that my example was comparing:

  • in C# a == b || a == null which is true when a is null
  • in SQL a IN (b, NULL) , which evaluates to NULL in SQL when a is NULL

ranma42 avatar Jun 11 '25 14:06 ranma42

I think you have that backwards. a IN (b, NULL) works the same as C#, because it compiles out to a = b OR a = NULL which then becomes a = b OR UNKNOWN, which then simply becomes a = b.

@Charlieface I am not fully understanding your statement; I am afraid I am missing some context. When you say that the SQL expression a IN (b, NULL) works the same as C#, what is the C# expression you are matching it against?

Note that my example was comparing:

* in C# `a == b || a == null` which is `true` when `a` is `null`

* in SQL `a IN (b, NULL)` , which evaluates to `NULL` in SQL when `a` is `NULL`

I see, no I wasn't looking at the a left side being null, I was talking about the (b, c) right side.

Yes, if the left side is nullable then you need a separate OR a IS NULL AND b IS NULL predicate, for each b that can be nullable. If there are no list items that are nullable then you can ignore this problem.
So in the case of a == b || a == c where a and b can be null but not c then you can do WHERE (a IN (b, c) OR a IS NULL AND b IS NULL).
That is to say, yes it's a bit more complex, but it can still work.

I was referring to the right side. When there is a NULL inside the IN list and you are using NOT IN then there is the added complication that NOT IN completely doesn't work in the way you'd expect, because if there are any NULLs in the list then the whole check fails.
So for example, if you have !(a == b && a == c) and b is nullable (regardless of a nullable or not), then you can't do WHERE NOT (a IN (b, c) AND b IS NOT NULL), you need to remove b from the IN list entirely. Note that switching to NOT IN and OR doesn't help here, because of de Morgan, you also can't do WHERE (a NOT IN (b, c) OR b IS NULL).

Charlieface avatar Jun 11 '25 15:06 Charlieface

So for example, if you have !(a == b && a == c) and b is nullable (regardless of a nullable or not), then you can't do WHERE NOT (a IN (b, c) AND b IS NOT NULL), you need to remove b from the IN list entirely. Note that switching to NOT IN and OR doesn't help here, because of de Morgan, you also can't do WHERE (a NOT IN (b, c) OR b IS NULL).

I believe !(a == b && a == c) is a typo for !(a == b || a == c), so I'll discuss this case.

Let's assume a and b are nullable, while c is not nullable. I believe the C# expression (a == b || a == c) can be safely translated to something like

CASE
  WHEN a IS NULL THEN b IS NULL
  WHEN b IS NULL THEN a IN (c)
  ELSE a IN (b, c)
END

which is quite terrible, but seems to be correct (I tried to write it down based on the rules from the Sqlite docs for [NOT] IN)

The negation (!(a == b || a == c)) should be quite straightforward:

CASE
  WHEN a IS NULL THEN b IS NOT NULL
  WHEN b IS NULL THEN a NOT IN (c)
  ELSE a NOT IN (b, c)
END

In practice, I am afraid such a translation would become quite unreasonable as the number of nullable RHS variables increases (LHS can be handled easily and possibly even simplified), so I think a good intermediate step would be to only include in the RHS non-nullable variables.

Positive example (a == nullable1 || .. || a == nullableN || ... || a == nonNullable1 || ... || a == nonNullableM):

a IS NOT DISTINCT FROM nullable1 OR
...
a IS NOT DISTINCT FROM nullableN OR
(a IS NOT NULL AND a IN (nonNullable1, ..., nonNullableM))

Negated example !(a == nullable1 || .. || a == nullableN || ... || a == nonNullable1 || ... || a == nonNullableM):

a IS DISTINCT FROM nullable1 AND
...
a IS DISTINCT FROM nullableN AND
(a IS NULL OR a NOT IN (nonNullable1, ..., nonNullableM))

ranma42 avatar Jun 11 '25 16:06 ranma42