linq2db Improve sequence building

This is a #3913 redone on latest v6 master branch. I'm gonna copy-paste my original description below:

Limitations of current design

Today ExpressionBuilder linearly goes through a long list (~70) of ISequenceBuilder to find which builder can handle an Expression. Each builder is probed by a call to CanBuild, in sequence.

This is horribly not performant. If your query uses one of the last builders -- or simply the expression has no builder, which is sometimes a normal case -- you have to go through 70+ checks!! They are virtual calls, many builders repeatedly perform the same casts or checks, and not all builders perform quick, cheap tests. Of course this is only gonna get worse as we add new builders.

The current design attempts at improving performance by dynamically sorting the list to put the most used builders at the front. This is a good idea but it has two drawbacks.

It adds complexity for the book-keeping of how many times each builder is used. This is mutable concurrent code and requires locks. The amortized perf increases (probably) but the overall perf is worse.
It is not that effective. The list is re-sorted only when the first item would change. I assume this is done to avoid constantly reordering the list if 2 builders fight for a specific spot (although that can happen for 1st place!!). So once the most used builder is found, nothing changes anymore, even if a builder that becomes commonly used is still way down the list at the moment.
As I said before, "not finding a builder" is sometimes an expected use-case and this always go through the whole list.

New strategy

Finding a builder now starts with a big switch statement on NodeType (an int). C# optimizes this depending on the number and distribution of items, it could be a hard-coded binary search (ifs) or a single indirect jump based on a table.

Then a CanBuild method is called, but notice:

At this stage, only 1 or 2 (rarely 3) candidates remain.
CanBuild is now a static method. Short ones will be inlined, which is very likely as most have been simplified thanks to the fact that we know what node to expect when CanBuild is called. Most of them are as simple as => true, => call.IsQueryable() or => call.IsAsyncExtension()

One notable case is ExpressionType.Call because we have many (most) builders for this node type. This is handled by a second switch, this time on MethodCallExpression.Method.Name, which C# will compile into a static hashtable. This is then handled as other node types. In 90% of cases there's only a single candidate left.

There are other benefits, such as clarity. The book-keeping and re-sorting code is gone. Before it was difficult to see which builder handles what, which is particularly important as they must all be mutually exclusive! Now you just need to peek at the switch tables to find what is handled, by what.

Did you know that 3 different builders handle the Join method?
I found a builder that is actually unused! I deleted CountBuilder (methods Count and LongCount are actually handled by AggregationBuilder).

Source generators

I'm not crazy and I didn't write the big switches by hand, that would be too error-prone. Instead, I added a C# Source Generator to the project, that looks for attributes like BuildsExpression or BuildsMethodCall on builders and implements FindBuilderImpl automatically.

Note Tips for source generators:

Consuming a generator in VS generally works nicely. VS will go to definitions in generated sources and you can also see the files in Solution Explorer, under References > Analyzers.

If you modify a generator your changes apply to Build immediately but as of the current version of VS 2022, IDE will continue to use the one it has loaded. The only way to refresh IDE after modifying a generator is to close and reload VS.

To help when working on a generator, you can use <EmitCompilerGeneratedFiles> in csproj to save files generated during build on disk (inside obj). I left this property, commented, in the csproj PR.

As generated code is not committed to git, I uploaded the source generated by this PR here for your convenience: https://gist.github.com/jods4/6b145327f3c7563836c357984b723caf

Code patterns

// BuildsExpression is picked up by the source generator.
// It takes the list of ExpressionType this builder may handle
[BuildsExpression(ExpressionType.Lambda, ExpressionType.Constant)]
class MyBuilder : ISequenceGenerator
{
  // This static method will be called from generated code.
  // `expr` is guaranteed to be of the types specified in attributes.
  // Keep this short and efficient!
  public static bool CanBuild(Expression expr, BuildInfo info, ExpressionBuilder builder)
    => true;
}

// BuildsMethodCall is picked up by the source generator.
// It takes the list of method names that this builder may handle
[BuildsMethodCall("Take", "Skip")]
class MethodBuilder : ISequenceGenerator // or more commonly : MethodCallBuilder
{
  // This static method will be called from generated code.
  // `call` is guaranteed to be for a method whose name is amongst BuildsMethodCall arguments.
  // Keep this short and efficient!
  public static bool CanBuildMethod(MethodCallExpression call, BuildInfo info, ExpressionBuilder builder)
    => true;
}

Special cases

A few builders may handle any method call. You can do this simply by using [BuildsExpression(ExpressionType.Call)], they will be probed after any method-specific builder [BuildsMethodCall]. Today this inculdes MethodChainBuilder, QueryExtensionBuilder, TableBuilder. Those builders tend to look for attributes on methods, e.g. [Extension] or [Table].

Some builders take advantage of having different, simpler checks based on what method or expression type matched. You can have multiple attributes on a single builder, and use the named property CanBuildName to direct the check to a different method for some cases. Many sync/async builders do that, for example ContainsBuilder:

[BuildsMethodCall("Contains")]
[BuildsMethodCall("ContainsAsync", CanBuildName = nameof(CanBuildAsyncMethod))]
sealed class ContainsBuilder : MethodCallBuilder
{
	public static bool CanBuildMethod(MethodCallExpression call, BuildInfo info, ExpressionBuilder builder)
		=> call.IsQueryable() && call.Arguments.Count == 2;

	public static bool CanBuildAsyncMethod(MethodCallExpression call, BuildInfo info, ExpressionBuilder builder)
		=> call.IsAsyncExtension() && call.Arguments.Count == 3;
}

ContextRefBuilder is unique: it seems that it could potentially handle any expression if BuilderInfo.Parent is null. For this case I added [BuildsAny]. Builders with this attribute will be probed after everything else failed. The signature of the static method is CanBuild(BuildInfo info, ExpressionBuilder builder).

Risks and breaking changes

Everything is internal: there's no public API change, so no breaking change.

A lot of files were touched because there is 70+ builders but 95% of changes are very mechanical. I think there's still room for improvements but I wanted to keep this PR straightforward, as it's fairly large. If all tests pass, I'd say the risk of regression is low.

Jun 19 '24 23:06 jods4

Test baselines changed by this PR. Don't forget to merge/close baselines PR after this pr merged/closed.

Jun 20 '24 04:06 linq2dbot

@MaceWindu Tests look good. There are 2 failures:

One is the currently known issue of ORA-01000: maximum open cursors exceeded during one of Oracle 19 tests.
The other is SAP HANA test suite timing-out after one hour.

Neither of those is related to SQL query generation (this PR).

Jun 20 '24 07:06 jods4

/azp run test-all

Jun 29 '24 13:06 MaceWindu

Azure Pipelines successfully started running 1 pipeline(s).

Jun 29 '24 13:06 azure-pipelines[bot]

/azp run test-all

Jun 29 '24 14:06 MaceWindu

Azure Pipelines successfully started running 1 pipeline(s).

Jun 29 '24 14:06 azure-pipelines[bot]

/azp run test-all

Jun 29 '24 14:06 MaceWindu

Azure Pipelines successfully started running 1 pipeline(s).

Jun 29 '24 14:06 azure-pipelines[bot]

Wonder what changed so it affected generation of couple of aliases in baselines. In any case it doesn't look worrying

Jun 29 '24 14:06 MaceWindu

/azp run test-all

Jun 29 '24 17:06 viceroypenguin

Azure Pipelines successfully started running 1 pipeline(s).

Jun 29 '24 17:06 azure-pipelines[bot]

@MaceWindu

Wonder what changed so it affected generation of couple of aliases in baselines. In any case it doesn't look worrying

That's weird indeed, it shouldn't have had any impact on generated SQL, except if multiple builders were competing for the same expression.

More likely that something was missed during the rebase :(

Jun 29 '24 21:06 jods4

/azp run test-all

Jun 29 '24 23:06 viceroypenguin

Azure Pipelines successfully started running 1 pipeline(s).

Jun 29 '24 23:06 azure-pipelines[bot]

/azp run test-all

Jun 29 '24 23:06 viceroypenguin

Azure Pipelines successfully started running 1 pipeline(s).

Jun 29 '24 23:06 azure-pipelines[bot]

@jods4 If you want to look, see here: https://github.com/linq2db/linq2db.baselines/pull/1240/files. Probably not worth researching too deeply, because the SQL is functionally equivalent.

Jun 30 '24 11:06 viceroypenguin

@jods4 If you want to look, see here: https://github.com/linq2db/linq2db.baselines/pull/1240/files. Probably not worth researching too deeply, because the SQL is functionally equivalent.

Ok, well there were some pretty substantial changes in AggregationBuilder that I had to merge... given that all these differences seem to come from aggregates I'm gonna guess this is were the difference comes from. If it's working fine I'm not gonna dig deeper.

Jul 01 '24 15:07 jods4

/azp run test-all

Jul 01 '24 22:07 jods4

Azure Pipelines successfully started running 1 pipeline(s).

Jul 01 '24 22:07 azure-pipelines[bot]