linq2db Improve sequence building performance

Limitations of current design

Today ExpressionBuilder linearly goes through a long list (~70) of ISequenceBuilder to find which builder can handle an Expression. Each builder is probed by a call to CanBuild, in sequence.

This is horribly not performant. If your query uses one of the last builders -- or simply the expression has no builder, which is sometimes a normal case -- you have to go through 70+ checks!! They are virtual calls, many builders repeatedly perform the same casts or checks, and not all builders perform quick, cheap tests. Of course this is only gonna get worse as we add new builders.

The current design attempts at improving performance by dynamically sorting the list to put the most used builders at the front. This is a good idea but it has two drawbacks.

It adds complexity for the book-keeping of how many times each builder is used. This is mutable concurrent code and requires locks. The amortized perf increases (probably) but the overall perf is worse.
It is not that effective. The list is re-sorted only when the first item would change. I assume this is done to avoid constantly reordering the list if 2 builders fight for a specific spot (although that can happen for 1st place!!). So once the most used builder is found, nothing changes anymore, even if a builder that becomes commonly used is still way down the list at the moment. As I said before, "not finding a builder" is sometimes an expected use-case and this always go through the whole list.

New strategy

Finding a builder now starts with a big switch statement on NodeType (an int). C# optimizes this depending on the number and distribution of items, it could be a hard-coded binary search (ifs) or a single indirect jump based on a table.

Then a CanBuild method is called, but notice:

At this stage, only 1 or 2 (rarely 3) candidates remain.
CanBuild is now a static method. Short ones will be inlined, which is very likely as most have been simplified thanks to the fact that we know what node to expect when CanBuild is called. Most of them are as simple as => true, => call.IsQueryable() or => call.IsAsyncExtension()

One notable case is ExpressionType.Call because we have many (most) builders for this node type. This is handled by a second switch, this time on MethodCallExpression.Method.Name, which C# will compile into a static hashtable. This is then handled as other node types. In 90% of cases there's only a single candidate left.

There are other benefits, such as clarity. The book-keeping and re-sorting code is gone. Before it was difficult to see which builder handles what, which is particularly important as they must all be mutually exclusive! Now you just need to peek at the switch tables to find what is handled, by what.

Did you know that 3 different builders handle the Join method?
I found a builder that is actually unused! I deleted CountBuilder (methods Count and LongCount are actually handled by AggregationBuilder).

Source generators

I'm not crazy and I didn't write the big switches by hand, that would be too error-prone. Instead, I added a C# Source Generator to the project, that looks for attributes like BuildsExpression or BuildsMethodCall on builders and implements FindBuilderImpl automatically.

Note Tips for source generators:

Consuming a generator in VS generally works nicely. VS will go to definitions in generated sources and you can also see the files in Solution Explorer, under References > Analyzers.

If you modify a generator your changes apply to Build immediately but as of the current version of VS 2022, IDE will continue to use the one it has loaded. The only way to refresh IDE after modifying a generator is to close and reload VS.

To help when working on a generator, you can use <EmitCompilerGeneratedFiles> in csproj to save files generated during build on disk (inside obj). I left this property, commented, in the csproj PR.

As generated code is not committed to git, I uploaded the source generated by this PR here for your convenience: https://gist.github.com/jods4/6b145327f3c7563836c357984b723caf

Code patterns

// BuildsExpression is picked up by the source generator.
// It takes the list of ExpressionType this builder may handle
[BuildsExpression(ExpressionType.Lambda, ExpressionType.Constant)]
class MyBuilder : ISequenceGenerator
{
  // This static method will be called from generated code.
  // `expr` is guaranteed to be of the types specified in attributes.
  // Keep this short and efficient!
  public static bool CanBuild(Expression expr, BuildInfo info, ExpressionBuilder builder)
    => true;
}

// BuildsMethodCall is picked up by the source generator.
// It takes the list of method names that this builder may handle
[BuildsMethodCall("Take", "Skip")]
class MethodBuilder : ISequenceGenerator // or more commonly : MethodCallBuilder
{
  // This static method will be called from generated code.
  // `call` is guaranteed to be for a method whose name is amongst BuildsMethodCall arguments.
  // Keep this short and efficient!
  public static bool CanBuildMethod(MethodCallExpression call, BuildInfo info, ExpressionBuilder builder)
    => true;
}

Special cases

A few builders may handle any method call. You can do this simply by using [BuildsExpression(ExpressionType.Call)], they will be probed after any method-specific builder [BuildsMethodCall]. Today this inculdes MethodChainBuilder, QueryExtensionBuilder, TableBuilder. Those builders tend to look for attributes on methods, e.g. [Extension] or [Table].

Some builders take advantage of having different, simpler checks based on what method or expression type matched. You can have multiple attributes on a single builder, and use the named property CanBuildName to direct the check to a different method for some cases. Many sync/async builders do that, for example ContainsBuilder:

[BuildsMethodCall("Contains")]
[BuildsMethodCall("ContainsAsync", CanBuildName = nameof(CanBuildAsyncMethod))]
sealed class ContainsBuilder : MethodCallBuilder
{
	public static bool CanBuildMethod(MethodCallExpression call, BuildInfo info, ExpressionBuilder builder)
		=> call.IsQueryable() && call.Arguments.Count == 2;

	public static bool CanBuildAsyncMethod(MethodCallExpression call, BuildInfo info, ExpressionBuilder builder)
		=> call.IsAsyncExtension() && call.Arguments.Count == 3;
}

ContextRefBuilder is unique: it seems that it could potentially handle any expression if BuilderInfo.Parent is null. For this case I added [BuildsAny]. Builders with this attribute will be probed after everything else failed. The signature of the static method is CanBuild(BuildInfo info, ExpressionBuilder builder).

Risks and breaking changes

Everything is internal: there's no public API change, so no breaking change.

A lot of files were touched because there is 70+ builders but 95% of changes are very mechanical. I think there's still room for improvements but I wanted to keep this PR straightforward, as it's fairly large. If all tests pass, I'd say the risk of regression is low.

Jan 15 '23 16:01 jods4

Some wcf remoting examples don't compile because of http in System.Net when targetting .net 4.8. @sdanyliv I suppose that's already the case on your branch, as it has nothing to do with my changes.

Jan 15 '23 19:01 jods4

I see PR is based on parser refactoring feature, so I assign it to v6 milestone

Jan 17 '23 11:01 MaceWindu

@MaceWindu yes, this is on top of the parser refactoring. I could have done it on top of v4, but I didn't want to have to merge with all @sdanyliv changes.

Jan 17 '23 12:01 jods4

Do you have an example that shows the problem? I would like to look at in performance profiler.

Feb 28 '23 15:02 igor-tkachev

@igor-tkachev if you want to see the worst query compilations, you basically need to use the builders that are close to the end of the list.

A good example would be some complex MERGE statement, because it has many parts and they are all close to the end of the list. For bonus points be sure to tag the query, as that new feature is 4th from last.

That said, be careful after the first query compilation because linq2db attempts to improve performance by sorting the list of potential builders, which means your query will now be vastly more performant on the 2nd run. Priority sorting is a good idea but it's far from perfect in practice. The list is only sorted if the 1st element would change, so if one type of query dominates from very early on (e.g. SelectBuilder), then the rest won't be sorted even if you suddenly use a significant count of MergeBuilder queries. Also, if you do many different kind of queries, you'll still pay the price for the list scanning as they can't all be at the front.

So if you're doing benchmarking, maybe execute a bunch of SELECT before benchmarking some MERGE, to ensure the required builders stay close to the end of the list during your tests.

In fact, another performance issue that is a bit unlikely would be when 2 different builders fight for the top spot. Then the list would be constantly updated, which is another perf hit.

Finally, the current code uses locks, so if you compile many queries in parallel, contention might impact perf. I think that's unlikely to show up in simple benchmarks.

Mar 01 '23 00:03 jods4

It looks like we need some mechanism to collect metrics/statistics of our code and create a report at the end of the test run.. So, we will be able to compare different solutions.

Mar 02 '23 01:03 igor-tkachev

@viceroypenguin Thanks for reviewing, it's a big changeset. 🙏

Oct 30 '23 16:10 jods4

Limitations of current design

Today ExpressionBuilder linearly goes through a long list (~70) of ISequenceBuilder to find which builder can handle an Expression. Each builder is probed by a call to CanBuild, in sequence.

This is horribly not performant.

According to metrics this is not an issue and old strategy performance is fine.

Jan 21 '24 20:01 igor-tkachev

Limitations of current design

Today ExpressionBuilder linearly goes through a long list (~70) of ISequenceBuilder to find which builder can handle an Expression. Each builder is probed by a call to CanBuild, in sequence. This is horribly not performant.

According to metrics this is not an issue and old strategy performance is fine.

@igor-tkachev

The main loop that was locking and I got rid of is method BuildSequence, which accounts for 3.9% of the time according to your benchmark.
Of course it's never gonna be an overall dominator compared to ADO.NET execution or visitors... but 3.9% is far from nothing, esp. considering how little work this is supposed to perform! Methods that account for >4% are very few in your trace.
I'd love to see a high-load scenario with many concurrent requests. The previous code had locking in it, which does not scale. Executing sequential tests or benchmarks does not reflect that.
Can you time old BuildSequence vs new BuildSequence? There's not a shadow of a doubt that the new one is much faster. Even if this is not a dominator for overall execution, why not take improvements?
It's not only faster, but in my opinion much simpler to understand. Old code had concurrent mutable state, used locks... tricky stuff.
The code generator project might also just come handy on its own in other parts of linq2db codebase.

What's your objection to merging this? I don't see how it isn't an overall improvement in all aspects.

May 03 '24 23:05 jods4

@jods4 looks like this PR was closed automatically when base branch removed. Please rebase it on master or delete branch if you don't plan to work on it

Jun 17 '24 07:06 MaceWindu

@MaceWindu Yeah it was a substantial amount of work and I think the result is beneficial, so let's not dismiss it. I'll try to rebase on master.

Jun 17 '24 07:06 jods4

@MaceWindu created #4542 on master. Let's see if all tests pass.

Jun 19 '24 23:06 jods4

linq2db linq2db copied to clipboard

Improve sequence building performance

Limitations of current design

New strategy

Source generators

Code patterns

Special cases

Risks and breaking changes

Limitations of current design

Limitations of current design

linq2db
linq2db copied to clipboard