Parallel project analysis behind a feature flag
FCS code analysis currently runs serially, project-by-project. There is significant potential for speedup by analysing all unrelated projects at the same time. Some discussion around this happened on the F# Slack with @vzarytovskii, @cartermp and others.
Given the simplest example:
```
Project A -- references --> Project B1
Project A -- references --> Project B2
```
the current order of analysis is: start A, start B1, finish B1, start B2, finish B2, finish A.
With this change the order becomes: start A, start B1, start B2, finish B1 and B2 (in either order), finish A.
What this change does:
- `tcImports.RegisterAndImportReferencedAssemblies` invokes the computations for each reference using `Async.Parallel` instead of `Async.Sequential`. This covers both DLL references and project references, so both are processed in parallel.
- For project references this means that many incremental builders are created and projects are analysed in parallel.
- There is no logic that scans the project graph up-front - projects are analysed as references are discovered, top-down.
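As an illustrative sketch of the ordering change (Python asyncio standing in for `NodeCode`/`Async.Parallel`; this is not FCS code, and the graph is the `A -> B1, B2` example above):

```python
# Illustrative sketch only: fanning out over references in parallel as they
# are discovered, top-down, with no up-front scan of the project graph.
import asyncio

REFERENCES = {"A": ["B1", "B2"], "B1": [], "B2": []}
order = []

async def analyse(project: str) -> None:
    order.append(f"start {project}")
    await asyncio.sleep(0)  # stand-in for analysis work; yields to other tasks
    # Fan out over references (Async.Parallel analogue) instead of awaiting
    # them one after another (Async.Sequential analogue).
    await asyncio.gather(*(analyse(r) for r in REFERENCES[project]))
    order.append(f"finish {project}")

asyncio.run(analyse("A"))
print(order)  # B1 and B2 both start before either finishes
```

With `asyncio.gather` replaced by a sequential loop, the trace reverts to the serial order described above.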
How the feature is enabled
The feature is currently wired up for the two main entry points: 1. the standalone compiler (FSC) and 2. `FSharpChecker`.
It is controlled as follows:
- If the `FCS_ParallelReferenceResolution` environment variable is set, its value (`true` or `false`) dictates whether the feature is on or off.
- If the environment variable is not set (to a valid value):
  - FSC falls back to an internal CLI option `--parallelreferenceresolution` which, when set, enables the feature:
    ```
    Microsoft (R) F# Compiler version 12.0.5.0 for F# 7.0
    Copyright (c) Microsoft Corporation. All Rights Reserved.
    warning FS0075: The command-line option '--parallelreferenceresolution' is for test purposes only
    ```
  - `FSharpChecker.Create` has a new parameter that, when set to true, enables the feature.
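A rough sketch of that precedence (Python for illustration, not the actual implementation; the exact value parsing - case sensitivity, accepted spellings - is an assumption):

```python
# Illustrative sketch only: a valid environment variable wins; otherwise the
# CLI option / FSharpChecker.Create parameter decides.
import os

def parallel_reference_resolution_enabled(flag_from_caller: bool) -> bool:
    env = os.environ.get("FCS_ParallelReferenceResolution")
    if env == "true":
        return True
    if env == "false":
        return False
    # Unset or invalid value: fall back to the flag passed by the caller.
    return flag_from_caller
```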
Some things I'm not 100% sure about:
- I think the cache mechanism for `IncrementalBuilder`s means that for a given project only one builder will be created, even if many dependents try to process it at the same time (thinking about a race condition in `A -> B1, B2 -> C`, where `B1` and `B2` request analysis of `C`).
- I think the way I'm creating those computations with `NodeCode` and `Async.Parallel` is correct, but I'm not sure.
- Is it OK to parallelise resolving assemblies and not just projects?
- If this applies to assembly processing, do we want that, and do we want it to apply during compilation as well (rather than only code analysis)?
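The single-builder property I'm assuming can be sketched like this (Python for illustration; this is not the actual `IncrementalBuilder` cache):

```python
# Illustrative sketch only: a lock-protected table guaranteeing a single
# builder per project, even when two dependents (B1 and B2) request
# project C at the same time.
import threading

class BuilderCache:
    def __init__(self, create):
        self._create = create            # builder factory (hypothetical)
        self._lock = threading.Lock()
        self._builders = {}

    def get(self, project):
        # Creation happens under the lock, so concurrent callers for the
        # same project always receive the same builder instance.
        with self._lock:
            if project not in self._builders:
                self._builders[project] = self._create(project)
            return self._builders[project]

created = []
cache = BuilderCache(lambda p: created.append(p) or f"builder({p})")

results = []
threads = [threading.Thread(target=lambda: results.append(cache.get("C"))) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```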
Limiting parallelisation
When enabled, should parallelisation be configurable to a maximum of X threads? If so, should that limit be per request or global for the process, and how should it be configured? I initially planned to add an option to limit it. However, separate analysis requests are, as I understand it, completely independent, so there is already no limit on how multi-threaded FCS can be; I would therefore argue that this PR isn't a major change in that regard. It does, however, increase the number of threads running analysis in a typical scenario.
Three potential issues that I can think of:
- too many threads causing CPU contention and context switching - will `ThreadPool` scheduling save us by not scheduling too many operations at once?
- I/O contention - probably requires looking at a profiler.
- high memory usage
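If a cap were wanted, one straightforward shape is a semaphore around the analysis body. A sketch (Python asyncio for illustration, not FCS code; the cap of 4 and all names are hypothetical):

```python
# Illustrative sketch only: capping the number of concurrently analysed
# projects with a semaphore, whether per request or shared process-wide.
import asyncio

async def analyse(project: str, gate: asyncio.Semaphore, state: dict) -> None:
    async with gate:                      # at most `cap` bodies run at once
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0)            # stand-in for type-checking work
        state["active"] -= 1

async def run_all(project_count: int, cap: int) -> dict:
    gate = asyncio.Semaphore(cap)
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(analyse(f"P{i}", gate, state) for i in range(project_count)))
    return state

state = asyncio.run(run_all(project_count=50, cap=4))
print("peak concurrency:", state["peak"])
```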
What speedup does this provide?
Besides any testing, this change feels very natural to me in principle. At our company we work on big solutions with ~100 F# projects. Even before doing an accurate analysis I'm fairly sure that the level of parallelisation possible in that project graph is >= 2, which would mean an almost 2x speedup (and I think it's actually much greater than 2).
GC consideration
Running code analysis on many threads means a much higher allocation rate, leading to more work for the GC. From the tests I observed that with Workstation GC (which I believe Rider uses by default) FCS was spending as much as 70% of its time in GC.
Enabling Server GC therefore had a huge impact on the timings.
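For anyone reproducing the Server GC numbers: Server GC can be enabled for a host process either via the `DOTNET_gcServer=1` environment variable (as the benchmark script below does) or with a `runtimeconfig.json` setting, e.g.:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true
    }
  }
}
```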
Test results
The script below can be used to measure the impact of this feature.
It utilises the benchmark generator from https://github.com/safesparrow/fsharp-benchmark-generator .
Script:
```powershell
# Checkout and build a specific version of the FSharp codebase
git clone https://github.com/safesparrow/fsharp
Push-Location fsharp
git checkout 4431af5960bcaece76660ec3995faaf7d8d178ac
./build.cmd -c Release -noVisualStudio
Pop-Location

# Checkout the benchmark generator
git clone https://github.com/safesparrow/fsharp-benchmark-generator
Push-Location fsharp-benchmark-generator

Function FCSBenchmark {
    dotnet run -- -f ((Get-Location).Path + "/../fsharp/artifacts/bin/FSharp.Compiler.Service/release/netstandard2.0/FSharp.Compiler.Service.dll") -i .\inputs\50_leaves.json -n 1
}

# Run 4 benchmarks with 4 different environment configs, save results in files
$env:DOTNET_gcServer=0; $env:FCS_PARALLEL_PROJECTS_ANALYSIS="false"; FCSBenchmark > ../workstation_sequential.log
$env:DOTNET_gcServer=0; $env:FCS_PARALLEL_PROJECTS_ANALYSIS="true"; FCSBenchmark > ../workstation_parallel.log
$env:DOTNET_gcServer=1; $env:FCS_PARALLEL_PROJECTS_ANALYSIS="false"; FCSBenchmark > ../server_sequential.log
$env:DOTNET_gcServer=1; $env:FCS_PARALLEL_PROJECTS_ANALYSIS="true"; FCSBenchmark > ../server_parallel.log
Pop-Location
```
Test 1: Extremely parallel example: Root -> 50 leaves
See https://github.com/safesparrow/fsharp-benchmark-generator/blob/main/inputs/50_leaves.json for the definition.
10 iteration average (the first iteration is considerably slower):
| GC | Mode | Time |
|---|---|---|
| Workstation | Sequential | 16005ms |
| Workstation | Parallel | 10849ms |
| Server | Sequential | 14594ms |
| Server | Parallel | 2659ms |
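For reference, the speedup ratios implied by the table above (a quick calculation derived here, not part of the original measurements):

```python
# Speedup = sequential time / parallel time, per GC mode (times in ms).
times = {
    ("Workstation", "Sequential"): 16005,
    ("Workstation", "Parallel"): 10849,
    ("Server", "Sequential"): 14594,
    ("Server", "Parallel"): 2659,
}
ws_speedup = times[("Workstation", "Sequential")] / times[("Workstation", "Parallel")]
srv_speedup = times[("Server", "Sequential")] / times[("Server", "Parallel")]
print(f"Workstation: {ws_speedup:.2f}x, Server: {srv_speedup:.2f}x")
```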
Test 2: Almost sequential project structure: Fantomas.Tests
See https://github.com/safesparrow/fsharp-benchmark-generator/blob/main/inputs/fantomas.json for the definition.
Here the only parallelisation can happen between `Fantomas.Client` and `Fantomas.FCS`.
However, `Fantomas.Client` is a 3-file project, so its analysis doesn't take long; my claim is that the ~200ms difference is roughly the time it takes to analyse `Fantomas.Client`.
| GC | Mode | Time |
|---|---|---|
| Workstation | Sequential | 8496ms |
| Workstation | Parallel | 8353ms |
| Server | Sequential | 6492ms |
| Server | Parallel | 6222ms |
TODO:
- [ ] Discuss all the above issues
- [x] Inject the "analyse in parallel" flag from top-level
- [ ] Provide more test results
- [ ] Test this change on the CI (currently it's disabled by default)
I have provided some test results which show the performance impact. @vzarytovskii Would it be possible to consider this change?
As mentioned in the description there are tweaks to be made (e.g. a different way to turn it on/off), but that requires input from the maintainers 🙂
The change is great, and needs a bunch of testing, especially in Visual Studio, since it loads projects differently and has some quirks regarding threading.
We will also probably need a flag passed to checker, and a VS feature flag (for easier in-VS testing, which will be shown in the settings).
Update 1: I'd also be interested in differences in memory consumption.
Update 2: GC is important; I wonder how it behaves on "slower" machines. I will test it on my laptop (which doesn't have great specs).
Thoughts @KevinRansom @dsyme ( + @TIHan and @cartermp, sorry for the ping :) )
> We will also probably need a flag passed to checker, and a VS feature flag (for easier in-VS testing, which will be shown in the settings).
Makes sense. @auduchinok if this change were to land, how much work would it be for Rider to pick up the FCS version that supports it and an option to enable it?
> Update 1: I'd also be interested in differences in memory consumption.
Yes, that's something worth looking at.
Would be nice to add more stats to https://github.com/safesparrow/fsharp-benchmark-generator/ .
Possibly worth making Benchmark.Runner a BDN benchmark with carefully chosen config as to not make it take too long.
> Update 2: GC is important; I wonder how it behaves on "slower" machines. I will test it on my laptop (which doesn't have great specs).
Would be lovely if one day this question can be answered automatically by running performance tests on a variety of predefined [possibly virtual, if results are not skewed too much] machines.
For now we can capture those stats in https://github.com/safesparrow/fsharp-benchmark-generator/.
In the short-term I think what we should do is:
- gather a set of metrics to look for
- gather/generate a set of codebases to test this on
Then we can either:
- find enough variations of hardware by asking people here to volunteer
- get the hardware from the cloud
And then run https://github.com/safesparrow/fsharp-benchmark-generator/ for every combination.
Regarding VS: What's special about it? Can we distill any/most important unique environment conditions it provides into a reusable piece of code that can then be trivially set up in a benchmark?
> Makes sense. @auduchinok if this change were to land, how much work would it be for Rider to pick up the FCS version that supports it and an option to enable it?
It would be a bit more difficult this time, since many impactful refactorings have been made upstream. 🙂 I'm going to update it for the 2022.3 release at some point during the EAP.
> Regarding VS: What's special about it?
If I am not mistaken, it pretty much has its own threading model/primitives which may affect the behaviour here.
> > Regarding VS: What's special about it?
>
> If I am not mistaken, it pretty much has its own threading model/primitives which may affect the behaviour here.
I think this is another vote for making this opt-in, but nonetheless: do you know if it would be possible to use the same threading model in a benchmark, to increase coverage without needing the full VS? Obviously we'd want to try it in full VS at some point, but hopefully not all future changes like this (even less significant ones) will require a lot of manual effort to verify behaviour in VS.
> I think the cache mechanism for IncrementalBuilders means that for a given project only one builder will be created, even if many dependents will try to process it at the same time (thinking about a race condition in `A -> B1, B2 -> C` where `B1` and `B2` request analysis of `C`).
Yes, this is correct
> I think the way I'm creating those computations with `NodeCode` and `Async.Parallel` is correct, but I'm not sure.
This is correct.
> Is it OK to parallelise resolving assemblies and not just projects?
I think it is ok.
On initial review I spotted some concerns about concurrent assembly loading/referencing, which I'll document below. However, these are not in practice a problem in the case of IncrementalBuilder: the overall assembly loading and resolution process is held within a GraphNode for the unique builder, and all requests requiring the builder have to await the completion of that GraphNode.
The part I was concerned about is concurrent access in the RegisterDll and RegisterCcu calls, which execute before the phase2 fixups - that is, all of these calls:
- `TryRegisterAndPrepareToImportReferencedDll`:
  `tcImports.RegisterDll dllinfo`
- `PrepareToImportReferencedILAssembly`:
  `tcImports.RegisterCcu ccuinfo`
- `PrepareToImportReferencedFSharpAssembly`:
  `ccuRawDataAndInfos |> List.iter (p23 >> tcImports.RegisterCcu)`
Currently the sequence is:
```
Sequential [
    Phase1 Assembly A
    RegisterDll Assembly A
    RegisterCcu Assembly A
    Phase1 Assembly B
    RegisterDll Assembly B
    RegisterCcu Assembly B
    Phase2 Assembly A
    Phase2 Assembly B
]
```
With the change it becomes:
```
Parallel [
    Sequential [
        Phase1 Assembly A
        RegisterDll Assembly A
        RegisterCcu Assembly A
    ]
    Sequential [
        Phase1 Assembly B
        RegisterDll Assembly B
        RegisterCcu Assembly B
    ]
]
Phase2 Assembly A
Phase2 Assembly B
```
This amounts to the same thing, and as mentioned all of it is within the control of a sequentializing GraphNode here or here.
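The equivalence claimed above can be sketched as follows (Python threads for illustration, not compiler code): joining the parallel per-assembly Phase1 work acts as a barrier, so every registration still happens before any Phase2 fixup, just as in the sequential version.

```python
# Illustrative sketch only: Phase1 + registration runs per-assembly in
# parallel; the join is the barrier that Phase2 runs behind.
import threading

events = []
lock = threading.Lock()

def log(event: str) -> None:
    with lock:
        events.append(event)

def phase1_and_register(assembly: str) -> None:
    log(f"Phase1 {assembly}")
    log(f"RegisterDll {assembly}")
    log(f"RegisterCcu {assembly}")

assemblies = ["A", "B"]
threads = [threading.Thread(target=phase1_and_register, args=(a,)) for a in assemblies]
for t in threads:
    t.start()
for t in threads:
    t.join()                  # barrier: all registrations are complete here

for a in assemblies:
    log(f"Phase2 {a}")        # Phase2 observes every registration
```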
> If this applies to assembly processing, do we want that, and do we want that to apply during compilation as well (rather than code analysis)?
Yes, we should apply it during command-line compilation as well, subject to performance testing.
I have now wired up the feature setting from top-level in both FSC and FSharpChecker.
The user can opt-in to the feature by setting the environment variable FCS_ParallelReferenceResolution=true, or through MSBuild/IDE.
I also raised https://github.com/dotnet/fsharp/pull/13835 with a feature that produces OpenTelemetry traces; its description contains example traces comparing the type-checking timeline with the feature on and off - see https://github.com/dotnet/fsharp/pull/13835#issue-1360897631 . This was done to produce results that help verify that the behaviour of parallel analysis is as expected.
The few performance tests whose results I shared show the expected benefits from parallel work, and as far as I can see nothing indicates that work is being duplicated.
@vzarytovskii @dsyme What would be the next steps before this PR can be accepted as an opt-in, non-public/experimental feature?
@safesparrow I can't merge to your branch, can you merge main into this please?
@safesparrow Thanks!
@safesparrow Actually it's showing it needs another merge (normally we can update, but for some reason your branches don't allow it)
> Actually it's showing it needs another merge
Ah, those changes in main are coming quicker than PR checks can run 😄 I did another update now.
> normally we can update but for some reason your branches don't allow it
Not sure why. The Allow edits by maintainers checkbox is checked:

Not sure why, but a single regression test failed after the latest merge. I submitted a dummy commit to trigger a rerun.
EDIT: Actually during one of the merges I mishandled git and ended up deleting one of the regression test samples. This time tests should pass 🤞
You seem to need to keep merging, my apologies.