Parallel project analysis behind a feature flag
FCS code analysis currently runs serially, project-by-project. There is significant potential for speedup by analysing all unrelated projects at the same time. Some discussion around this happened on the F# Slack with @vzarytovskii, @cartermp and others.
Given the simplest example:
```
Project A -- references --> Project B1
Project A -- references --> Project B2
```
the current order of analysis is: start A, start B1, finish B1, start B2, finish B2, finish A.
With this change the order becomes: start A, start B1, start B2, finish B1 and B2 (in either order), finish A.
What this change does:
- `tcImports.RegisterAndImportReferencedAssemblies` invokes the computations for each reference using `Async.Parallel` instead of `Async.Sequential`. This covers both DLL references and project references, so both are processed in parallel.
- For project references this means that many incremental builders are created and projects are analysed in parallel.
- There is no logic that scans the project graph up-front - projects are analysed as references are discovered, top-down.
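As an illustrative sketch of the ordering change (Python asyncio standing in for `NodeCode`/`Async.Parallel`; this is not FCS code, and the graph is the `A -> B1, B2` example above):

```python
# Illustrative sketch only: fanning out over references in parallel as they
# are discovered, top-down, with no up-front scan of the project graph.
import asyncio

REFERENCES = {"A": ["B1", "B2"], "B1": [], "B2": []}
order = []

async def analyse(project: str) -> None:
    order.append(f"start {project}")
    await asyncio.sleep(0)  # stand-in for analysis work; yields to other tasks
    # Fan out over references (Async.Parallel analogue) instead of awaiting
    # them one after another (Async.Sequential analogue).
    await asyncio.gather(*(analyse(r) for r in REFERENCES[project]))
    order.append(f"finish {project}")

asyncio.run(analyse("A"))
print(order)  # B1 and B2 both start before either finishes
```

With `asyncio.gather` replaced by a sequential loop, the trace reverts to the serial order described above.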
How the feature is enabled
The feature is currently wired up for the two main entry points: 1. the standalone compiler (FSC) and 2. `FSharpChecker`.
It is controlled as follows:
- If the `FCS_ParallelReferenceResolution` environment variable is set, its value (`true` or `false`) dictates whether the feature is on or off.
- If the environment variable is not set (to a valid value):
  - FSC falls back to an internal CLI option `--parallelreferenceresolution` which, when set, enables the feature:
    ```
    Microsoft (R) F# Compiler version 12.0.5.0 for F# 7.0
    Copyright (c) Microsoft Corporation. All Rights Reserved.
    warning FS0075: The command-line option '--parallelreferenceresolution' is for test purposes only
    ```
  - `FSharpChecker.Create` has a new parameter that, when set to true, enables the feature.
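A rough sketch of that precedence (Python for illustration, not the actual implementation; the exact value parsing - case sensitivity, accepted spellings - is an assumption):

```python
# Illustrative sketch only: a valid environment variable wins; otherwise the
# CLI option / FSharpChecker.Create parameter decides.
import os

def parallel_reference_resolution_enabled(flag_from_caller: bool) -> bool:
    env = os.environ.get("FCS_ParallelReferenceResolution")
    if env == "true":
        return True
    if env == "false":
        return False
    # Unset or invalid value: fall back to the flag passed by the caller.
    return flag_from_caller
```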
Some things I'm not 100% sure about:
- I think the cache mechanism for `IncrementalBuilder`s means that for a given project only one builder will be created, even if many dependents try to process it at the same time (thinking about a race condition in `A -> B1, B2 -> C`, where `B1` and `B2` request analysis of `C`).
- I think the way I'm creating those computations with `NodeCode` and `Async.Parallel` is correct, but I'm not sure.
- Is it OK to parallelise resolving assemblies and not just projects?
- If this applies to assembly processing, do we want that, and do we want it to apply during compilation as well (rather than only code analysis)?
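The single-builder property I'm assuming can be sketched like this (Python for illustration; this is not the actual `IncrementalBuilder` cache):

```python
# Illustrative sketch only: a lock-protected table guaranteeing a single
# builder per project, even when two dependents (B1 and B2) request
# project C at the same time.
import threading

class BuilderCache:
    def __init__(self, create):
        self._create = create            # builder factory (hypothetical)
        self._lock = threading.Lock()
        self._builders = {}

    def get(self, project):
        # Creation happens under the lock, so concurrent callers for the
        # same project always receive the same builder instance.
        with self._lock:
            if project not in self._builders:
                self._builders[project] = self._create(project)
            return self._builders[project]

created = []
cache = BuilderCache(lambda p: created.append(p) or f"builder({p})")

results = []
threads = [threading.Thread(target=lambda: results.append(cache.get("C"))) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```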
Limiting parallelisation
When enabled, should parallelisation be configurable to a maximum of X threads? If so, should that limit be per request or global for the process, and how should it be configured? I initially planned to add an option to limit it. However, separate analysis requests are, as I understand it, completely independent, so there is already no limit on how multi-threaded FCS can be; I would therefore argue that this PR isn't a major change in that regard. It does, however, increase the number of threads running analysis in a typical scenario.
Three potential issues that I can think of:
- too many threads causing CPU contention and context switching - will `ThreadPool` scheduling save us by not scheduling too many operations at once?
- I/O contention - probably requires looking at a profiler.
- high memory usage
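If a cap were wanted, one straightforward shape is a semaphore around the analysis body. A sketch (Python asyncio for illustration, not FCS code; the cap of 4 and all names are hypothetical):

```python
# Illustrative sketch only: capping the number of concurrently analysed
# projects with a semaphore, whether per request or shared process-wide.
import asyncio

async def analyse(project: str, gate: asyncio.Semaphore, state: dict) -> None:
    async with gate:                      # at most `cap` bodies run at once
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0)            # stand-in for type-checking work
        state["active"] -= 1

async def run_all(project_count: int, cap: int) -> dict:
    gate = asyncio.Semaphore(cap)
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(analyse(f"P{i}", gate, state) for i in range(project_count)))
    return state

state = asyncio.run(run_all(project_count=50, cap=4))
print("peak concurrency:", state["peak"])
```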
What speedup does this provide?
Besides any testing, this change feels very natural to me in principle. At our company we work on big solutions with ~100 F# projects. Even before doing an accurate analysis I'm fairly sure that the level of parallelisation possible in that project graph is >= 2, which would mean an almost 2x speedup (and I think it's actually much greater than 2).
GC consideration
Running code analysis on many threads means a much higher allocation rate, leading to more work for the GC. From the tests I observed that with Workstation GC (which I believe Rider uses by default) FCS was spending as much as 70% of its time in GC.
Enabling Server GC therefore had a huge impact on the timings.
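For anyone reproducing the Server GC numbers: Server GC can be enabled for a host process either via the `DOTNET_gcServer=1` environment variable (as the benchmark script below does) or with a `runtimeconfig.json` setting, e.g.:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true
    }
  }
}
```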
Test results
The script below can be used to measure the impact of this feature.
It utilises the benchmark generator from https://github.com/safesparrow/fsharp-benchmark-generator .
Script:
```powershell
# Checkout and build a specific version of the FSharp codebase
git clone https://github.com/safesparrow/fsharp
Push-Location fsharp
git checkout 4431af5960bcaece76660ec3995faaf7d8d178ac
./build.cmd -c Release -noVisualStudio
Pop-Location

# Checkout the benchmark generator
git clone https://github.com/safesparrow/fsharp-benchmark-generator
Push-Location fsharp-benchmark-generator

Function FCSBenchmark {
    dotnet run -- -f ((Get-Location).Path + "/../fsharp/artifacts/bin/FSharp.Compiler.Service/release/netstandard2.0/FSharp.Compiler.Service.dll") -i .\inputs\50_leaves.json -n 1
}

# Run 4 benchmarks with 4 different environment configs, save results in files
$env:DOTNET_gcServer=0; $env:FCS_PARALLEL_PROJECTS_ANALYSIS="false"; FCSBenchmark > ../workstation_sequential.log
$env:DOTNET_gcServer=0; $env:FCS_PARALLEL_PROJECTS_ANALYSIS="true"; FCSBenchmark > ../workstation_parallel.log
$env:DOTNET_gcServer=1; $env:FCS_PARALLEL_PROJECTS_ANALYSIS="false"; FCSBenchmark > ../server_sequential.log
$env:DOTNET_gcServer=1; $env:FCS_PARALLEL_PROJECTS_ANALYSIS="true"; FCSBenchmark > ../server_parallel.log
Pop-Location
```
Test 1: Extremely parallel example: Root -> 50 leaves
See https://github.com/safesparrow/fsharp-benchmark-generator/blob/main/inputs/50_leaves.json for the definition.
10 iteration average (the first iteration is considerably slower):
| GC | Mode | Time |
|---|---|---|
| Workstation | Sequential | 16005ms |
| Workstation | Parallel | 10849ms |
| Server | Sequential | 14594ms |
| Server | Parallel | 2659ms |
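For reference, the speedup ratios implied by the table above (a quick calculation derived here, not part of the original measurements):

```python
# Speedup = sequential time / parallel time, per GC mode (times in ms).
times = {
    ("Workstation", "Sequential"): 16005,
    ("Workstation", "Parallel"): 10849,
    ("Server", "Sequential"): 14594,
    ("Server", "Parallel"): 2659,
}
ws_speedup = times[("Workstation", "Sequential")] / times[("Workstation", "Parallel")]
srv_speedup = times[("Server", "Sequential")] / times[("Server", "Parallel")]
print(f"Workstation: {ws_speedup:.2f}x, Server: {srv_speedup:.2f}x")
```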
Test 2: Almost sequential project structure: Fantomas.Tests
See https://github.com/safesparrow/fsharp-benchmark-generator/blob/main/inputs/fantomas.json for the definition.
Here the only parallelisation can happen between `Fantomas.Client` and `Fantomas.FCS`.
However, `Fantomas.Client` is a 3-file project, so its analysis doesn't take long; my claim is that the ~200ms difference is roughly the time it takes to analyse `Fantomas.Client`.
| GC | Mode | Time |
|---|---|---|
| Workstation | Sequential | 8496ms |
| Workstation | Parallel | 8353ms |
| Server | Sequential | 6492ms |
| Server | Parallel | 6222ms |
TODO:
- [ ] Discuss all the above issues
- [x] Inject the "analyse in parallel" flag from top-level
- [ ] Provide more test results
- [ ] Test this change on the CI (currently it's disabled by default)
I have provided some test results which show the performance impact. @vzarytovskii Would it be possible to consider this change?
As mentioned in the description there are tweaks to be made (e.g. a different way to turn it on/off), but that requires input from the maintainers 🙂
The change is great, and needs a bunch of testing, especially in Visual Studio, since it loads projects differently and has some quirks regarding threading.
We will also probably need a flag passed to checker, and a VS feature flag (for easier in-VS testing, which will be shown in the settings).
Update 1: I'd also be interested in differences in memory consumption.
Update 2: GC is important; I wonder how it behaves on "slower" machines. I will test it on my laptop (which doesn't have great specs).
Thoughts @KevinRansom @dsyme ( + @TIHan and @cartermp, sorry for the ping :) )
> We will also probably need a flag passed to checker, and a VS feature flag (for easier in-VS testing, which will be shown in the settings).
Makes sense. @auduchinok if this change were to land, how much work would it be for Rider to pick up the FCS version that supports it and an option to enable it?
> Update 1: I'd also be interested in differences in memory consumption.
Yes, that's something worth looking at.
Would be nice to add more stats to https://github.com/safesparrow/fsharp-benchmark-generator/ .
Possibly worth making Benchmark.Runner a BDN benchmark with carefully chosen config as to not make it take too long.
> Update 2: GC is important; I wonder how it behaves on "slower" machines. I will test it on my laptop (which doesn't have great specs).
Would be lovely if one day this question can be answered automatically by running performance tests on a variety of predefined [possibly virtual, if results are not skewed too much] machines.
For now we can capture those stats in https://github.com/safesparrow/fsharp-benchmark-generator/.
In the short-term I think what we should do is:
- gather a set of metrics to look for
- gather/generate a set of codebases to test this on
Then we can either:
- find enough variations of hardware by asking people here to volunteer
- get the hardware from the cloud
And then run https://github.com/safesparrow/fsharp-benchmark-generator/ for every combination.
Regarding VS: What's special about it? Can we distill any/most important unique environment conditions it provides into a reusable piece of code that can then be trivially set up in a benchmark?
> Makes sense. @auduchinok if this change were to land, how much work would it be for Rider to pick up the FCS version that supports it and an option to enable it?
It would be a bit more difficult this time, since many impactful refactorings have been made upstream. 🙂 I'm going to update it for the 2022.3 release at some point during the EAP.
> Regarding VS: What's special about it?
If I am not mistaken, it pretty much has its own threading model/primitives which may affect the behaviour here.
> > Regarding VS: What's special about it?
>
> If I am not mistaken, it pretty much has its own threading model/primitives which may affect the behaviour here.
I think this is another vote for making this opt-in, but nonetheless: do you know if it would be possible to use the same threading model in a benchmark, to increase coverage without needing the full VS? Obviously we'd want to try it in full VS at some point, but hopefully not all future changes like this (even less significant ones) will require a lot of manual effort to verify behaviour in VS.
> I think the cache mechanism for IncrementalBuilders means that for a given project only one builder will be created, even if many dependents will try to process it at the same time (thinking about a race condition in `A -> B1, B2 -> C` where `B1` and `B2` request analysis of `C`).
Yes, this is correct
> I think the way I'm creating those computations with `NodeCode` and `Async.Parallel` is correct, but I'm not sure.
This is correct.
> Is it OK to parallelise resolving assemblies and not just projects?
I think it is ok.
On initial review I spotted some concerns about concurrent assembly loading/referencing, which I'll document below. However, these are not in practice a problem in the case of IncrementalBuilder: the overall assembly loading and resolution process is held within a GraphNode for the unique builder, and all requests requiring the builder have to await the completion of that GraphNode.
The part I was concerned about is concurrent access in the RegisterDll and RegisterCcu calls, which execute before the phase2 fixups - that is, all of these calls:
- `TryRegisterAndPrepareToImportReferencedDll`:
  `tcImports.RegisterDll dllinfo`
- `PrepareToImportReferencedILAssembly`:
  `tcImports.RegisterCcu ccuinfo`
- `PrepareToImportReferencedFSharpAssembly`:
  `ccuRawDataAndInfos |> List.iter (p23 >> tcImports.RegisterCcu)`
Currently the sequence is:
```
Sequential [
    Phase1 Assembly A
    RegisterDll Assembly A
    RegisterCcu Assembly A
    Phase1 Assembly B
    RegisterDll Assembly B
    RegisterCcu Assembly B
    Phase2 Assembly A
    Phase2 Assembly B
]
```
With the change it becomes:
```
Parallel [
    Sequential [
        Phase1 Assembly A
        RegisterDll Assembly A
        RegisterCcu Assembly A
    ]
    Sequential [
        Phase1 Assembly B
        RegisterDll Assembly B
        RegisterCcu Assembly B
    ]
]
Phase2 Assembly A
Phase2 Assembly B
```
This amounts to the same thing, and as mentioned all of it is within the control of a sequentializing GraphNode here or here.
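The equivalence claimed above can be sketched as follows (Python threads for illustration, not compiler code): joining the parallel per-assembly Phase1 work acts as a barrier, so every registration still happens before any Phase2 fixup, just as in the sequential version.

```python
# Illustrative sketch only: Phase1 + registration runs per-assembly in
# parallel; the join is the barrier that Phase2 runs behind.
import threading

events = []
lock = threading.Lock()

def log(event: str) -> None:
    with lock:
        events.append(event)

def phase1_and_register(assembly: str) -> None:
    log(f"Phase1 {assembly}")
    log(f"RegisterDll {assembly}")
    log(f"RegisterCcu {assembly}")

assemblies = ["A", "B"]
threads = [threading.Thread(target=phase1_and_register, args=(a,)) for a in assemblies]
for t in threads:
    t.start()
for t in threads:
    t.join()                  # barrier: all registrations are complete here

for a in assemblies:
    log(f"Phase2 {a}")        # Phase2 observes every registration
```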
> If this applies to assembly processing, do we want that, and do we want that to apply during compilation as well (rather than code analysis)?
Yes, we should apply it during command-line compilation as well, subject to performance testing.
I have now wired up the feature setting from top-level in both FSC and FSharpChecker.
The user can opt-in to the feature by setting the environment variable FCS_ParallelReferenceResolution=true, or through MSBuild/IDE.
I also raised https://github.com/dotnet/fsharp/pull/13835 with a feature that produces OpenTelemetry traces; its description contains example traces comparing the type-checking timeline with the feature on and off - see https://github.com/dotnet/fsharp/pull/13835#issue-1360897631 . This was done to produce results that help verify that the behaviour of parallel analysis is as expected.
The few performance tests whose results I shared show the expected benefits from parallel work, and as far as I can see nothing indicates that work is being duplicated.
@vzarytovskii @dsyme What would be the next steps before this PR can be accepted as an opt-in, non-public/experimental feature?
@safesparrow I can't merge to your branch, can you merge main into this please?
@safesparrow Thanks!
@safesparrow Actually it's showing it needs another merge (normally we can update, but for some reason your branches don't allow it)
> Actually it's showing it needs another merge
Ah, those changes in main are coming quicker than PR checks can run 😄 I did another update now.
> normally we can update but for some reason your branches don't allow it
Not sure why. The Allow edits by maintainers checkbox is checked:

Not sure why, but a single regression test failed after the latest merge. I submitted a dummy commit to trigger a rerun.
EDIT: Actually during one of the merges I mishandled git and ended up deleting one of the regression test samples. This time tests should pass 🤞
You seem to need to keep merging, my apologies.