azure-functions-host
Performance: Optimize MetricsEventManager collection and rollup
Part of #7908 (and supersedes #7927, which has some history/discussion for reference). These two optimizations are paired because they're both in the metrics path and the locks/contention come into play from both `FunctionStarted` and `FunctionCompleted`, so both need to improve to make progress.
FunctionsActivityTracker
This is a non-trivial source of allocations under load. We flush the current `_functionMetricsQueue` collection every 5 seconds, and during that window we build up event allocations (of `FunctionMetrics`) only to summarize them and then hand them to the GC. Instead, we can track the metrics much more cheaply as we go. That's what this PR does, using a mutable struct (don't tell my mom) and a by-ref increment on it for each metric call. When it's time to record, we swap the dictionary out for a fresh one and iterate/record that snapshot.
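For readers, the shape of that change looks roughly like the sketch below. The names are illustrative only (the actual `FunctionsActivityTracker` code differs in details); the point is that each metric call mutates a struct entry in place rather than enqueueing an event object, and the flush path only pays for a dictionary swap plus one pass over at most distinct-function-count entries.

```csharp
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Hypothetical summary type; the real tracker records more stages.
internal struct FunctionMetricSummary
{
    public long StartedCount;
    public long SucceededCount;
    public long FailedCount;
    public long TotalExecutionTimeInMs;
}

// Minimal sketch of "increment a mutable struct by ref, swap the dictionary at flush time".
internal sealed class FunctionActivitySketch
{
    private readonly object _sync = new object();
    private Dictionary<string, FunctionMetricSummary> _current = new();

    public void FunctionStarted(string functionName)
    {
        lock (_sync)
        {
            // Get (or add) the entry by ref and mutate it in place - no per-call event allocation.
            ref FunctionMetricSummary summary = ref CollectionsMarshal.GetValueRefOrAddDefault(_current, functionName, out _);
            summary.StartedCount++;
        }
    }

    public void FunctionCompleted(string functionName, bool success, long elapsedMs)
    {
        lock (_sync)
        {
            ref FunctionMetricSummary summary = ref CollectionsMarshal.GetValueRefOrAddDefault(_current, functionName, out _);
            if (success)
            {
                summary.SucceededCount++;
            }
            else
            {
                summary.FailedCount++;
            }
            summary.TotalExecutionTimeInMs += elapsedMs;
        }
    }

    // Called by the 5-second timer: swap in an empty dictionary and hand back the snapshot to record.
    public Dictionary<string, FunctionMetricSummary> SnapshotAndReset()
    {
        lock (_sync)
        {
            var snapshot = _current;
            _current = new Dictionary<string, FunctionMetricSummary>();
            return snapshot;
        }
    }
}
```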
MetricsEventManager
In here, `QueuedEvents` is another allocation and lock source, via the captures passed into the dictionary's `AddOrUpdate` arguments and the locking contention within (especially around updates in a load test).
Note that `EventKey` is a struct with more code than a `readonly record struct` would need, which would be much cleaner; this is because StyleCop has an intermittent compile bug with records that is fixed but not yet released. I added a comment around this - we can go back and swap it quickly later. That swap is predicated on case-sensitive comparison being okay. I regressed performance a bit in this PR to maintain compatibility, but I think we'd be okay dropping the case-insensitive comparison - if so, we can make that change and increase perf with minimal changes today (see a9e043753250ec30dc20ae1b1583aeaa9b67883a for the diff).
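For context, the two key shapes being discussed look roughly like the sketch below. The property names are illustrative, not necessarily the actual `EventKey` members; the general win of a struct key over a concatenated string key is that building it allocates nothing.

```csharp
using System;

// What the key could collapse to once the StyleCop fix ships - but default record
// equality is case-sensitive, hence the compatibility question above:
// internal readonly record struct EventKey(string EventName, string FunctionName);

// Hand-written, case-insensitive version kept for compatibility (slightly slower to hash/compare).
internal readonly struct EventKey : IEquatable<EventKey>
{
    public EventKey(string eventName, string functionName)
    {
        EventName = eventName;
        FunctionName = functionName;
    }

    public string EventName { get; }

    public string FunctionName { get; }

    public bool Equals(EventKey other) =>
        string.Equals(EventName, other.EventName, StringComparison.OrdinalIgnoreCase)
        && string.Equals(FunctionName, other.FunctionName, StringComparison.OrdinalIgnoreCase);

    public override bool Equals(object obj) => obj is EventKey other && Equals(other);

    public override int GetHashCode() => HashCode.Combine(
        StringComparer.OrdinalIgnoreCase.GetHashCode(EventName ?? string.Empty),
        StringComparer.OrdinalIgnoreCase.GetHashCode(FunctionName ?? string.Empty));
}
```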
Overall, this PR has the following benefits:
- Far fewer allocations per metric call
- Aggregation goes from many allocations to none (see the benchmarks in the outdated comments below for the LINQ overhead)
- Less LOH (large object heap) interaction as no collection gets very large (total ceiling cardinality is the distinct function count)
- Recording time is now O(1) rather than O(n^2)
- No locking around `QueuedEvents.AddOrUpdate`
- No capture allocations for `QueuedEvents`
It's meant to be a cheaper/faster way to get the same metrics. This helps in all cases, but even more so under load tests, where we aggregate more events per 5-second window.
Issue describing the changes in this PR
Resolves a component of #7908
Pull request checklist
- [x] My changes do not require documentation changes
- [ ] Otherwise: Documentation issue linked to PR
- [x] My changes should not be added to the release notes for the next release
- [ ] Otherwise: I've added my notes to `release_notes.md`
- [x] My changes do not need to be backported to a previous version
- [ ] Otherwise: Backport tracked by issue/PR #issue_or_pr
- [x] I have added all required tests (Unit tests, E2E tests)
Crank benchmarks
application | dev2-1 | eventsstruct2-7 | Δ |
---|---|---|---|
CPU Usage (%) | 7 | 7 | 0.00% |
Cores usage (%) | 202 | 200 | -0.99% |
Working Set (MB) | 308 | 359 | +16.56% |
Private Memory (MB) | 1,133 | 1,157 | +2.12% |
Build Time (ms) | 17,993 | 19,415 | +7.90% |
Start Time (ms) | 4,778 | 4,814 | +0.75% |
Published Size (KB) | 674,687 | 674,685 | -0.00% |
.NET Core SDK Version | 6.0.101 | 6.0.101 | |
Max CPU Usage (%) | 99 | 99 | 0.00% |
Max Working Set (MB) | 322 | 365 | +13.35% |
Max GC Heap Size (MB) | 128 | 157 | +22.66% |
Size of committed memory by the GC (MB) | 134 | 186 | +38.81% |
Max Number of Gen 0 GCs / sec | 13.00 | 13.00 | 0.00% |
Max Number of Gen 1 GCs / sec | 6.00 | 6.00 | 0.00% |
Max Number of Gen 2 GCs / sec | 1.00 | 1.00 | 0.00% |
Max Time in GC (%) | 6.00 | 7.00 | +16.67% |
Max Gen 0 Size (B) | 55,366,672 | 20,417,288 | -63.12% |
Max Gen 1 Size (B) | 9,674,736 | 10,629,888 | +9.87% |
Max Gen 2 Size (B) | 26,395,496 | 17,749,984 | -32.75% |
Max LOH Size (B) | 9,321,280 | 4,601,568 | -50.63% |
Max Allocation Rate (B/sec) | 333,640,760 | 326,232,888 | -2.22% |
Max GC Heap Fragmentation | 77 | 49 | -35.61% |
# of Assemblies Loaded | 201 | 201 | 0.00% |
Max Exceptions (#/s) | 279 | 261 | -6.45% |
Max Lock Contention (#/s) | 114 | 80 | -29.82% |
Max ThreadPool Threads Count | 32 | 30 | -6.25% |
Max ThreadPool Queue Length | 95 | 92 | -3.16% |
Max ThreadPool Items (#/s) | 36,150 | 36,719 | +1.57% |
Max Active Timers | 98 | 93 | -5.10% |
IL Jitted (B) | 1,054,878 | 1,056,139 | +0.12% |
Methods Jitted | 10,887 | 10,899 | +0.11% |
Avg Allocation Rate (B/sec) | 302,116,391 | 295,875,083 | -2.07% |
90th Allocation Rate (B/sec) | 331,095,488 | 323,239,744 | -2.37% |
load | dev2-1 | eventsstruct2-7 | Δ |
---|---|---|---|
CPU Usage (%) | 1 | 1 | 0.00% |
Cores usage (%) | 36 | 38 | +5.56% |
Working Set (MB) | 37 | 37 | 0.00% |
Private Memory (MB) | 357 | 357 | 0.00% |
Start Time (ms) | 0 | 0 | |
First Request (ms) | 350 | 342 | -2.29% |
Requests/sec | 8,072 | 8,183 | +1.38% |
Requests | 485,103 | 491,761 | +1.37% |
Mean latency (ms) | 13.95 | 13.49 | -3.30% |
Max latency (ms) | 179.16 | 163.82 | -8.56% |
Bad responses | 0 | 0 | |
Socket errors | 0 | 0 | |
Read throughput (MB/s) | 1.24 | 1.26 | +1.61% |
Latency 50th (ms) | 10.02 | 9.93 | -0.90% |
Latency 75th (ms) | 15.93 | 15.25 | -4.27% |
Latency 90th (ms) | 28.37 | 26.86 | -5.32% |
Latency 99th (ms) | 65.38 | 60.27 | -7.82% |
Have to break for boosters here, but I'll move this to draft and go figure out why the build is angry.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
While this is a ~1% gain, I'm not super happy with the structure here - we're still allocating in a slab mechanism to the LOH, which feels like we can do better. Going to poke at this a bit more.
Notes for later: we can replace the LINQ here with a dictionary build and do an atomic swap of the queue to eliminate the copy in a snapshot. Example benchmark:
```csharp
#LINQPad optimize+ // Enable compiler optimizations

// Namespace imports (via LINQPad query properties): BenchmarkDotNet.Attributes,
// BenchmarkDotNet.Running, System.Collections.Concurrent, System.Runtime.InteropServices

void Main()
{
    Util.AutoScrollResults = true;
    BenchmarkRunner.Run<LinqBenchmarks>();
}

[MemoryDiagnoser]
public class LinqBenchmarks
{
    private readonly ConcurrentQueue<FunctionMetrics> _functionMetricsQueue = new ConcurrentQueue<FunctionMetrics>();

    [GlobalSetup]
    public void Setup()
    {
        // Generate enough to make it non-trivial
        for (var i = 0; i < 100; i++)
        {
            for (var j = 0; j < 50; j++)
            {
                _functionMetricsQueue.Enqueue(new FunctionMetrics("Test Name" + j.ToString(), ExecutionStage.Started, 0));
                _functionMetricsQueue.Enqueue(new FunctionMetrics("Test Name" + j.ToString(), ExecutionStage.InProgress, 0));
                _functionMetricsQueue.Enqueue(new FunctionMetrics("Test Name" + j.ToString(), ExecutionStage.Succeeded, 4));
                _functionMetricsQueue.Enqueue(new FunctionMetrics("Test Name" + j.ToString(), ExecutionStage.Finished, 4));
            }
        }
    }

    [Benchmark]
    public void LinqMethod()
    {
        var aggregatedEventsPerFunction = from item in _functionMetricsQueue
                                          group item by item.FunctionName into functionGroups
                                          select new
                                          {
                                              FunctionName = functionGroups.Key,
                                              StartedCount = Convert.ToUInt64(functionGroups.Count(x => x.ExecutionStage == ExecutionStage.Started)),
                                              FailedCount = Convert.ToUInt64(functionGroups.Count(x => x.ExecutionStage == ExecutionStage.Failed)),
                                              SucceededCount = Convert.ToUInt64(functionGroups.Count(x => x.ExecutionStage == ExecutionStage.Succeeded)),
                                              TotalExectionTimeInMs = Convert.ToUInt64(functionGroups.Sum(x => Convert.ToDecimal(x.ExecutionTimeInMS)))
                                          };

        foreach (var functionEvent in aggregatedEventsPerFunction)
        {
            GC.KeepAlive(functionEvent);
        }
    }

    [Benchmark]
    public void DictionaryMethod()
    {
        var aggregatedEventsPerFunction = new Dictionary<string, MetricSummary>();
        foreach (var metric in _functionMetricsQueue)
        {
            if (!aggregatedEventsPerFunction.TryGetValue(metric.FunctionName, out var summary))
            {
                summary = default;
            }
            switch (metric.ExecutionStage)
            {
                case ExecutionStage.Started:
                    summary.StartedCount++;
                    break;
                case ExecutionStage.Failed:
                    summary.FailedCount++;
                    break;
                case ExecutionStage.Succeeded:
                    summary.SucceededCount++;
                    break;
            }
            summary.TotalExectionTimeInMs += metric.ExecutionTimeInMS;
            aggregatedEventsPerFunction[metric.FunctionName] = summary;
        }

        foreach (var functionEvent in aggregatedEventsPerFunction)
        {
            GC.KeepAlive(functionEvent);
        }
    }

    [Benchmark]
    public void DictionaryMarshallMethod()
    {
        var aggregatedEventsPerFunction = new Dictionary<string, MetricSummary>();
        foreach (var metric in _functionMetricsQueue)
        //while (_functionMetricsQueue.TryDequeue(out var metric)) // don't purge benchmark data...
        {
            ref var summary = ref CollectionsMarshal.GetValueRefOrAddDefault(aggregatedEventsPerFunction, metric.FunctionName, out _);
            switch (metric.ExecutionStage)
            {
                case ExecutionStage.Started:
                    summary.StartedCount++;
                    break;
                case ExecutionStage.Failed:
                    summary.FailedCount++;
                    break;
                case ExecutionStage.Succeeded:
                    summary.SucceededCount++;
                    break;
            }
            summary.TotalExectionTimeInMs += metric.ExecutionTimeInMS;
        }

        foreach (var functionEvent in aggregatedEventsPerFunction)
        {
            GC.KeepAlive(functionEvent);
        }
    }

    private struct MetricSummary
    {
        public long StartedCount;
        public long FailedCount;
        public long SucceededCount;
        public long TotalExectionTimeInMs;
    }
}

public readonly struct FunctionMetrics
{
    public readonly long ExecutionTimeInMS;
    public readonly ExecutionStage ExecutionStage;
    public readonly string FunctionName;

    public FunctionMetrics(string functionName, ExecutionStage executionStage, long executionTimeInMS)
    {
        FunctionName = functionName;
        ExecutionStage = executionStage;
        ExecutionTimeInMS = executionTimeInMS;
    }
}

public enum ExecutionStage : byte
{
    Started,
    InProgress,
    Finished,
    Failed,
    Succeeded
}
```
Results:
Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
---|---|---|---|---|---|---|
LinqMethod | 1,585.7 μs | 30.33 μs | 36.10 μs | 74.2188 | 27.3438 | 1,228 KB |
DictionaryMethod | 829.9 μs | 13.61 μs | 12.06 μs | - | - | 11 KB |
DictionaryMarshallMethod | 585.1 μs | 11.32 μs | 13.47 μs | - | - | 11 KB |
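The benchmark above only covers the dictionary build; the atomic-swap half of the note would look roughly like the sketch below (reusing `FunctionMetrics`/`ExecutionStage` from the benchmark, with the queue field deliberately not `readonly` so it can be exchanged). This is a sketch of the idea, not the host code.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Threading;

internal sealed class MetricsFlushSketch
{
    private struct MetricSummary
    {
        public long StartedCount;
        public long FailedCount;
        public long SucceededCount;
        public long TotalExecutionTimeInMs;
    }

    // Not readonly: it gets swapped wholesale at flush time.
    private ConcurrentQueue<FunctionMetrics> _functionMetricsQueue = new ConcurrentQueue<FunctionMetrics>();

    public void Add(FunctionMetrics metric) => _functionMetricsQueue.Enqueue(metric);

    public void Flush()
    {
        // Swap in a fresh queue; producers keep enqueueing into the new one while
        // we aggregate the detached snapshot - no intermediate copy of the queue.
        var snapshot = Interlocked.Exchange(ref _functionMetricsQueue, new ConcurrentQueue<FunctionMetrics>());

        var aggregated = new Dictionary<string, MetricSummary>();
        while (snapshot.TryDequeue(out var metric))
        {
            ref var summary = ref CollectionsMarshal.GetValueRefOrAddDefault(aggregated, metric.FunctionName, out _);
            switch (metric.ExecutionStage)
            {
                case ExecutionStage.Started:
                    summary.StartedCount++;
                    break;
                case ExecutionStage.Failed:
                    summary.FailedCount++;
                    break;
                case ExecutionStage.Succeeded:
                    summary.SucceededCount++;
                    break;
            }
            summary.TotalExecutionTimeInMs += metric.ExecutionTimeInMS;
        }

        // ...record `aggregated` here...
    }
}
```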
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
pinging @brettsam & @mathewc for a second pass!
bumping again :)
@fabiocav @mathewc Dean and I see two places to change to interlocked increments (107/132) to get more accuracy than the existing (pre-PR) code, which can drop some events, but I don't want to set everything up again if we're not going to do anything with it anyway. Should I just close this PR or try to update it later? There's a high time cost to re-profiling, so I'm asking before chasing anything here.
@NickCraver please go ahead with the changes. I've discussed this with the team and assigned some people to help drive the review (also including @mathewc).
@NickCraver any updates on this?
@liliankasem @fabiocav I'm swamped in Antares for the foreseeable future and don't have this set up after a machine pave. The totality of the changes I'm talking about is changing `queuedEvent.Count++;` and `queuedEvent.Value.Count++;` to `Interlocked.Increment` for total accuracy - is that something the team can take as part of this? (Should have push permissions.) It'll be a while before I get back to Functions in general with the current queue - apologies!
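For reference, the change being described amounts to swapping the non-atomic increments for atomic ones - roughly the following sketch, where `QueuedEvent` is a hypothetical stand-in for the queued-event entry type (the real counters live in `MetricsEventManager`):

```csharp
using System.Threading;

// Hypothetical stand-in for the queued event entry; only the counter matters here.
internal sealed class QueuedEvent
{
    // Must be a field (not a property) so it can be passed by ref to Interlocked.
    public long Count;
}

internal static class InterlockedCountSketch
{
    public static void Record(QueuedEvent queuedEvent)
    {
        // Before: queuedEvent.Count++;  // non-atomic read-modify-write, can lose counts under contention
        Interlocked.Increment(ref queuedEvent.Count);
    }
}
```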
No worries at all, thank you for updating us! I'll sync with @fabiocav about what we can do to get this merged from our side
Opened issue to address this in v5: https://github.com/Azure/azure-functions-host/issues/9173