
PGO: Add new tiers

Open EgorBo opened this issue 2 years ago • 45 comments

This PR implements @jkotas's idea from https://github.com/dotnet/runtime/issues/70410#issuecomment-1149874663 for when DOTNET_TieredPGO is enabled (it's off by default and will remain so in .NET 7.0):

  1. Use R2R code for process startup
  2. Once the process starts up, switch to instrumented code for hot methods
  3. Once enough PGO data is collected, create an optimized jitted version
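The three phases above can be modeled as a tiny state machine. This is a hypothetical Python sketch, not runtime code; the tier names and the 30-call threshold are taken from the diagram below:

```python
# Hypothetical model of the tiering lifecycle described above.
# Not actual runtime code: tier names and the call threshold (30)
# mirror the flowchart in this PR.

CALL_THRESHOLD = 30

class Method:
    def __init__(self, prejitted):
        # Phase 1: R2R code (if available) keeps startup fast.
        self.tier = "R2R" if prejitted else "InstrumentedTier"
        self.calls = 0

    def invoke(self):
        self.calls += 1
        if self.calls >= CALL_THRESHOLD:
            self.calls = 0
            if self.tier == "R2R":
                # Phase 2: hot R2R code is re-jitted with instrumentation.
                self.tier = "InstrumentedTier"
            elif self.tier == "InstrumentedTier":
                # Phase 3: enough profile data collected -> optimized Tier1.
                self.tier = "Tier1"
        return self.tier

m = Method(prejitted=True)
for _ in range(60):
    m.invoke()
print(m.tier)  # Tier1
```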

Design

flowchart
    prestub(.NET Function) -->|Compilation| hasAO{"Marked with<br/>[AggressiveOpts]?"}
    hasAO-->|Yes|tier1ao["JIT to <b><ins>Tier1</ins></b><br/><br/>(that attribute is extremely<br/> rarely a good idea)"]
    hasAO-->|No|hasR2R
    hasR2R{"Is prejitted (R2R)<br/>and ReadyToRun==1?"} -->|No| istrTier0Q

    istrTier0Q{"<b>TieredPGO_Strategy:</b><br/>Instrument only<br/>hot Tier0 code?"}
    istrTier0Q-->|No, always instrument tier0|tier0
    istrTier0Q-->|Yes, only hot|tier000
    tier000["JIT to <b><ins>Tier0</ins></b><br/><br/>(not optimized, not instrumented,<br/> with patchpoints)"]-->|Running...|ishot555
    ishot555{"Is hot?<br/>(called >30 times)"}
    ishot555-.->|No,<br/>keep running...|ishot555
    ishot555-->|Yes|tier0
   
    hasR2R -->|Yes| R2R
    R2R["Use <b><ins>R2R</ins></b> code<br/><br/>(optimized, not instrumented,<br/>with patchpoints)"] -->|Running...|ishot1
    ishot1{"Is hot?<br/>(called >30 times)"}-.->|No,<br/>keep running...|ishot1
    ishot1--->|"Yes"|instrumentR2R

    instrumentR2R{"<b>TieredPGO_Strategy:</b><br/>Instrument hot<br/>R2R'd code?"}
    instrumentR2R-->|Yes, instrument R2R'd code|istier1inst
    instrumentR2R-->|No, don't instrument R2R'd code|tier1nopgo["JIT to <b><ins>Tier1</ins></b><br/><br/>(no dynamic profile data)"]

    tier0["JIT to <b><ins>InstrumentedTier</ins></b><br/><br/>(not optimized, instrumented,<br/> with patchpoints)"]-->|Running...|ishot5
    tier1pgo2["JIT to <b><ins>Tier1</ins></b><br/><br/>(optimized with profile data)"]
    tier1pgo2_1["JIT to <b><ins>Tier1</ins></b><br/><br/>(optimized with profile data)"]
      
    istier1inst{"<b>TieredPGO_Strategy:</b><br/>Enable optimizations<br/>for InstrumentedTier?"}-->|"No"|tier0_1
    istier1inst--->|"Yes"|tier1inst["JIT to <b><ins>InstrumentedTierOptimized</ins></b><br/><br/>(optimized, instrumented, <br/>with patchpoints)"]
    tier1inst-->|Running...|ishot5_1
    ishot5{"Is hot?<br/>(called >30 times)"}-->|Yes|tier1pgo2
    ishot5-.->|No,<br/>keep running...|ishot5

    
    ishot5_1{"Is hot?<br/>(called >30 times)"}
    ishot5_1-.->|No,<br/>keep running...|ishot5_1
    ishot5_1{"Is hot?<br/>(called >30 times)"}-->|Yes|tier1pgo2_1

    tier0_1["JIT to <b><ins>InstrumentedTier</ins></b><br/><br/>(not optimized, instrumented,<br/> with patchpoints)"]
    tier0_1-->|Running...|ishot5_1
    
    style istrTier0Q fill:#c3e3ba
    style instrumentR2R fill:#c3e3ba
    style istier1inst fill:#c3e3ba
    style tier000 fill:#c3e3ba
    style tier0_1 fill:#c3e3ba
    style ishot5_1 fill:#c3e3ba
    style ishot555 fill:#c3e3ba
    style tier1inst fill:#c3e3ba
    style tier1pgo2_1 fill:#c3e3ba

(this PR adds green blocks)

Benchmarks

DOTNET_TieredPGO=1 is enabled for both Main and PR:

(screenshot with benchmark results)

EgorBo avatar Jun 18 '22 19:06 EgorBo

cc @noahfalk @kouvel

jkotas avatar Jun 18 '22 19:06 jkotas

Current results: (screenshot)

Both base and diff use a locally built coreclr that matches the SDK (latest daily), and both use DOTNET_TieredPGO=1. The base results match the official results - OK. diff is much slower for some reason, but the fact that it compiled more functions makes perfect sense. I am downloading traces for both cases now to analyze JIT events. It feels like some functions never make it to tier1 (I tried playing with various knobs like call counting, OSR, etc.)

EgorBo avatar Jun 18 '22 20:06 EgorBo

Ah, looks like my approach to checking whether we compile from R2R is wrong:

Assert failure(PID 31732 [0x00007bf4], Thread: 43456 [0xa9c0]): !MethodDescBackpatchInfoTracker::IsLockOwnedByCurrentThread() || IsInForbidSuspendForDebuggerRegion()

CORECLR! Thread::RareDisablePreemptiveGC + 0x131 (0x00007ff8`deafb481)
CORECLR! Thread::DisablePreemptiveGC + 0xC5 (0x00007ff8`de562d75)
CORECLR! GCCoopHackNoThread::GCCoopHackNoThread + 0x113 (0x00007ff8`de6c2393)
CORECLR! HashMap::LookupValue + 0x172 (0x00007ff8`de6c37a2)
CORECLR! ReadyToRunInfo::GetMethodDescForEntryPointInNativeImage + 0x1CD (0x00007ff8`de86fa1d)
CORECLR! ReadyToRunInfo::GetMethodDescForEntryPoint + 0xBB (0x00007ff8`de86f7fb)
CORECLR! TieredCompilationManager::AsyncPromoteToTier1 + 0x22D (0x00007ff8`de81f88d)

I guess it kills the background thread and nothing is promoted to tier1 afterwards.

EgorBo avatar Jun 18 '22 21:06 EgorBo

So far I don't see any errors in the logs, but it seems that in a highly concurrent environment (I managed to create a small repro) it stops promoting such methods to tier1, and I don't understand why; I'll keep looking..

https://gist.github.com/EgorBo/01b84f56e081724cf74b2518fe8b3d7c

Although, it doesn't repro when a debugger is attached, so it must be some timing issue. I tried e.g. TrySetCodeEntryPointAndRecordMethodForCallCounting etc., but it seems OnCallCountThresholdReached is not invoked, so maybe a call-counting stub is not created?

PS: CI failures seem known and unrelated

EgorBo avatar Jun 18 '22 23:06 EgorBo

Looks like the problem is that I have to reset the call-counting cell, change its stub, and change its state to keep it alive.

Or, with the way it is implemented now, does it end up with two call-counting stubs between callee and caller?

EgorBo avatar Jun 19 '22 14:06 EgorBo

Correct me if I am wrong, but here is what is happening:

Imagine A() calls B() which is R2R'd:

A() -> [precode] -> [callcounting] -> R2R'd B()

after 30 calls we get this (with this PR):

A() -> [precode] -> tier0 B()

from my understanding I have to either:

  1. Allocate a new call-counting stub and patch [precode] again
  2. "Patch" the initial [callcounting] stub to point at tier0 instead of R2R and reset the call-counting cell
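Option 2 could be sketched like this (hypothetical Python; `target`, `remaining`, and `on_threshold` are illustrative names, not the actual runtime fields, which live in native code):

```python
# Hypothetical sketch of option 2: reuse the existing call-counting
# stub by retargeting it and resetting its counter cell, instead of
# allocating a second stub between caller and callee.

class CallCountingStub:
    def __init__(self, target, threshold=30):
        self.target = target          # where the stub forwards calls
        self.remaining = threshold    # the call-counting cell
        self.on_threshold = None      # promotion callback

    def call(self):
        self.remaining -= 1
        if self.remaining == 0 and self.on_threshold:
            self.on_threshold(self)
        return self.target

    def retarget(self, new_target, threshold=30):
        # "Patch" the stub: point at the new code version and
        # reset the cell so counting starts over.
        self.target = new_target
        self.remaining = threshold

stub = CallCountingStub("R2R B()")
stub.on_threshold = lambda s: s.retarget("tier0 B()")
for _ in range(30):
    stub.call()
print(stub.target)  # tier0 B()
```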

(screenshot)

EgorBo avatar Jun 19 '22 22:06 EgorBo

> 2. "Patch" the initial [callcounting] stub to look at tier0 instead of r2r and reset the call-counting-cell

It sounds better to me, as we never delete the call counting stubs, so this would leave less garbage.

janvorli avatar Jun 20 '22 11:06 janvorli

When I was doing tiered compilation originally one of the benchmarks I found helpful was MusicStore and I had the app measure its own latency for requests 0-500, 501-1000, 1000-1500, and so on. This helped me get an idea how quickly an app was able to converge to the steady state behavior. Completely up to you if a similar analysis would be useful now. One hypothesis I'd have for the worse TE numbers is that the benchmark might be short enough that it is capturing a substantial amount of pre-steady-state behavior.

noahfalk avatar Jun 20 '22 20:06 noahfalk

> When I was doing tiered compilation originally one of the benchmarks I found helpful was MusicStore and I had the app measure its own latency for requests 0-500, 501-1000, 1000-1500, and so on. This helped me get an idea how quickly an app was able to converge to the steady state behavior. Completely up to you if a similar analysis would be useful now. One hypothesis I'd have for the worse TE numbers is that the benchmark might be short enough that it is capturing a substantial amount of pre-steady-state behavior.

Thanks! When patching the call-counting stub, we might consider resetting the call-counting cell to some smaller number (e.g. 10)

EgorBo avatar Jun 20 '22 20:06 EgorBo

Just realized that I can also introduce a new tier for non-r2r cases:

tier0 -> instrumented tier0 -> tier1

This solves a different problem we have now - instrumentation is quite heavy (both in terms of throughput and perf). However, as Andy noted, we need to be careful around OSR.

I have a demo locally, for now I am allocating a new callcounting stub every time because it's simpler

EgorBo avatar Jun 21 '22 22:06 EgorBo

For FullPGO mode, where we don't have R2R, my prototype shows a +10% improvement for "Startup time + time to first request" while maintaining the same RPS, as expected. Something like this: (screenshot)

(it's tier0 -> 30 calls -> tier0 with instrumentation -> 30 calls -> tier1) - worth trying different thresholds in-between
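That chain, with independently tunable thresholds, can be modeled as below (hypothetical Python; the function and parameter names are illustrative). It shows why the two thresholds need not be equal:

```python
# Hypothetical model of the FullPGO chain:
# tier0 -> (N calls) -> instrumented tier0 -> (M calls) -> tier1.
# The thresholds are parameters precisely because trying different
# values between the two hops is worth experimenting with.

def promote(calls, first_threshold=30, second_threshold=30):
    """Return the tier a method would be at after `calls` invocations."""
    if calls < first_threshold:
        return "Tier0"
    if calls < first_threshold + second_threshold:
        return "Tier0-Instrumented"
    return "Tier1"

print(promote(10))          # Tier0
print(promote(45))          # Tier0-Instrumented
print(promote(60))          # Tier1
print(promote(42, 30, 10))  # Tier1 (reached sooner with a smaller second threshold)
```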

EgorBo avatar Jun 21 '22 23:06 EgorBo

DOTNET_TieredPGO=1

| application           | eeeg_plaintext_mvc_Pgo_0_base | eeeg_plaintext_mvc_Pgo_0_diff |        |
| --------------------- | ----------------------------- | ----------------------------- | ------ |
| CPU Usage (%)         |                            99 |                           100 | +1.01% |
| Cores usage (%)       |                         2,764 |                         2,799 | +1.27% |
| Working Set (MB)      |                           577 |                           580 | +0.52% |
| Private Memory (MB)   |                         1,770 |                         1,770 |  0.00% |
| Build Time (ms)       |                         3,708 |                         3,741 | +0.89% |
| Start Time (ms)       |                           267 |                           268 | +0.37% |
| Published Size (KB)   |                       111,441 |                       111,441 |  0.00% |
| .NET Core SDK Version |     7.0.100-preview.6.22316.8 |     7.0.100-preview.6.22316.8 |        |

| load                   | eeeg_plaintext_mvc_Pgo_0_base | eeeg_plaintext_mvc_Pgo_0_diff |         |
| ---------------------- | ----------------------------- | ----------------------------- | ------- |
| CPU Usage (%)          |                            30 |                            34 | +13.33% |
| Cores usage (%)        |                           827 |                           941 | +13.78% |
| Working Set (MB)       |                            38 |                            38 |   0.00% |
| Private Memory (MB)    |                           358 |                           358 |   0.00% |
| Start Time (ms)        |                             0 |                             0 |         |
| First Request (ms)     |                           136 |                           136 |   0.00% |
| Requests/sec           |                     2,263,203 |                     2,892,576 | +27.81% |
| Requests               |                    34,173,291 |                    43,675,784 | +27.81% |
| Mean latency (ms)      |                          1.14 |                          0.92 | -19.30% |
| Max latency (ms)       |                         81.62 |                         50.21 | -38.48% |
| Bad responses          |                             0 |                             0 |         |
| Socket errors          |                             0 |                             0 |         |
| Read throughput (MB/s) |                        284.90 |                        364.13 | +27.81% |
| Latency 50th (ms)      |                          1.05 |                          0.80 | -23.52% |
| Latency 75th (ms)      |                          1.53 |                          1.18 | -22.88% |
| Latency 90th (ms)      |                          2.01 |                          1.58 | -21.39% |
| Latency 99th (ms)      |                          0.00 |                          0.00 |         |

+25-28% RPS (and latency) improvement with no regressions in First Request (ms). So basically we get "FullPGO"-level performance while maintaining the same startup time/time to first request as before.

DOTNET_TieredPGO=1 and DOTNET_ReadyToRun=0 (aka FullPGO mode)

| application           | eeeg_plaintext_mvc_FullPgo_1_base | eeeg_plaintext_mvc_FullPgo_1_diff |        |
| --------------------- | --------------------------------- | --------------------------------- | ------ |
| CPU Usage (%)         |                                96 |                                98 | +2.08% |
| Cores usage (%)       |                             2,699 |                             2,758 | +2.19% |
| Working Set (MB)      |                               568 |                               572 | +0.70% |
| Private Memory (MB)   |                             1,766 |                             1,770 | +0.23% |
| Build Time (ms)       |                             3,748 |                             3,769 | +0.56% |
| Start Time (ms)       |                               770 |                               701 | -8.96% |
| Published Size (KB)   |                           111,441 |                           111,441 |  0.00% |
| .NET Core SDK Version |         7.0.100-preview.6.22316.8 |         7.0.100-preview.6.22316.8 |        |

| load                   | eeeg_plaintext_mvc_FullPgo_1_base | eeeg_plaintext_mvc_FullPgo_1_diff |         |
| ---------------------- | --------------------------------- | --------------------------------- | ------- |
| CPU Usage (%)          |                                34 |                                34 |   0.00% |
| Cores usage (%)        |                               966 |                               952 |  -1.45% |
| Working Set (MB)       |                                38 |                                38 |   0.00% |
| Private Memory (MB)    |                               358 |                               358 |   0.00% |
| Start Time (ms)        |                                 0 |                                 0 |         |
| First Request (ms)     |                               309 |                               285 |  -7.77% |
| Requests/sec           |                         2,987,104 |                         2,949,697 |  -1.25% |
| Requests               |                        45,104,462 |                        44,539,799 |  -1.25% |
| Mean latency (ms)      |                              0.89 |                              0.90 |  +1.12% |
| Max latency (ms)       |                             55.03 |                             44.60 | -18.95% |
| Bad responses          |                                 0 |                                 0 |         |
| Socket errors          |                                 0 |                                 0 |         |
| Read throughput (MB/s) |                            376.03 |                            371.32 |  -1.25% |
| Latency 50th (ms)      |                              0.82 |                              0.80 |  -2.44% |
| Latency 75th (ms)      |                              1.20 |                              1.17 |  -2.50% |
| Latency 90th (ms)      |                              1.77 |                              1.59 | -10.17% |
| Latency 99th (ms)      |                              0.00 |                              0.00 |         |

For Full PGO: a -9% improvement in Start Time (ms) and -8% in First Request (ms), because we instrument only hot tier0 methods 🙂.

Projected improvements: (screenshot)

EgorBo avatar Jun 22 '22 13:06 EgorBo

New workflow (hello, github mermaid!):

  • moved to https://github.com/dotnet/runtime/pull/70941#issuecomment-1163471136

The only problem is that we don't instrument tier0 methods with loops, so when we decide to do OSR for a loop we compile an OSR Tier1 version without any profile data 😢 There are a few solutions:

  1. Always instrument methods with loops (see the comment around fgSwitchToOptimized) - however, R2R'd methods won't benefit from this
  2. Do two OSR promotions: tier0 -> tier0 OSR with instrumentation -> tier1 OSR -> tier0 with instrumentation -> tier1 (sounds complicated - 5 tiers)
  3. Leave as is for now - at least we won't get stuck in unoptimized code anyway, and the host method still might be re-jitted for instrumentation

PS: Technically a function can go through 4 different tiers, which makes RyuJIT a 4-tier compiler 😄!

EgorBo avatar Jun 22 '22 13:06 EgorBo

We can't currently leave Tier1-OSR and get back to Tier0 (at least not mid-method; if the method is called every so often we could switch at a call boundary, but that wouldn't help, say, Main, which is only called once).

What we could do instead is go from Tier0 (uninstrumented) to Tier0-OSR (instrumented) and then to Tier1-OSR (instrumented). This would give some PGO data, but we might not see the early parts of the method execute with instrumentation.

AndyAyersMS avatar Jun 22 '22 16:06 AndyAyersMS

I assume what you suggest looks like this:

graph TD
    prestub(.NET Function) --> hasAO{"Marked with<br/>[AggressiveOpts]?"}
    hasAO-->|Yes|tier1ao[JIT to Tier1]
    hasAO-->|No|hasR2R
    hasR2R{"Is AOT'd <br/> (R2R)?"} -->|No| tier0[JIT to Tier0]
    tier0 --> osr{Is OSR<br /> requested?}
    hasR2R -->|Yes and the whole<br/>method is invoked<br/>>30 times| tier0inst
    osr -->|Yes, e.g. hot loop<br/>in the method| tier0osr[JIT to Tier0-OSR<br/>with instrumentation] 
    tier0osr --> osr2{Is OSR<br/>requested?}
    osr2 -->|No, but the whole<br/>method is invoked<br/>>30 times|tier0inst
    osr2 -->|Yes, e.g. hot loop<br/>in the method| tier1osr[JIT to Tier1-OSR<br/>using profile data] 
    osr -->|No, but the whole<br/>method is invoked<br/>>30 times| tier0inst
    tier1osr-->|method is invoked<br/>>30 times| tier0inst["JIT to Tier0 <br/>with instrumentation<br/>No OSR"]
    tier0inst --> |method is invoked<br/>>30 times|tier1pgo[JIT to Tier1<br/>using profile data]

EgorBo avatar Jun 22 '22 18:06 EgorBo

Sort of? I can post what I think of as the right flow in a bit.

AndyAyersMS avatar Jun 22 '22 18:06 AndyAyersMS

Ok, so I decided to limit this PR to only "Hot R2R to Instrumented Tier0" and leave "Tier0 to Instrumented Tier0" for a separate PR, as it significantly complicates the flow (mainly due to OSR) and the test matrix; some methods might end up with 5 different versions per single method:

  1. Uninstrumented Tier0 with patchpoints
  2. Uninstrumented Tier0 with Instrumented Tier0-OSR with patchpoints
  3. Uninstrumented Tier0 with Optimized-with-profile Tier1-OSR without patchpoints
  4. Instrumented Tier0 without patchpoints
  5. Optimized-with-profile Tier1

(in the order of promotion)

So this PR still maintains the improvements shown in https://github.com/dotnet/runtime/pull/70941#issuecomment-1163069395 for TieredPGO=1 (~25% more RPS for Plaintext MVC), it just doesn't improve startup time yet (left for the follow-up PR)

The current flow looks like this when DOTNET_TieredPGO=1 is set:

graph TD
    prestub(.NET Function) -->|Compilation| hasAO{"Marked with<br/>[AggressiveOpts]?"}
    hasAO-->|Yes|tier1ao["JIT to <b><ins>Tier1</ins></b><br/><br/>(that attribute is extremely<br/> rarely a good idea)"]
    hasAO-->|No|hasR2R
    hasR2R{"Is prejitted (R2R)<br/>and ReadyToRun==1?"} -->|No| tier0
   
    hasR2R -->|Yes| R2R
    R2R["Use <b><ins>R2R</ins></b> code<br/><br/>(optimized, no patchpoints,<br/>not instrumented)"] -->|Running...|ishot1
    ishot1{"Is hot?<br/>(called >30 times)"}-.->|No,<br/>keep running...|ishot1
    ishot1-->|Yes|istier1inst
    tier0["JIT to <b><ins>Tier0-Instr</ins></b><br/><br/>(instrumented,<br/>with patchpoints)"] -->|Running...| osr
    osr{A patchpoint is hit<br/>and triggered OSR?} -->|Yes| tier1osr
    osr -->|No| ishot3
    ishot3{"Is hot?<br/>(called >30 times)"} -.->|No,<br/>keep running...| osr
    ishot3 -->|Yes|tier1pgo1
    tier1osr["JIT to <b><ins>Tier1-OSR-Instr</ins></b><br/><br/>(method is a mix of Tier0 + Tier1-OSR.<br/>OSR'd part is optimized with profile.<br/>Both parts are instrumented)"] -->|Running...|ishot4
    ishot4{"Is hot?<br/>(called >30 times)"}-->|Yes|tier1pgo1
    ishot4-.->|No,<br/>keep running...|ishot4
    tier1pgo1["JIT to <b><ins>Tier1</ins></b><br/><br/>(optimized with profile data)"]
    tier1pgo2["JIT to <b><ins>Tier1</ins></b><br/><br/>(optimized with profile data)"]
      
    istier1inst{"Is Instrumentation<br/>for Tier1 allowed?<br/>(default: yes)<br/>"}-.->|No|tier0
    istier1inst-->|Yes|tier1inst["JIT to <b><ins>Tier1-Instr</ins></b><br/><br/>(optimized, instrumented, <br/>no patchpoints)"]
    tier1inst-->ishot5
    ishot5{"Is hot?<br/>(called >30 times)"}-->|Yes|tier1pgo2
    ishot5-.->|No,<br/>keep running...|ishot5

    style ishot5 fill:#c3e3ba
    style istier1inst fill:#c3e3ba
    style tier1inst fill:#c3e3ba

(This PR added green blocks)

EgorBo avatar Jun 23 '22 12:06 EgorBo

PTAL @AndyAyersMS @janvorli @noahfalk @kouvel

EgorBo avatar Jun 23 '22 14:06 EgorBo

Ah, it turns out I didn't patch CallCountingStub to use the new version. Fixed. Also, I addressed @AndyAyersMS's concerns about R2R -> Tier0 not being OSR-friendly - fixed, and the diagram is updated.

Technically, a potentially hot method doesn't really need OSR, because once it's OSR'd it starts to collect a less accurate profile, since it instruments optimized code. But to avoid a potential "cold tier0 in a hot loop" trap, I enabled OSR back (see the diagram ^)

EgorBo avatar Jun 24 '22 09:06 EgorBo

@AndyAyersMS good news!! R2R -> Tier1-instr -> Tier1 works well, at least for the Plaintext MVC - same RPS improvement (+25%) 🎉

And since the instrumentation tier is now able to inline methods, it produces far fewer new compilations (as you were worried about) 👍

Added Tier1-inst path to the diagram

EgorBo avatar Jun 24 '22 12:06 EgorBo

Plaintext-MVC (citrine-linux-x64)

TieredPGO=1 (Main) vs TieredPGO=1 (this PR)

| application                             | pgo-old                   | pgo                       |         |
| --------------------------------------- | ------------------------- | ------------------------- | ------- |
| CPU Usage (%)                           |                        99 |                       100 |  +1.01% |
| Cores usage (%)                         |                     2,772 |                     2,798 |  +0.94% |
| Working Set (MB)                        |                       574 |                       580 |  +1.05% |
| Private Memory (MB)                     |                     1,775 |                     1,781 |  +0.34% |
| Build Time (ms)                         |                     3,861 |                     3,915 |  +1.40% |
| Start Time (ms)                         |                       282 |                       271 |  -3.90% |
| Published Size (KB)                     |                   111,441 |                   111,441 |   0.00% |
| .NET Core SDK Version                   | 7.0.100-preview.6.22316.8 | 7.0.100-preview.6.22316.8 |         |
| Max CPU Usage (%)                       |                        99 |                        99 |   0.00% |
| Max Working Set (MB)                    |                       606 |                       607 |  +0.16% |
| Max GC Heap Size (MB)                   |                       319 |                       305 |  -4.33% |
| Size of committed memory by the GC (MB) |                       444 |                       443 |  -0.31% |
| Max Number of Gen 0 GCs / sec           |                     11.00 |                     14.00 | +27.27% |
| Max Number of Gen 1 GCs / sec           |                      1.00 |                      1.00 |   0.00% |
| Max Number of Gen 2 GCs / sec           |                      1.00 |                      1.00 |   0.00% |
| Max Time in GC (%)                      |                      1.00 |                      1.00 |   0.00% |
| Max Gen 0 Size (B)                      |                 1,601,256 |                       528 | -99.97% |
| Max Gen 1 Size (B)                      |                 6,731,296 |                 8,401,376 | +24.81% |
| Max Gen 2 Size (B)                      |                 3,687,440 |                 3,668,656 |  -0.51% |
| Max LOH Size (B)                        |                    98,384 |                    98,384 |   0.00% |
| Max POH Size (B)                        |                 1,987,640 |                 2,041,200 |  +2.69% |
| Max Allocation Rate (B/sec)             |             3,216,545,432 |             3,881,553,736 | +20.67% |
| Max GC Heap Fragmentation               |                        17 |                         2 | -90.35% |
| # of Assemblies Loaded                  |                       111 |                       111 |   0.00% |
| Max Exceptions (#/s)                    |                       455 |                       462 |  +1.54% |
| Max Lock Contention (#/s)               |                       313 |                       406 | +29.71% |
| Max ThreadPool Threads Count            |                        48 |                        48 |   0.00% |
| Max ThreadPool Queue Length             |                       208 |                       195 |  -6.25% |
| Max ThreadPool Items (#/s)              |                   170,009 |                   213,446 | +25.55% |
| Max Active Timers                       |                         0 |                         0 |         |
| IL Jitted (B)                           |                   208,764 |                   296,695 | +42.12% |
| Methods Jitted                          |                     2,402 |                     3,235 | +34.68% |


| load                   | pgo-old    | pgo        |         |
| ---------------------- | ---------- | ---------- | ------- |
| CPU Usage (%)          |         30 |         34 | +13.33% |
| Cores usage (%)        |        844 |        938 | +11.14% |
| Working Set (MB)       |         38 |         38 |   0.00% |
| Private Memory (MB)    |        358 |        358 |   0.00% |
| Start Time (ms)        |          0 |          0 |         |
| First Request (ms)     |        141 |        135 |  -4.26% |
| Requests/sec           |  2,366,975 |  2,849,676 | +20.39% |
| Requests               | 35,740,018 | 43,028,516 | +20.39% |
| Mean latency (ms)      |       1.08 |       0.92 | -14.81% |
| Max latency (ms)       |      42.30 |      52.86 | +24.96% |
| Bad responses          |          0 |          0 |         |
| Socket errors          |          0 |          0 |         |
| Read throughput (MB/s) |     297.97 |     358.73 | +20.39% |
| Latency 50th (ms)      |       0.97 |       0.82 | -14.95% |
| Latency 75th (ms)      |       1.42 |       1.21 | -14.79% |
| Latency 90th (ms)      |       1.84 |       1.62 | -11.96% |
| Latency 99th (ms)      |       0.00 |       0.00 |         |

(the instrumented tier is tier1) Pure wins for TieredPGO=1: +20% more RPS and better latency. More methods are jitted because previously we didn't instrument R2R'd code - it went straight to Tier1 after R2R

Default (Main) vs TieredPGO (this PR)

| application                             | nopgo                     | pgo                       |         |
| --------------------------------------- | ------------------------- | ------------------------- | ------- |
| CPU Usage (%)                           |                        99 |                       100 |  +1.01% |
| Cores usage (%)                         |                     2,778 |                     2,798 |  +0.72% |
| Working Set (MB)                        |                       573 |                       580 |  +1.22% |
| Private Memory (MB)                     |                     1,773 |                     1,781 |  +0.45% |
| Build Time (ms)                         |                     4,151 |                     3,915 |  -5.69% |
| Start Time (ms)                         |                       271 |                       271 |   0.00% |
| Published Size (KB)                     |                   111,441 |                   111,441 |   0.00% |
| .NET Core SDK Version                   | 7.0.100-preview.6.22316.8 | 7.0.100-preview.6.22316.8 |         |
| Max CPU Usage (%)                       |                        99 |                        99 |   0.00% |
| Max Working Set (MB)                    |                       605 |                       607 |  +0.35% |
| Max GC Heap Size (MB)                   |                       256 |                       305 | +19.23% |
| Size of committed memory by the GC (MB) |                       445 |                       443 |  -0.52% |
| Max Number of Gen 0 GCs / sec           |                     11.00 |                     14.00 | +27.27% |
| Max Number of Gen 1 GCs / sec           |                      1.00 |                      1.00 |   0.00% |
| Max Number of Gen 2 GCs / sec           |                      1.00 |                      1.00 |   0.00% |
| Max Time in GC (%)                      |                     22.00 |                      1.00 | -95.45% |
| Max Gen 0 Size (B)                      |                       528 |                       528 |   0.00% |
| Max Gen 1 Size (B)                      |                 6,362,160 |                 8,401,376 | +32.05% |
| Max Gen 2 Size (B)                      |                 3,679,600 |                 3,668,656 |  -0.30% |
| Max LOH Size (B)                        |                    98,384 |                    98,384 |   0.00% |
| Max POH Size (B)                        |                 2,016,480 |                 2,041,200 |  +1.23% |
| Max Allocation Rate (B/sec)             |             3,037,975,504 |             3,881,553,736 | +27.77% |
| Max GC Heap Fragmentation               |                         1 |                         2 | +10.75% |
| # of Assemblies Loaded                  |                       111 |                       111 |   0.00% |
| Max Exceptions (#/s)                    |                       464 |                       462 |  -0.43% |
| Max Lock Contention (#/s)               |                       287 |                       406 | +41.46% |
| Max ThreadPool Threads Count            |                        48 |                        48 |   0.00% |
| Max ThreadPool Queue Length             |                       203 |                       195 |  -3.94% |
| Max ThreadPool Items (#/s)              |                   162,310 |                   213,446 | +31.51% |
| Max Active Timers                       |                         0 |                         0 |         |
| IL Jitted (B)                           |                   208,982 |                   296,695 | +41.97% |
| Methods Jitted                          |                     2,408 |                     3,235 | +34.34% |


| load                   | nopgo      | pgo        |         |
| ---------------------- | ---------- | ---------- | ------- |
| CPU Usage (%)          |         29 |         34 | +17.24% |
| Cores usage (%)        |        815 |        938 | +15.09% |
| Working Set (MB)       |         38 |         38 |   0.00% |
| Private Memory (MB)    |        358 |        358 |   0.00% |
| Start Time (ms)        |          0 |          0 |         |
| First Request (ms)     |        136 |        135 |  -0.74% |
| Requests/sec           |  2,241,138 |  2,849,676 | +27.15% |
| Requests               | 33,840,597 | 43,028,516 | +27.15% |
| Mean latency (ms)      |       1.15 |       0.92 | -20.00% |
| Max latency (ms)       |      49.11 |      52.86 |  +7.64% |
| Bad responses          |          0 |          0 |         |
| Socket errors          |          0 |          0 |         |
| Read throughput (MB/s) |     282.13 |     358.73 | +27.15% |
| Latency 50th (ms)      |       1.07 |       0.82 | -22.90% |
| Latency 75th (ms)      |       1.56 |       1.21 | -22.44% |
| Latency 90th (ms)      |       2.07 |       1.62 | -21.74% |
| Latency 99th (ms)      |       0.00 |       0.00 |         |

No regressions around start time/time to first request. Much better RPS and latency for PGO. +34% more methods jitted. Although, the benchmark is too simple to judge TieredPGO vs Default.

When we ask the JIT to use Tier0 for instrumentation (DOTNET_TC_InstrumentOptimizedCode=0) we jit even more methods - ~4,000

EgorBo avatar Jun 24 '22 13:06 EgorBo

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

EgorBo avatar Jun 24 '22 14:06 EgorBo

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Jun 24 '22 14:06 azure-pipelines[bot]

7-8% more RPS for the YARP proxy benchmark, TieredPGO=1 (Main) vs TieredPGO=1 (PR)

EgorBo avatar Jun 24 '22 14:06 EgorBo

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

EgorBo avatar Jun 26 '22 10:06 EgorBo

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Jun 26 '22 10:06 azure-pipelines[bot]

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

EgorBo avatar Jun 26 '22 12:06 EgorBo

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Jun 26 '22 12:06 azure-pipelines[bot]

Any feedback? The PR is ready for the case where the instrumented tier is not optimized; the optimized one still needs a few touches (around tail calls and block counters), but I'd love to hear some feedback first.

So this PR improves only the DOTNET_TieredPGO=1 mode (not enabled by default in Main) - see the benchmark results in the first comment. However, in theory we could enable DOTNET_TieredPGO=1 for only R2R'd methods in .NET 7.0: that would give us, e.g., +8% more RPS for the YARP proxy while maintaining the same startup speed. In that case we would need to prioritize broader testing for PGO.
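As a sketch, the hypothetical configuration described here would combine R2R startup code with profile-driven recompilation of hot methods (variable names as used elsewhere in this thread; the dll name is a placeholder):

```shell
export DOTNET_ReadyToRun=1   # keep using prejitted (R2R) code for startup (the default)
export DOTNET_TieredPGO=1    # instrument hot methods, then rejit them with profile data
dotnet proxy.dll             # placeholder for the YARP proxy benchmark
```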

PS: I think it makes sense to call this three tiers, so that the last (most optimized) tier becomes Tier2, and to refactor existing mentions of Tier1 accordingly.

EgorBo avatar Jun 27 '22 12:06 EgorBo

Sorry I was out sick. @kouvel is definitely the one we'd want to review this but he is out on vacation until next week.

Nice perf results! : )

noahfalk avatar Jun 29 '22 22:06 noahfalk

@kouvel @davidwrighton @AndyAyersMS

I think I've addressed all of your concerns & feedback

  1. I made the behavior completely opt-in at this point (so a user who only enables TieredPGO=1 will also need to set DOTNET_TieredPGO_Strategy to observe the new behavior).
  2. InstrumentationTier is unoptimized by default; I'll work separately on debugging small JIT issues in InstrumentationTierOptimized (around tail calls).
  3. I'll file quick PRs similar to dotnet/diagnostics/pull/2928 and microsoft/perfview/pull/1584 to those repos to add new tiers there as well if you approve the current PR.
  4. I added docs
  5. I don't patch existing call-counting stubs, to avoid race conditions for now. I will investigate how much memory they consume on close-to-real-world projects, but I assume not much - they're pretty small.
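To illustrate point 1, opting in would look roughly like this. The concrete DOTNET_TieredPGO_Strategy values are defined in this PR and shown here only as a placeholder:

```shell
export DOTNET_TieredPGO=1                 # existing switch; alone, behavior is unchanged
export DOTNET_TieredPGO_Strategy=<value>  # placeholder: selects one of the new instrumentation strategies
```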

EgorBo avatar Jul 15 '22 16:07 EgorBo

> I don't patch existing call-counting-stubs to avoid race conditions for now, will investigate how much memory they consume on close-to-real-world projects but I assume not much - they're pretty small

They can add up to a small but not insignificant amount. Call counting stubs/infos used to be deleted; at the moment in .NET 7 they are not. I think it would be reasonable to use different call counting stubs for now and make them deletable again in the future (perhaps leaking much less memory than currently).

kouvel avatar Jul 15 '22 20:07 kouvel

I won't be able to fully review this soon as I'll be on vacation, I'll take a closer look when I get back in a couple of weeks. cc @mangod9

kouvel avatar Jul 15 '22 20:07 kouvel

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

EgorBo avatar Jul 17 '22 10:07 EgorBo

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Jul 17 '22 10:07 azure-pipelines[bot]

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

EgorBo avatar Jul 17 '22 12:07 EgorBo

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Jul 17 '22 12:07 azure-pipelines[bot]

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Jul 17 '22 19:07 azure-pipelines[bot]


/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

EgorBo avatar Aug 07 '22 08:08 EgorBo

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Aug 07 '22 08:08 azure-pipelines[bot]

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

EgorBo avatar Aug 07 '22 13:08 EgorBo

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Aug 07 '22 13:08 azure-pipelines[bot]

@kouvel thanks for the detailed feedback - I think I've addressed everything, can you take a look again?

I wanted to land it in .NET 7.0 since it doesn't change the default mode (or the behavior when only TieredPGO is set) but gives users and first parties the ability to experiment with more complex PGO strategies.

EgorBo avatar Aug 07 '22 14:08 EgorBo

@EgorBo comments in https://github.com/dotnet/runtime/issues/70410 are already closed, so I'll comment here. For full PGO mode you can try to use MultiCoreJit to improve startup. The profile can be collected without any R2R images or with all R2R images (i.e. the first launch will still be fast), and startup during the second launch will be improved too.

cc @alpencolt

gbalykov avatar Sep 02 '22 15:09 gbalykov