runtime: green tea garbage collector

Open mknyszek opened this issue 7 months ago • 22 comments

Green Tea 🍵 Garbage Collector

Authors: Michael Knyszek, Austin Clements Updated: 2 May 2025

This issue tracks the design and implementation of the Green Tea garbage collector. As of the last update to this issue, development of Green Tea is still active. We'll produce a more detailed design document once we're ready to commit to a design. For now, Green Tea is available as an experiment at tip-of-tree and is planned to be available as an opt-in experiment in Go 1.25, once it releases. We encourage teams to try it out.

Introduction

Memory latency and bandwidth are becoming increasingly constrained as CPU clocks outpace DRAM clocks, increasing core counts offer more load on physically limited memory buses, and speed-of-light constraints necessitate increasingly non-uniform memory topologies. As a result, spatial locality, temporal locality, and topology-awareness are becoming ever more critical to high-performance systems.

Unfortunately, all of these trends are at odds with most of today’s garbage collection algorithms. Go’s garbage collector implements a classic tri-color parallel marking algorithm. This is, at its core, just a graph flood, where heap objects are nodes in the graph, and pointers are edges. However, this graph flood affords no consideration to the memory location of the objects that are being processed. As a result, it exhibits extremely poor spatial locality—jumping between completely different parts of memory—poor temporal locality—blithely spreading repeated accesses to the same memory across the GC cycle—and no concern for topology.

As a result, on average 85% of the garbage collector's time is spent in the core loop of this graph flood—the scan loop—and >35% of CPU cycles in the scan loop are spent solely stalled on memory accesses, excluding any knock-on effects. This problem is expected to only get worse as the industry trends toward many-core systems and non-uniform memory architectures.

In this document, we present Green Tea: a parallel marking algorithm that, if not memory-centric,[^1] is at least memory-aware, in that it endeavors to process objects close to one another together.

This new algorithm has an implementation that is ready for developers to trial on their workloads, and in this document we also present the results from evaluating this implementation against our benchmark suite. Overall, the algorithm shows a significant reduction in GC CPU costs on GC-heavy workloads.

Finally, this new marking algorithm unlocks new opportunities for future optimization, such as SIMD acceleration, which we discuss with other possible avenues of future work.

Design

The core idea behind the new parallel marking algorithm is simple. Instead of scanning individual objects, the garbage collector scans memory in much larger, contiguous blocks. The shared work queue tracks these coarse blocks instead of individual objects, and the individual objects waiting to be scanned in a block are tracked in that block itself. The core hypothesis is that while a block waits on the queue to be scanned, it will accumulate more objects to be scanned within that block, such that when a block does get dequeued, it’s likely that scanning will be able to scan more than one object in that block. This, in turn, improves locality of memory access, in addition to better amortizing per-scan costs.

Prototype implementation

In the prototype implementation of this new algorithm, the memory blocks we track are called spans. A span is always some multiple of 8 KiB, always aligned to 8 KiB, and consists entirely of objects of one size. Our prototype focuses exclusively on “small object spans”, which are exactly 8 KiB and contain objects up to 512 bytes.
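
For a concrete sense of scale, an 8 KiB span whose objects are all 48 bytes holds 170 object slots (8192 / 48, rounded down), so a single dequeued span can cover up to 170 objects' worth of scan work.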

A span is also the basic unit of storing heap metadata. In the prototype, each span stores two bits for each object: a gray bit and a black bit. These correspond to the tri-color abstraction: an object is black if it has been scanned, gray if it is in the queue to be scanned, and white if it has not been reached at all. In the prototype, white objects have neither bit set, gray objects have the gray bit set, and black objects have both bits set.

When scanning finds a pointer to a small object, it sets that object’s gray bit to indicate the object needs to be scanned. If the gray bit was not already set and the object’s span is not already enqueued for scanning, it enqueues the span. A per-span flag indicates whether the span is currently enqueued so it will only be enqueued once at a time. When the scan loop dequeues a span, it computes the difference between the gray bits and the black bits to identify objects to scan, copies the gray bits to the black bits, and scans any objects that had their gray bit set but not their black bit.
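
As a rough illustration of this flow, here is a toy, single-threaded sketch in Go. Everything in it (the fixed 64-byte object size, boolean arrays instead of bitmaps, "pointers" modeled as slot indices within one span) is a simplification for illustration, not the runtime's actual code:

package main

import "fmt"

const (
    spanSize = 8 << 10 // 8 KiB span
    objSize  = 64      // pretend every object in the span is 64 bytes
    nObjs    = spanSize / objSize
)

type span struct {
    gray, black [nObjs]bool  // per-object mark bits (bitmaps in the real design)
    enqueued    bool         // set while the span sits on the scan queue
    ptrs        [nObjs][]int // toy "pointers": indices of other objects in this span
}

// grey marks object i and enqueues the span if it isn't queued yet.
func (s *span) grey(i int, queue *[]*span) {
    if s.gray[i] {
        return // already pending or scanned
    }
    s.gray[i] = true
    if !s.enqueued {
        s.enqueued = true
        *queue = append(*queue, s)
    }
}

// scan processes every object greyed since the span was last scanned
// (gray AND NOT black), then copies the gray bits into the black bits.
func (s *span) scan(queue *[]*span) int {
    s.enqueued = false
    var toScan []int
    for i := range s.gray {
        if s.gray[i] && !s.black[i] {
            toScan = append(toScan, i)
        }
    }
    s.black = s.gray // array assignment copies the gray bits into the black bits
    for _, i := range toScan {
        for _, j := range s.ptrs[i] { // follow the object's outgoing pointers
            s.grey(j, queue)
        }
    }
    return len(toScan)
}

func main() {
    s := &span{}
    s.ptrs[0] = []int{1, 2} // object 0 points at objects 1 and 2
    var queue []*span
    s.grey(0, &queue) // a root points at object 0
    for len(queue) > 0 {
        next := queue[0]
        queue = queue[1:]
        fmt.Println("scanned", next.scan(&queue), "objects in one span pass")
    }
}

In this toy run the first pass scans one object, and the second pass scans two at once, which is exactly the batching effect the hypothesis above is after.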

Limiting scope to small objects

The prototype focuses on small objects because we derive the most benefit from them. The per-scan overhead of small objects is much harder to amortize because the garbage collector spends so little time scanning each individual object. Larger objects continue to use the old algorithm.

The choice of which algorithm to use is made when scanning encounters a pointer. The span allocator maintains a bitmap with one bit for each 8 KiB page in the heap, indicating whether that page is backed by a small object span. This bitmap's footprint fits easily into cache even for very large heaps, and contention on it is extremely low.

Since small object spans are always 8 KiB large and 8 KiB aligned, once the scanner knows the target of a pointer is in a small object span, it can use simple address arithmetic to find the object’s metadata within the span, thus avoiding indirections and dependent loads that seriously harm performance.
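
A minimal sketch of that lookup, under the assumptions stated above (8 KiB aligned spans, one object size per span); the bitmap, the per-page size table, and all the names here are illustrative stand-ins, not the runtime's data structures:

package main

import "fmt"

const pageShift = 13 // 8 KiB pages

// One bit per 8 KiB page: is this page backed by a small object span?
var smallSpanBitmap = make([]uint64, 1<<10)

func isSmallObjectSpan(page uintptr) bool {
    return smallSpanBitmap[page/64]&(1<<(page%64)) != 0
}

// Per-span object size, indexed by page (a stand-in for real span metadata).
var objSizeByPage = make([]uintptr, 1<<16)

// lookup resolves a pointer to (page, object index) with plain address
// arithmetic: no dependent loads through object headers.
func lookup(ptr uintptr) (page, objIndex uintptr, small bool) {
    page = ptr >> pageShift
    if !isSmallObjectSpan(page) {
        return 0, 0, false // not a small object span: use the old algorithm
    }
    base := page << pageShift                     // span base (8 KiB aligned)
    objIndex = (ptr - base) / objSizeByPage[page] // object slot within the span
    return page, objIndex, true
}

func main() {
    // Pretend page 5 holds a span of 48-byte objects.
    smallSpanBitmap[0] |= 1 << 5
    objSizeByPage[5] = 48
    ptr := uintptr(5<<pageShift + 200) // a pointer 200 bytes into that span
    fmt.Println(lookup(ptr))           // 5 4 true: byte offset 200 is slot 4
}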

Work distribution

Go’s current garbage collector distributes work across scanners by having each scanner maintain a local fixed-sized stack of object pointers. However, in order to ensure parallelism, each scanner aggressively checks and populates global lists. This frequent mutation of the global lists is a significant source of contention in Go programs on many-core systems.

The prototype implementation has a separate queue dedicated to spans and based on the distributed work-stealing runqueues used by the goroutine scheduler. The opportunity for stealing work directly from other workers means less contention on global lists. Furthermore, by queuing spans instead of individual objects, there are far fewer items to queue and thus inherently lower contention on the queues.

Span work may be ordered in several different ways. We explored several policies, including FIFO, LIFO, sparsest-first, densest-first, random, and address-ordered. FIFO turned out to accumulate the highest average density of objects to scan on a span by the time it was dequeued for scanning.

Single-object scan optimization

If a span has only a single object to scan when it is dequeued, the new algorithm will have done more work to handle that single object than the current, object-centric algorithm does.

To bring the performance of the single-object-per-span case more in line with the current marking algorithm, we apply two tricks. First, we track the object that was marked when the span was enqueued. This object becomes the span's representative until the span is scanned. Next, we add a hit flag to the span that indicates an object was marked while the span was queued; that is, that at least two objects are marked. When scanning a span, if the hit flag is not set, then the garbage collector can directly scan the span’s representative, instead of processing the entire span.
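
Continuing the toy sketch from the prototype section, the enqueue and dequeue paths with the representative and hit flag might look roughly like this (hypothetical names and types, not the runtime's code):

package greentea

// Hypothetical span metadata extended with the two fields described above.
type span struct {
    gray, black    []bool
    enqueued, hit  bool
    representative int // object greyed when the span was first enqueued
}

// scanObject stands in for the per-object scan; it greys whatever the
// object points to (omitted here).
func scanObject(s *span, i int) {}

// grey records a newly marked object. The first grey while the span is
// unqueued picks the representative; any further grey while the span is
// queued sets the hit flag, meaning at least two objects are pending.
func grey(s *span, i int, queue *[]*span) {
    if s.gray[i] {
        return
    }
    s.gray[i] = true
    if s.enqueued {
        s.hit = true
        return
    }
    s.representative = i
    s.enqueued = true
    *queue = append(*queue, s)
}

// drain applies the single-object fast path: if nothing else was greyed
// while the span was queued, scan only the representative rather than
// sweeping the whole span's mark bits.
func drain(queue []*span) {
    for _, s := range queue {
        s.enqueued = false
        if !s.hit {
            s.black[s.representative] = true
            scanObject(s, s.representative)
            continue
        }
        s.hit = false
        for i := range s.gray {
            if s.gray[i] && !s.black[i] {
                s.black[i] = true
                scanObject(s, i)
            }
        }
    }
}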

Prototype evaluation

We evaluated the prototype implementation across a variety of benchmarks, on low-CPU-count, high-CPU-count, amd64, and arm64 Linux virtual machines. The rest of this section summarizes the results, focusing primarily on the differences in garbage collection CPU cost.

In select GC-heavy microbenchmarks (garbage, from golang.org/x/benchmarks, and binary-trees Go #2, from the Computer Language Benchmarks Game), depending on core count, we observed anywhere from a 10–50% reduction in GC CPU costs compared to the existing Go GC. The improvement generally rose with core count, indicating that the prototype scales better than the existing implementation. Furthermore, the number of L1 and L2 cache misses was reduced by half in these benchmarks.

On our bent and sweet benchmark suites, the results were more varied.

The results are positive overall, but include a mix of improvements and regressions.

Most benchmarks were either unaffected by the changes to the garbage collector, or regressed or improved solely due to changes that had little to do with the garbage collector, such as code alignment changes. Some benchmarks regressed even though less CPU time was spent in the garbage collector. One reason is that the garbage collector's mark phase is active for less time, leading to less floating garbage, which acts as a ballast in some benchmarks. Another reason is that less time spent in the garbage collector means more time spent in other scalability bottlenecks, either in the runtime or in user code, leading to a net apparent regression.

The Go compiler benchmarks appear to inconsistently show a very slight regression (0.5%). Given the magnitude and inconsistency of the regression, these benchmarks appear to be rather insensitive to this change. One hypothesis is that the occasional regression may be due to an out-of-date PGO profile, but this remains to be investigated.

The tile38 benchmark shows substantial improvements across throughput, latency and memory use. In addition, it showed a 35% reduction in garbage collection overheads. This benchmark queries a local instance of a Tile38 in-memory geospatial database pre-seeded with data. Most of the heap consists of a high-fanout tree, so Green Tea is able to quickly generate large amounts of work and high densities.

The bleve-index benchmark has a heap topology that is quite difficult for Green Tea, though performance overall is a wash. Most of the heap consists of a low-fanout binary tree that is rapidly mutated by the benchmark. Green Tea struggles to generate locality in this benchmark, and half of all span scans only scan a single object. Our current hypothesis is that because of frequent tree rotations, the tree structure itself becomes shuffled across a large heap (100+ MiB). In contrast, the binary-trees benchmark does not perform rotations, so the tree layout retains the good locality of its initial allocations. This suggests that Green Tea has good locality when the application itself has good locality, unlike the current Go GC; but Green Tea, unsurprisingly, can’t create locality out of nothing. Both Linux perf and CPU profiles in the 16-core amd64 environment indicate a small ~2% regression in garbage collection overhead. Overall, the single object scan optimization was integral to making this benchmark perform well in the 16-core amd64 environment. In the 72- and 88-core environments we see a significant improvement, due to the design's improved many-core scalability. There remains a small overall regression on the 16-core arm64 environment that still needs to be investigated.

Future work

SIMD-accelerated scanning kernels

Scanning memory in larger blocks of memory unlocks the ability to apply SIMD to small objects in the garbage collector. The core idea is to generate a unique scanning kernel for each size class and use SIMD bit manipulation and permutation instructions to load, mask, swizzle, pack, and enqueue pointers. The regularity of the layout of objects in a single span and Go’s packed representation of pointer/scalar metadata for small objects both play a major role in making this feasible.

Austin Clements developed prototype AVX512-based scanning kernels that reduce garbage collection overheads by another 15–20% in the benchmarks where we already saw improvements. The prototype implementation does not currently use these kernels because they only apply to a small subset of objects at this point in time.

These SIMD kernels tend to require a higher density of objects to scan within a span in order to outperform scanning those objects sparsely, one at a time. The kernels are still being developed, so the prototype does not use them by default, and when they are enabled, SIMD scanning is only used once a minimum density threshold is reached.

Concentrator network

Austin's original design for Green Tea used a sorting network called the concentrator network to achieve the even higher pointer density required by SIMD-based scanning, and to generate locality even for metadata operations like setting gray bits. The network was carefully designed to minimize queuing costs, so individual pointers could still be processed efficiently when sufficient scan density was unavailable.

There are two main reasons we did not pursue this direction in the short term. First, we found that even very low density can produce good results, especially with the single-object scan optimization. Second, the concentrator network is more complex to implement, as it is a greater departure from the existing algorithm. However, this design is still an avenue we plan to explore, since it is far more general and tunable.

Acknowledgements

Credit to Austin Clements for the inception of the algorithm and initial prototyping. (They wrote down the key idea in 2018!)

Credit to Yves Vandriessche from Intel for providing many microarchitectural insights that were vital to making this design viable. Many of his suggestions were applied to the core scan loop, including proper prefetching, batching subsequent pointer loads to hide their memory latency, and simpler iteration over object pointers.

[^1]: "The Garbage Collection Handbook" by Richard Jones, Anthony Hosking, Eliot Moss, a canonical source for garbage collection techniques, divides parallel garbage collection algorithms into two categories, "processor-centric" and "memory-centric." Loosely, the former category consists of algorithms that aggressively balance work across processors to maximize parallelism, while the latter consists of algorithms that may not parallelize as well, but process contiguous blocks of the heap. The current marking algorithm in the Go garbage collector is firmly processor-centric. Unfortunately for us, the handbook's wisdom ends there, as it goes on to state that "[a]ll known parallel marking algorithms are processor-centric." (page 279)

mknyszek avatar May 02 '25 16:05 mknyszek

Change https://go.dev/cl/658036 mentions this issue: runtime: mark and scan small objects in whole spans [green tea]

gopherbot avatar May 02 '25 17:05 gopherbot

How to try it out

First,

go install golang.org/dl/go1.25rc1@latest

Then build your program with GOEXPERIMENT=greenteagc go1.25rc1 build.
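
To confirm the experiment took effect, you can check the built binary: go version ./yourprogram reports the toolchain plus the experiment tag, e.g. go1.25rc1 X:greenteagc when Green Tea is enabled (the Staticcheck report later in this issue shows the same check).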

Sending feedback

Please focus your attention on whole programs. Microbenchmarks tend to be poor representatives of garbage collection behavior in real programs. Large changes in the foundations such as these can move the needle for all sorts of reasons, unrelated to (or even inversely related to) how efficient the garbage collector actually is.

If you encounter a situation where Green Tea isn't working out for you, we'd appreciate it if you could share some details with us; they would be really helpful moving forward. (This is also helpful if it is working for you.)

  1. The platform you're running on.
  2. The CPU model (or if in the cloud, the VM type) you're running your code on.
  3. Excerpts from stderr of your program running with GODEBUG=gctrace=2, both with and without Green Tea.
  4. A CPU profile of your program running with and without Green Tea.
  5. An execution trace capturing a few complete GC cycles, with and without Green Tea (a few seconds is usually enough).

Note: you can use GOEXPERIMENT=nogreenteagc to explicitly disable Green Tea at build time.
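
For example, assuming your program builds to a binary called ./myprogram (a placeholder name), the two gctrace excerpts from item 3 above could be collected like this:

GOEXPERIMENT=greenteagc go1.25rc1 build -o myprogram_greentea
GOEXPERIMENT=nogreenteagc go1.25rc1 build -o myprogram_vanilla
GODEBUG=gctrace=2 ./myprogram_greentea 2>greentea.gctrace
GODEBUG=gctrace=2 ./myprogram_vanilla 2>vanilla.gctrace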

For problems you encounter, please file a new GitHub issue and reference this one, or you can email me directly at mknyszek(at)golang.org if you're not comfortable publishing details or diagnostics publicly.

For successes, post to this issue!

Thank you!

mknyszek avatar May 02 '25 17:05 mknyszek

Change https://go.dev/cl/669655 mentions this issue: main.star: add greenteagc builders

gopherbot avatar May 02 '25 22:05 gopherbot

~No discernible change~ Minor improvement in CPU time for Staticcheck.

CPU: AMD Ryzen 9 3950X 16-Core Processor

> ./staticcheck -debug.version | head -1
staticcheck (devel, v0.7.0-0.dev.0.20250511162810-63832850a84e)

Vanilla:
> GOEXPERIMENT=nogreenteagc go build
> go version ./staticcheck
./staticcheck: go1.25-devel_fac2ccbed3 Thu May 15 08:27:04 2025 -0700
> rm -rf ~/.cache/staticcheck/
> perf stat ./staticcheck -debug.cpuprofile=vanilla.cpu.pprof std

 Performance counter stats for './staticcheck -debug.cpuprofile=vanilla.cpu.pprof std':

        197,121.11 msec task-clock:u                     #   18.932 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           741,678      page-faults:u                    #    3.763 K/sec                     
   738,518,295,442      cycles:u                         #    3.747 GHz                       
   124,817,958,966      stalled-cycles-frontend:u        #   16.90% frontend cycles idle      
   785,988,849,431      instructions:u                   #    1.06  insn per cycle            
                                                  #    0.16  stalled cycles per insn   
   163,480,505,449      branches:u                       #  829.340 M/sec                     
     2,197,680,754      branch-misses:u                  #    1.34% of all branches           

      10.411895545 seconds time elapsed

     182.959316000 seconds user
      10.483651000 seconds sys
> rm -rf ~/.cache/staticcheck
> ./staticcheck -debug.trace=vanilla.trace std # trace collected separately
> rm -rf ~/.cache/staticcheck
> GODEBUG=gctrace=2 ./staticcheck std 2>vanilla.gctrace # GC trace collected separately


Green tea:
> GOEXPERIMENT=greenteagc go build
> go version ./staticcheck
./staticcheck: go1.25-devel_fac2ccbed3 Thu May 15 08:27:04 2025 -0700 X:greenteagc
> rm -rf ~/.cache/staticcheck/
> perf stat ./staticcheck -debug.cpuprofile=greentea.cpu.pprof std
 Performance counter stats for './staticcheck -debug.cpuprofile=greentea.cpu.pprof std':

        193,681.28 msec task-clock:u                     #   18.553 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           782,480      page-faults:u                    #    4.040 K/sec                     
   723,000,663,331      cycles:u                         #    3.733 GHz                       
   130,424,001,536      stalled-cycles-frontend:u        #   18.04% frontend cycles idle      
   788,115,346,312      instructions:u                   #    1.09  insn per cycle            
                                                  #    0.17  stalled cycles per insn   
   158,001,376,147      branches:u                       #  815.780 M/sec                     
     2,432,170,668      branch-misses:u                  #    1.54% of all branches           

      10.439577978 seconds time elapsed

     178.367421000 seconds user
      11.257257000 seconds sys
> rm -rf ~/.cache/staticcheck
> ./staticcheck -debug.trace=greentea.trace std # trace collected separately
> rm -rf ~/.cache/staticcheck
> GODEBUG=gctrace=2 ./staticcheck std 2>greentea.gctrace # GC trace collected separately

The attached files contain the CPU profiles and the gctrace output. I also have execution traces, but they span the whole execution and are ~50 MB each when compressed, which GitHub won't let me attach.

greentea.tar.gz vanilla.tar.gz

perf stat before/after courtesy of normalize-benchfmt:

             │ vanilla.bench │          greentea.bench           │
             │      sec      │    sec      vs base               │
Task-clock:u      199.3 ± 0%   194.4 ± 1%  -2.50% (p=0.000 n=10)
Wall-time         10.11 ± 0%   10.18 ± 1%  +0.74% (p=0.005 n=10)
User-time         184.9 ± 0%   178.9 ± 1%  -3.26% (p=0.000 n=10)
System-time       10.66 ± 2%   11.33 ± 3%  +6.32% (p=0.000 n=10)
geomean           44.63        44.75       +0.25%

                          │ vanilla.bench │            greentea.bench             │
                          │      val      │     val      vs base                  │
Context-switches:u           0.000 ± 0%      0.000 ± 0%        ~ (p=1.000 n=10) ¹
Cpu-migrations:u             0.000 ± 0%      0.000 ± 0%        ~ (p=1.000 n=10) ¹
Page-faults:u               796.2k ± 4%     804.4k ± 4%        ~ (p=0.579 n=10)
Cycles:u                    739.5G ± 0%     719.6G ± 1%   -2.69% (p=0.000 n=10)
Stalled-cycles-frontend:u   125.9G ± 0%     131.1G ± 0%   +4.07% (p=0.000 n=10)
Instructions:u              786.6G ± 0%     789.0G ± 0%   +0.30% (p=0.000 n=10)
Branches:u                  163.6G ± 0%     158.1G ± 0%   -3.36% (p=0.000 n=10)
Branch-misses:u             2.205G ± 0%     2.436G ± 0%  +10.48% (p=0.000 n=10)
geomean                                 ²                 +1.15%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

dominikh avatar May 15 '25 16:05 dominikh

Thanks for sharing! I think there is a win here, though it's not a massive improvement. I think there are some effects that might be masking a bigger win.

At least according to the gctrace, I see less overall CPU time spent in any individual GC cycle, on average around 10-15% less.

Also, here are the distributions of wall time spent in the mark phase:

$ cat greentea.gctrace | grep ^gc | cut -d ' ' -f 5 | cut -d '+' -f 2 | dist
N 152  sum 3269.07  mean 21.507  gmean 12.3254  std dev 19.7445  variance 389.844

     min 0.28
   1%ile 0.365667
   5%ile 0.9985
  25%ile 4.94167
  median 16.5
  75%ile 29.5833
  95%ile 59.1
  99%ile 82.1533
     max 101

⠀⠀⠀⠀⠀⠀⠀⠀⢀⡴⠒⠒⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.036
⠀⠀⠀⠀⠀⠀⠀⡰⠋⠀⠀⠀⠀⠈⠓⠒⠒⠤⠒⠒⠒⠒⠒⠦⢤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠞⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠙⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠋⠉⠉⠙⠒⠒⠦⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠁
        0                          50                          100
$ cat vanilla.gctrace | grep ^gc | cut -d ' ' -f 5 | cut -d '+' -f 2 | dist
N 139  sum 3502.72  mean 25.1994  gmean 15.1408  std dev 20.7032  variance 428.621

     min 0.28
   1%ile 0.3018
   5%ile 1.13
  25%ile 6.28333
  median 21
  75%ile 36.6667
  95%ile 66.4
  99%ile 80.2733
     max 81

⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠒⠒⠒⠲⢤⣀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.028
⠀⠀⠀⠀⠀⠀⠀⣠⠔⠁⠀⠀⠀⠀⠀⠀⠈⠓⠒⠒⠊⠉⠉⠉⠁⠀⠈⠉⠙⠒⠦⢤⣀⡀⠀⠀⠀⠀⠀⠀⠀⢀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠚⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠉⠉⠉⠁⠈⠉⠉⠉⠉⠉⠉⠉⠉⠒⠒⠒⠒⠒⠒⠢⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
         0

Simultaneously, because each GC cycle is shorter, there's less floating garbage, leading to a smaller live heap. This means for the same amount of allocation work, we're going to have more GC cycles, if GOGC is constant (which it is here). Proof of that is in the fact that with Green Tea there are 152 GC cycles, whereas the original GC only executes 139 cycles. Despite this increase in the number of cycles, the CPU profile suggests that around 10% less time is spent in the garbage collector.
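
For a rough sense of the mechanism: with the default GOGC=100, the next heap goal is approximately double the live heap left after marking. If shorter mark phases leave, say, 500 MB live instead of 600 MB, the next cycle triggers after roughly 500 MB of new allocation rather than 600 MB, so the same total allocation volume gets divided into more, individually cheaper cycles.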

Here are the heap size distributions:

$ cat greentea.gctrace | grep ^gc | cut -d ' ' -f 11 | sed 's/->/\./g' | cut -d '.' -f 2 | dist
N 152  sum 78714  mean 517.855  gmean 295.01  std dev 441.656  variance 195060

     min 3
   1%ile 4.71333
   5%ile 23.85
  25%ile 118.167
  median 425
  75%ile 746.583
  95%ile 1422.05
  99%ile 1576.32
     max 1644

⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠒⠒⠒⠲⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.001
⠀⠀⠀⠀⠀⠀⠀⣠⠔⠁⠀⠀⠀⠀⠀⠀⠙⠒⠦⠤⠤⠖⠒⠋⠉⠉⠉⠓⠒⠒⠒⠲⠤⠤⠤⣄⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠴⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠙⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠚⠉⠉⠉⠉⠉⠉⠙⠒⠒⠢⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
         0                              1000
$ cat vanilla.gctrace | grep ^gc | cut -d ' ' -f 11 | sed 's/->/\./g' | cut -d '.' -f 2 | dist
N 139  sum 77043  mean 554.266  gmean 314.515  std dev 450.016  variance 202514

     min 3
   1%ile 4.45333
   5%ile 22
  25%ile 124.333
  median 504
  75%ile 820.833
  95%ile 1492.1
  99%ile 1602.14
     max 1661

⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠴⠒⠒⠒⠦⣄⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.001
⠀⠀⠀⠀⠀⠀⠀⣠⠞⠁⠀⠀⠀⠀⠀⠈⠑⠦⠤⠤⠤⠖⠋⠉⠀⠀⠀⠈⠉⠉⠙⠒⠦⠤⠤⢤⣀⣀⣀⡀⠀⣀⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠴⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠁⠀⠀⠀⠈⠉⠉⠉⠒⠒⠒⠒⠒⠚⠉⠉⠉⠉⠙⠒⠒⠒⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁

As for why this isn't improving overall end-to-end time much, I suspect that:

  1. The garbage collector is just not on the critical path. Does staticcheck run in parallel or is it mostly a single goroutine?
  2. The increase in the number of GC cycles from a reduction in memory is eroding some of the CPU win.

(I could probably analyze this more efficiently if I just made a tool to parse gctraces... I've been doing quite a lot of this lately. 😅)

mknyszek avatar May 15 '25 16:05 mknyszek

The garbage collector is just not on the critical path. Does staticcheck run in parallel or is it mostly a single goroutine?

The actual analysis phase of Staticcheck tries to use GOMAXPROCS goroutines at once, parallelizing on packages as well as individual analyzers per package.

I've updated my original comment with a proper comparison of before and after perf stat. Maybe that can help shed some more light. The increase in system time surprises me.

dominikh avatar May 15 '25 16:05 dominikh

The increase in system time might be related to the use of a runtime.mutex for the global span list. I tried to swap that out with a lock-free stack (and a lock-free stack of blocks of spans, not coalesced) and it actually scaled significantly worse. With a lock, we can amortize the lock's use over several spans (we try to almost never take just one off the list once we get the lock) whereas a lock-free stack requires an atomic operation for each span we want to take off the list. I'm not quite sure how to confirm this. (It's possibly also a second-order effect from (apparently) having a little less GC pressure. Maybe there's some other system resource that staticcheck relies on that's now hammered slightly harder. Like a kernel lock or something.)
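
To illustrate the amortization point with a hypothetical sketch (not the runtime's code): a mutex-protected list pays one lock acquisition per batch of spans taken, while a Treiber-style lock-free stack pays at least one compare-and-swap per span.

package spanlist

import (
    "sync"
    "sync/atomic"
)

type span struct{ next *span }

// lockedList amortizes its mutex over a whole batch of spans.
type lockedList struct {
    mu   sync.Mutex
    head *span
}

// takeBatch pops up to n spans with a single lock acquisition.
func (l *lockedList) takeBatch(n int) []*span {
    l.mu.Lock()
    defer l.mu.Unlock()
    var batch []*span
    for l.head != nil && len(batch) < n {
        s := l.head
        l.head = s.next
        batch = append(batch, s)
    }
    return batch
}

// lockFreeList is a Treiber-style stack: every pop costs at least one
// CAS, and a real implementation would also have to deal with the ABA
// problem (ignored in this sketch).
type lockFreeList struct {
    head atomic.Pointer[span]
}

func (l *lockFreeList) takeOne() *span {
    for {
        s := l.head.Load()
        if s == nil {
            return nil
        }
        if l.head.CompareAndSwap(s, s.next) {
            return s
        }
    }
}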

As an aside, there may be a more clever lock-free data structure we could use that would allow us to take many spans off the global list at once, but it's probably going to be quite complicated.

I also checked out the new scan stats I added and based on that, Green Tea's core hypothesis does seem to apply to staticcheck, which is good. A dozen or two objects are scanned on every span scan, which is not bad.

mknyszek avatar May 15 '25 17:05 mknyszek

Hi, will the new GC implement a method to remove/re-add an object completely from the allocated objects list?

Of course, this may introduce memory leaks if the developer does not take care, but it would be useful when a large number of objects are allocated, like an in-memory key/value storage where the value is not a slice of bytes.

If I'm right, the experimental arena feature was abandoned.

mxmauro avatar May 29 '25 16:05 mxmauro

Hi, will the new GC implement a method to remove/re-add an object completely from the allocated objects list?

No, that's out of scope. This is just changing the implementation, not adding any user-visible APIs.

Of course, this may introduce memory leaks if the developer does not take care, but it would be useful when a large number of objects are allocated, like an in-memory key/value storage where the value is not a slice of bytes.

If the value contains any pointers, the garbage collector must observe it to prevent a bad free. It's worse than just a memory leak in that case, it's potential memory corruption.

If I'm right, the experimental arena feature was abandoned.

You may be interested in the discussion of memory regions, which is more likely to be the successor of the arenas experiment. See https://github.com/golang/go/discussions/70257.

mknyszek avatar May 29 '25 17:05 mknyszek

With the release of Go 1.25 RC1, it should be substantially easier to try out Green Tea, since a tip toolchain is no longer needed. I updated the comment containing the instructions above.

As before, please try it out!

mknyszek avatar Jun 11 '25 19:06 mknyszek

@mknyszek

The new instructions are not working for me. If I just set GOTOOLCHAIN and then run go version, I get go version go1.25rc1 linux/amd64, but if I set GOEXPERIMENT=greenteagc, I get

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x3 pc=0x81a081]

goroutine 1 [running]:
cmd/go/internal/fips140.Init()
	cmd/go/internal/fips140/fips140.go:117 +0x101
cmd/go/internal/modload.Init()
	cmd/go/internal/modload/init.go:419 +0x33
cmd/go/internal/toolchain.Exec({0xc0000261bc, 0x9})
	cmd/go/internal/toolchain/select.go:356 +0x3fc
cmd/go/internal/toolchain.Select()
	cmd/go/internal/toolchain/select.go:284 +0x11a5
main.main()
	cmd/go/main.go:107 +0x4f

This doesn't seem to be something specific to GOEXPERIMENT=greenteagc. I also get the exact same crash with GOEXPERIMENT=jsonv2. The go command just crashes immediately regardless of arguments. The value of GOTOOLCHAIN doesn't seem to matter, either, as long as it's a valid version. If GOTOOLCHAIN is set to a valid version and GOEXPERIMENT is set to anything at all, running the go command instantly crashes with the above.

For context, I'm on Arch and have Go installed via the go package.

DeedleFake avatar Jun 11 '25 20:06 DeedleFake

Oops, that looks like a bug in cmd/go (in your installed version, not 1.25rc1) where it crashes if there is a GOEXPERIMENT the installed version doesn't know about before it switches to the new toolchain. I'll file an issue shortly.

As a workaround, it should work to go install golang.org/dl/go1.25rc1@latest and then use GOEXPERIMENT=greenteagc go1.25rc1 build.

prattmic avatar Jun 11 '25 20:06 prattmic

Filed #74111.

prattmic avatar Jun 11 '25 20:06 prattmic

Ugh, my apologies. I ran the command in the instructions myself, but with a toolchain that understood the GOEXPERIMENT. I will fix up the instructions. Thanks for reporting so promptly! Thank you also to @prattmic for quickly filing a bug and figuring out the problem.

mknyszek avatar Jun 11 '25 21:06 mknyszek

Hello everyone, when do you think this GC can officially become part of Go and be considered production-ready? @mknyszek

introspection3 avatar Jun 13 '25 06:06 introspection3

Our prototype focuses exclusively on “small object spans”, which are exactly 8 KiB and contain objects up to 512 bytes.

I have tested this new collector on a case with "big objects" that shows a significant regression, so I'm not sure if this is something that's known and being worked on, but I thought I'd report it just in case.

many-big-json-in-assets is a Hugo project I created a while ago to test a new "dynamic LRU cache" implementation.[^1] It basically reads many big JSON files into a Go map. Running with go1.25rc1:

Output below comes from a MacBook M1 Pro 32GB.

greenteagc disabled
~/d/s/h/many-big-json-in-assets ❯❯❯ nohugogreenteagc env                                                               main
hugo v0.148.0-DEV-4217fee4b0455dac0c711708afb27c6506fa3339 darwin/arm64 BuildDate=2025-06-09T09:44:13Z
GOOS="darwin"
GOARCH="arm64"
GOVERSION="go1.25rc1"
github.com/sass/dart-sass/protocol="3.1.0"
github.com/sass/dart-sass/compiler="1.81.0"
github.com/sass/dart-sass/implementation="1.81.0"

~/d/s/h/many-big-json-in-assets ❯❯❯ time HUGO_MEMORYLIMIT=8 nohugogreenteagc --logLevel info       main ⬆ ⬇ ✖ ✱ ◼
Start building sites …
hugo v0.148.0-DEV-4217fee4b0455dac0c711708afb27c6506fa3339 darwin/arm64 BuildDate=2025-06-09T09:44:13Z

INFO  build:  step process substep collect files 2000 files_total 2000 pages_total 2000 resources_total 0 duration 65.761333ms
INFO  build:  step process duration 65.8265ms
INFO  build:  step assemble duration 24.065ms
INFO  dynacache: adjusted partitions' max size evicted 956 numGC 22 limit 8.00 GB alloc 13.56 GB totalAlloc 18.40 GB
INFO  dynacache: adjusted partitions' max size evicted 4171 numGC 22 limit 8.00 GB alloc 18.25 GB totalAlloc 25.33 GB
INFO  build:  step render substep pages site en outputFormat html duration 36.802299917s
INFO  build:  step render substep pages site en outputFormat rss duration 226.391375ms
INFO  build:  step render pages 2003 content 2000 duration 37.032966792s
INFO  build:  step render deferred count 0 duration 1.792µs
INFO  build:  step postProcess duration 212.333µs
INFO  build:  duration 37.123888042s

                  │  EN
──────────────────┼──────
 Pages            │ 2003
 Paginator pages  │    0
 Non-page files   │    0
 Static files     │    0
 Processed images │    0
 Aliases          │    0
 Cleaned          │    0

Total in 37141 ms
HUGO_MEMORYLIMIT=8 nohugogreenteagc --logLevel info  156.74s user 63.40s system 567% cpu 38.792 total
greenteagc enabled
 hugogreenteagc  env                                                                main
hugo v0.148.0-DEV-4217fee4b0455dac0c711708afb27c6506fa3339 darwin/arm64 BuildDate=2025-06-09T09:44:13Z
GOOS="darwin"
GOARCH="arm64"
GOVERSION="go1.25rc1 X:greenteagc"
github.com/sass/dart-sass/protocol="3.1.0"
github.com/sass/dart-sass/compiler="1.81.0"
github.com/sass/dart-sass/implementation="1.81.0"

~/d/s/h/many-big-json-in-assets ❯❯❯ time HUGO_MEMORYLIMIT=8 hugogreenteagc --logLevel info                   main
Start building sites …
hugo v0.148.0-DEV-4217fee4b0455dac0c711708afb27c6506fa3339 darwin/arm64 BuildDate=2025-06-09T09:44:13Z

INFO  build:  step process substep collect files 2000 files_total 2000 pages_total 2000 resources_total 0 duration 70.020208ms
INFO  build:  step process duration 70.332041ms
INFO  build:  step assemble duration 25.223333ms
INFO  dynacache: adjusted partitions' max size evicted 956 numGC 22 limit 8.00 GB alloc 13.48 GB totalAlloc 21.83 GB
INFO  dynacache: adjusted partitions' max size evicted 4563 numGC 22 limit 8.00 GB alloc 20.40 GB totalAlloc 28.81 GB
INFO  dynacache: adjusted partitions' max size evicted 316 numGC 26 limit 8.00 GB alloc 8.62 GB totalAlloc 65.96 GB
INFO  build:  step render substep pages site en outputFormat html duration 1m9.976262208s
INFO  build:  step render substep pages site en outputFormat rss duration 485.931166ms
INFO  build:  step render pages 2003 content 2000 duration 1m10.465748584s
INFO  build:  step render deferred count 0 duration 2.167µs
INFO  build:  step postProcess duration 38.583µs
INFO  build:  duration 1m10.56192575s

                  │  EN
──────────────────┼──────
 Pages            │ 2003
 Paginator pages  │    0
 Non-page files   │    0
 Static files     │    0
 Processed images │    0
 Aliases          │    0
 Cleaned          │    0

Total in 70586 ms
HUGO_MEMORYLIMIT=8 hugogreenteagc --logLevel info  146.87s user 171.52s system 438% cpu 1:12.63 total

There are instructions on how to run this yourself in many-big-json-in-assets.

Also note that Hugo has some hidden flags that may be useful in this context:

  • --profile-cpu=<filename>
  • --profile-mem=<filename>
  • --profile-mutex=<filename>

[^1]: It resizes the cache in flight when low on memory.

bep avatar Jun 13 '25 07:06 bep

@bep A regression that large is very unlikely to be the result of the garbage collector becoming wildly slow. Plus, the large object scanning path hasn't changed at all, so if your heap is mostly dominated by large pointer-ful things, it won't make a difference. ~The memory limit bits here seem fishy to me. "alloc" in your reported numbers is much higher than limit both before and after. I do not know exactly what that number is referring to (this is one reason I requested a specific list of diagnostics in the instructions, since I know what all the numbers mean in those diagnostics) but if it's the allocated heap size, then I would argue this is a misconfigured program. Green Tea might use slightly more memory for small objects, and it's possible the change is just pushing you further into "the GC is running continuously because the memory limit is too low" territory.~ EDIT: AFAICT, the memory limit doesn't have anything to do with GOMEMLIMIT, so my assumption here was wrong. Apologies for the noise.

I'll take a look, but at face value, I would not extrapolate the performance of Green Tea from this example alone.

Thank you for taking the time to try it out! I'll post a more detailed analysis soon.

mknyszek avatar Jun 13 '25 16:06 mknyszek

Hello everyone, when do you think this GC can officially become part of Go and be considered production-ready? @mknyszek

It's an experiment that's part of an upcoming official release (Go 1.25). If there are no significant blockers, the goal is for it to become the default in Go 1.26.

If you're wondering whether it's solid enough to try in production, I would say yes, it is, and I encourage teams to try it out. It has been running smoothly in CI for over a month now and has been tested against Google's extensive internal test suite without any complicating or worrying factors.

mknyszek avatar Jun 13 '25 16:06 mknyszek

@bep A regression that large is very unlikely to be the result of the garbage collector becoming wildly slow. Plus, the large object scanning path hasn't changed at all, so if your heap is mostly dominated by large pointer-ful things, it won't make a difference. ~The memory limit bits here seem fishy to me. "alloc" in your reported numbers is much higher than limit both before and after. I do not know exactly what that number is referring to (this is one reason I requested a specific list of diagnostics in the instructions, since I know what all the numbers mean in those diagnostics) but if it's the allocated heap size, then I would argue this is a misconfigured program. Green Tea might use slightly more memory for small objects, and it's possible the change is just pushing you further into "the GC is running continuously because the memory limit is too low" territory.~ EDIT: AFAICT, the memory limit doesn't have anything to do with GOMEMLIMIT, so my assumption here was wrong. Apologies for the noise.

I'll take a look, but at face value, I would not extrapolate the performance of Green Tea from this example alone.

Thank you for taking the time to try it out! I'll post a more detailed analysis soon.

Yeah OK, so the GC itself does seem to be faster, but there's a regression with, AFAICT, how much zeroing is performed? I'm still digging into it, but this may just be a small mistake. I'll see where it leads. EDIT: See the next comment for details. It's probably not a mistake causing things to get zeroed when they shouldn't.

Again, thank you @bep for reporting! This feedback is very helpful. Apologies for my initial reply being a bit blunt -- I'd immediately entered 'analysis mode'. 😅

mknyszek avatar Jun 13 '25 16:06 mknyszek

@bep Here's some more data.

gctrace without Green Tea:

gc 1 @0.005s 0%: 0.016+0.59+0.042 ms clock, 1.0+0.11/0.89/0+2.7 ms cpu, 4->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 2 @0.007s 1%: 0.017+0.71+0.099 ms clock, 1.1+0.10/0.99/0+6.3 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 3 @0.011s 2%: 0.037+0.86+0.14 ms clock, 2.4+0.14/1.2/0.010+9.2 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 4 @0.018s 2%: 0.051+0.90+0.096 ms clock, 3.2+0.12/1.6/0.028+6.1 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 5 @0.022s 2%: 0.042+0.91+0.079 ms clock, 2.7+0.083/1.9/0.001+5.0 ms cpu, 3->3->2 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 6 @0.029s 2%: 0.26+0.95+0.064 ms clock, 16+0.082/3.2/0+4.0 ms cpu, 5->6->4 MB, 5 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 7 @0.041s 2%: 0.044+1.2+0.18 ms clock, 2.8+0.11/8.6/0+11 ms cpu, 8->8->6 MB, 9 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 8 @0.063s 2%: 0.33+2.6+0.23 ms clock, 21+0.063/15/0+14 ms cpu, 13->14->8 MB, 13 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 9 @0.072s 3%: 0.43+5.0+0.39 ms clock, 27+1.6/34/0+25 ms cpu, 17->19->12 MB, 18 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 10 @0.086s 5%: 0.89+3.9+0.38 ms clock, 57+2.9/46/9.6+24 ms cpu, 23->24->15 MB, 26 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 11 @0.132s 4%: 0.058+4.3+0.11 ms clock, 3.7+0.13/56/39+7.1 ms cpu, 28->28->20 MB, 32 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 12 @0.155s 5%: 2.4+4.9+0.023 ms clock, 157+14/50/1.8+1.4 ms cpu, 37->48->35 MB, 42 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 13 @0.166s 6%: 1.6+5.1+0.032 ms clock, 105+28/52/0.58+2.0 ms cpu, 61->77->49 MB, 71 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 14 @0.176s 6%: 0.83+10+0.29 ms clock, 53+28/42/0.14+18 ms cpu, 92->171->128 MB, 107 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 15 @0.192s 7%: 1.6+5.9+0.38 ms clock, 104+26/50/0.013+24 ms cpu, 219->281->141 MB, 257 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 16 @0.204s 7%: 1.5+9.1+0.92 ms clock, 97+20/44/0+58 ms cpu, 251->378->249 MB, 287 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 17 @0.224s 8%: 2.0+17+2.0 ms clock, 132+0.31/99/0.76+131 ms cpu, 428->571->321 MB, 500 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 18 @0.262s 8%: 1.5+37+2.2 ms clock, 97+0/241/9.9+141 ms cpu, 549->690->455 MB, 645 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 19 @0.354s 9%: 1.8+61+2.7 ms clock, 120+0/882/0.20+177 ms cpu, 776->1137->1057 MB, 912 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 20 @0.529s 10%: 1.5+189+3.9 ms clock, 98+77/1935/34+255 ms cpu, 1815->2925->2399 MB, 2115 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 21 @0.955s 12%: 2.0+282+2.8 ms clock, 133+489/4364/0+179 ms cpu, 4088->5595->4131 MB, 4800 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 22 @1.621s 14%: 1.9+441+2.9 ms clock, 122+2267/6687/0+188 ms cpu, 7039->9565->7301 MB, 8265 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 23 @2.791s 15%: 1.8+816+2.4 ms clock, 120+4808/11935/17+157 ms cpu, 12435->16992->13070 MB, 14603 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 24 @4.861s 16%: 2.7+1088+0.22 ms clock, 174+10203/17338/0.23+14 ms cpu, 22263->28569->20030 MB, 26141 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 25 @7.744s 13%: 1.7+212+0.10 ms clock, 114+2611/3379/0.77+6.8 ms cpu, 34117->35359->4336 MB, 40061 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 26 @10.501s 11%: 1.2+196+0.099 ms clock, 78+4169/3136/3.6+6.3 ms cpu, 7387->8689->4425 MB, 8674 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 27 @10.988s 11%: 0.85+201+0.22 ms clock, 54+3789/3204/21+14 ms cpu, 7538->8788->4224 MB, 8852 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 28 @11.507s 12%: 1.8+191+0.18 ms clock, 115+3141/3041/0.31+11 ms cpu, 7204->8413->4109 MB, 8450 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 29 @12.011s 12%: 0.97+216+0.21 ms clock, 62+3827/3380/40+13 ms cpu, 7001->8312->4391 MB, 8220 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 30 @12.501s 12%: 0.62+291+0.61 ms clock, 40+2884/4629/3212+39 ms cpu, 7481->8996->6304 MB, 8783 MB goal, 0 MB stacks, 0 MB globals, 64 P

gctrace with Green Tea:

gc 1 @0.006s 1%: 0.095+1.1+0.074 ms clock, 6.1+0.12/1.3/0+4.7 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 2 @0.008s 3%: 0.016+1.0+0.23 ms clock, 1.0+0.14/0.95/0+14 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 3 @0.013s 3%: 0.044+0.67+0.083 ms clock, 2.8+0.14/0.98/0+5.3 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 4 @0.017s 3%: 0.050+0.61+0.10 ms clock, 3.2+0.18/1.3/0+6.5 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 5 @0.023s 3%: 0.039+0.73+0.24 ms clock, 2.5+0.15/1.7/0+15 ms cpu, 4->4->2 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 6 @0.030s 3%: 0.047+1.4+0.073 ms clock, 3.0+0.19/7.8/0+4.6 ms cpu, 4->4->3 MB, 5 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 7 @0.036s 3%: 0.030+0.80+0.070 ms clock, 1.9+0.11/3.8/0+4.4 ms cpu, 6->6->5 MB, 7 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 8 @0.062s 2%: 0.27+2.9+0.46 ms clock, 17+0.19/8.5/0+29 ms cpu, 10->12->7 MB, 10 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 9 @0.070s 3%: 0.12+2.1+0.57 ms clock, 8.1+0.025/11/0+36 ms cpu, 14->15->9 MB, 16 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 10 @0.079s 4%: 0.60+2.5+0.22 ms clock, 38+0.63/28/0+14 ms cpu, 17->19->12 MB, 20 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 11 @0.088s 5%: 0.52+4.4+1.0 ms clock, 33+0.23/24/0+67 ms cpu, 22->25->16 MB, 26 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 12 @0.131s 4%: 0.060+2.3+0.10 ms clock, 3.8+0.10/25/11+6.4 ms cpu, 30->30->21 MB, 34 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 13 @0.149s 4%: 2.5+8.1+0.091 ms clock, 160+0.30/44/0+5.8 ms cpu, 37->53->41 MB, 43 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 14 @0.171s 5%: 4.3+10+0.086 ms clock, 275+20/34/0+5.5 ms cpu, 74->85->50 MB, 84 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 15 @0.190s 6%: 0.43+3.2+0.36 ms clock, 27+16/30/1.0+23 ms cpu, 99->102->62 MB, 102 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 16 @0.197s 6%: 0.96+3.6+0.10 ms clock, 61+19/35/0+6.5 ms cpu, 110->136->89 MB, 134 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 17 @0.206s 7%: 1.5+8.0+0.37 ms clock, 101+19/39/0+23 ms cpu, 167->259->175 MB, 204 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 18 @0.222s 7%: 1.7+11+0.75 ms clock, 111+7.0/42/0+48 ms cpu, 299->422->248 MB, 352 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 19 @0.242s 7%: 2.4+9.8+2.6 ms clock, 156+0.029/42/0+171 ms cpu, 425->521->282 MB, 497 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 20 @0.272s 8%: 2.6+12+2.7 ms clock, 170+0.014/93/0+177 ms cpu, 482->527->328 MB, 566 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 21 @0.331s 7%: 1.5+81+1.8 ms clock, 100+0.35/301/0+120 ms cpu, 560->690->631 MB, 658 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 22 @0.667s 5%: 1.5+140+3.5 ms clock, 98+0.048/754/0+224 ms cpu, 1077->1333->1209 MB, 1264 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 23 @1.129s 5%: 9.8+242+5.9 ms clock, 632+10/1238/0+379 ms cpu, 2063->2754->2080 MB, 2421 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 24 @1.859s 5%: 8.2+381+6.8 ms clock, 529+15/2496/0+440 ms cpu, 3544->4476->3397 MB, 4162 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 25 @2.962s 5%: 5.5+369+4.7 ms clock, 353+33/3873/0+305 ms cpu, 5788->6526->4705 MB, 6796 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 26 @4.275s 6%: 4.6+510+5.8 ms clock, 298+21/5909/0+374 ms cpu, 8015->9084->7069 MB, 9412 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 27 @6.413s 6%: 2.7+748+4.5 ms clock, 178+26/9085/20+289 ms cpu, 12041->13439->10214 MB, 14139 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 28 @9.330s 6%: 6.6+967+2.5 ms clock, 424+50/13317/0+163 ms cpu, 17397->19249->14845 MB, 20429 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 29 @13.734s 5%: 3.4+365+5.4 ms clock, 217+50/2733/97+348 ms cpu, 25288->26017->3886 MB, 29691 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 30 @16.489s 4%: 1.0+106+0.51 ms clock, 68+2295/1667/22+33 ms cpu, 7028->7807->4092 MB, 7774 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 31 @17.009s 5%: 0.99+221+3.0 ms clock, 63+2599/3459/20+196 ms cpu, 6972->9066->7114 MB, 8186 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 32 @17.588s 5%: 1.1+424+3.0 ms clock, 76+4020/6684/16+193 ms cpu, 12121->15980->12158 MB, 14230 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 33 @18.830s 5%: 0.54+151+1.2 ms clock, 34+1427/2347/15+82 ms cpu, 20709->22037->4507 MB, 24318 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 34 @20.708s 5%: 1.2+133+1.1 ms clock, 80+1787/2094/28+73 ms cpu, 7677->9013->4484 MB, 9016 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 35 @21.291s 5%: 0.65+124+0.095 ms clock, 41+56/1972/3223+6.0 ms cpu, 7639->8056->3817 MB, 8971 MB goal, 0 MB stacks, 0 MB globals, 64 P

CPU profile without Green Tea:

Showing nodes accounting for 145.41s, 39.94% of 364.05s total
Dropped 732 nodes (cum <= 1.82s)
Showing top 10 nodes out of 150
      flat  flat%   sum%        cum   cum%
    29.16s  8.01%  8.01%     81.06s 22.27%  runtime.scanobject
    23.63s  6.49% 14.50%     46.89s 12.88%  encoding/json.checkValid
    19.05s  5.23% 19.73%     19.05s  5.23%  runtime.memclrNoHeapPointers
    15.83s  4.35% 24.08%     27.45s  7.54%  runtime.findObject
    10.67s  2.93% 27.01%     17.77s  4.88%  encoding/json.(*decodeState).rescanLiteral
    10.41s  2.86% 29.87%     11.08s  3.04%  runtime.spanOf (inline)
    10.12s  2.78% 32.65%     10.20s  2.80%  encoding/json.unquoteBytes
     9.20s  2.53% 35.18%     10.74s  2.95%  encoding/json.stateInString
     8.77s  2.41% 37.59%     15.18s  4.17%  encoding/json.stateEndValue
     8.57s  2.35% 39.94%    156.15s 42.89%  runtime.mallocgc

CPU profile with Green Tea:

Showing nodes accounting for 560.86s, 69.72% of 804.48s total
Dropped 844 nodes (cum <= 4.02s)
Showing top 10 nodes out of 124
      flat  flat%   sum%        cum   cum%
   306.87s 38.15% 38.15%    306.87s 38.15%  runtime.memclrNoHeapPointers
    76.30s  9.48% 47.63%     76.30s  9.48%  runtime.memmove
    29.34s  3.65% 51.28%     41.06s  5.10%  runtime.tryDeferToSpanScan
    26.94s  3.35% 54.63%     59.69s  7.42%  runtime.mallocgcTiny
    25.38s  3.15% 57.78%     60.28s  7.49%  runtime.scanObjectsSmall
    24.94s  3.10% 60.88%     48.96s  6.09%  encoding/json.checkValid
    24.16s  3.00% 63.88%     24.16s  3.00%  runtime.duffzero
    20.42s  2.54% 66.42%     20.42s  2.54%  internal/runtime/atomic.(*Uint64).CompareAndSwap (inline)
    14.98s  1.86% 68.28%     47.01s  5.84%  runtime.(*spanSet).pop
    11.53s  1.43% 69.72%     18.43s  2.29%  encoding/json.(*decodeState).rescanLiteral

Notes:

  • The peak memory usage characteristics are wildly different, before and after. It's possible this is the root cause of the additional zeroing. With less of the heap being freshly allocated from the operating system, we're reusing memory better, but this also means much more zeroing.
  • Most of the zeroing comes from zeroing the heap bits area of new small object spans and from zeroing the objects themselves.
  • With Green Tea, the GC overheads seem far more stable, but there are more GC cycles. The number of cycles actually spent on garbage collection seems a lot lower, too.
  • One thing that comes to mind is how the change in GC behavior is affecting the caching policy that hugo implements. If it's watching runtime metrics to make decisions, then a faster GC (leading to less floating garbage, hence less of a ballast, etc. etc.) could lead to very different behavior, as the GC and the caching behavior effectively operate in a mutual feedback loop. Like, if hugo is much more readily dropping cache entries, that could easily lead to a lot more zeroing.
  • I compared memory profiles and could not find a significant difference in total amount allocated, which runs counter to the theory that cache entries are getting dropped more frequently.

mknyszek avatar Jun 13 '25 16:06 mknyszek

If it's watching runtime metrics to make decisions, then a faster GC (leading to less floating garbage, hence less of a ballast, etc. etc.) could lead to very different behavior, as the GC and the caching behavior effectively operate in a mutual feedback loop. Like, if hugo is much more readily dropping cache entries, that could easily lead to a lot more zeroing.

@mknyszek thanks for looking into this. So, unmarshalling lots of JSON files into maps and then handing them over to the Go GC does not sound very uncommon, but I guess what you're saying above is that Hugo is doing this handover/zeroing in bulk (cache expiry), which I guess is less common (but not very uncommon).

bep avatar Jun 13 '25 20:06 bep

If it's watching runtime metrics to make decisions, then a faster GC (leading to less floating garbage, hence less of a ballast, etc. etc.) could lead to very different behavior, as the GC and the caching behavior effectively operate in a mutual feedback loop. Like, if hugo is much more readily dropping cache entries, that could easily lead to a lot more zeroing.

@mknyszek thanks for looking into this. So, unmarshalling lots of JSON files into maps and then handing them over to the Go GC does not sound very uncommon, but I guess what you're saying above is that Hugo is doing this handover/zeroing in bulk (cache expiry), which I guess is less common (but not very uncommon).

No, I don't think unmarshalling lots of JSON files into maps is very relevant here, even if it's happening in bulk, or all the memory is getting dropped in bulk. The GC deals especially well with the latter, with or without Green Tea (dead memory is very cheap). Yes, lots of applications do this, and the ones that do, do just fine with Green Tea. In fact, looking purely at the cost of the garbage collector, there's a substantial reduction in GC overheads here with Green Tea, and the GODEBUG=gctrace=2 output suggests that the Green Tea design is very effective at batching objects in this workload (80 16-byte objects scanned per span on average, 223 24-byte objects per span on average, 124 48-byte objects per span on average -- these are all great). EDIT: I also confirmed that, apples-to-apples, the same GC cycle takes around 20% less CPU time for a heap of this workload. So it's almost definitely not directly the GC.

This slowdown on the end-to-end wall time is likely a second or third order effect.

I don't think I have a conclusion yet, but what I'm trying to say with this line you quoted is that what's unusual about Hugo is that it's making decisions about liveness based on heap size metrics. The heap size is already the culmination of a feedback loop consisting of many parameters that is based on the behavior of previous cycles, and so on. This is complex in itself but is at least something I'm familiar with. For me, predicting exactly what's going to happen with GC scheduling when a program makes liveness decisions based on heap metrics is tougher. The culmination of these effects seems to be that the current GC has a ~10 GiB larger peak heap size than Green Tea on this workload. My current best guess is that the ~10 GiB peak heap size difference means 10 more GiB of allocated memory that didn't come fresh from the operating system and needs to be zeroed. But that also doesn't seem like a complete answer. (EDIT: This is the point I was trying to make, but it was just a guess, and it was wrong. See below.)

mknyszek avatar Jun 13 '25 21:06 mknyszek

I've essentially ruled out the "more memory is zeroed" hypothesis. I patched in some code to count the number of zeroed bytes, and the difference is <1 GiB, or only single-digit percent.

My next lead is that sys time is up by an order of magnitude, and how much it's up (~350 CPU seconds) corresponds fairly closely with the regression in zeroing in the CPU profile. My first suspicion was that something regressed with returning memory back to the OS forcing page faults, but no, it's not that. Even if I completely disable that, nothing changes. Plus, perf stat is telling me that the Green Tea version incurs half the page faults as without Green Tea...

Still investigating.

mknyszek avatar Jun 16 '25 18:06 mknyszek

If it's watching runtime metrics to make decisions, then a faster GC (leading to less floating garbage, hnce less of a ballast, etc. etc.) could lead to very different behavior, as the GC and the caching behavior effectively operate in a mutual feedback loop. Like, if hugo is much more readily dropping cache entries, that could easily lead to a lot more zeroing.

This was just a guess, but in addition to my comment above, another thing that makes this irrelevant is that I can just set HUGO_MEMORYLIMIT very high and reproduce the same slowdown. So the cache eviction behavior doesn't actually matter.

mknyszek avatar Jun 16 '25 20:06 mknyszek

I have to put this aside for today, but so far I've learned through Linux perf (via the cycles event) that the extra time unsurprisingly is coming from time spent in the kernel, which we end up in via memclrNoHeapPointers. Unfortunately, I do not have access to a machine or VM at the moment that will easily let me symbolize kernel addresses... So far I've been able to rule out the following perf events:

  • minor-faults
  • major-faults
  • page-faults
  • dTLB-*
  • iTLB-*
  • L1-icache-*

I may need to double-check my work and make sure I didn't misread the results on any of them.

I was concerned that the root cause was dTLB-store-related issues (an attempt to store but the page is not in the TLB) but that doesn't seem to be the case... maybe.

mknyszek avatar Jun 16 '25 21:06 mknyszek

I managed to symbolize kernel symbols on a different machine. These are task-clock based samples, so a little less accurate, but if it's correct, it's more information than before.

-   30.56%     0.71%  hugo     hugo                  [.] runtime.memclrNoHeapPointers                                                             
   - 29.84% runtime.memclrNoHeapPointers                                                                                                          
      - 29.84% asm_exc_page_fault                                                                                                                 
         - 29.83% exc_page_fault                                                                                                                  
            - 29.83% do_user_addr_fault                                                                                                           
               - 29.14% handle_mm_fault                                                                                                           
                  - 29.06% __handle_mm_fault                                                                                                      
                     - 15.12% do_wp_page                                                                                                          
                        - 7.96% ptep_clear_flush                                                                                                  
                           - flush_tlb_mm_range                                                                                                   
                              - 7.90% on_each_cpu_cond_mask                                                                                       
                                 - 7.90% smp_call_function_many_cond                                                                              
                                      3.38% __x2apic_send_IPI_mask                                                                                
                        - 6.45% __pte_offset_map_lock                                                                                             
                           - 6.43% _raw_spin_lock                                                                                                 
                                6.41% __pv_queued_spin_lock_slowpath                                                                              
                     - 10.24% _raw_spin_lock                                                                                                      
                          10.21% __pv_queued_spin_lock_slowpath                                                                                   
                     + 2.52% do_anonymous_page                                                                                                    
                     + 1.01% do_huge_pmd_anonymous_page

Except I still don't really understand why Green Tea would cause more page faults, or more contention in the page fault handler. It is true that, especially without HUGO_MEMORYLIMIT, this program's heap grows monotonically (up to around 50 GiB), so one would expect most new allocations to force faulting in fresh memory.

... but then we also shouldn't be faulting in new memory on memclrNoHeapPointers. I have plenty of RAM on all the machines I'm testing on, so we shouldn't be hitting swap. The only page fault should be on first access to fresh memory, which we skip zeroing for. (It makes sense that it might happen from initHeapBits, though, which does touch fresh memory. We should consider not doing that...)

Also, I did indeed observe fewer page faults at GOMAXPROCS=64, but 1.5x-2x the page faults when I set GOMAXPROCS=16. On the other hand, both the regression and the fact that the kernel spin lock is a major part of it appear to hold either way. There doesn't necessarily have to be more page faults in any case; it seems to me that each page fault is more expensive because this spin lock is a scalability bottleneck.

My "optimistic" hypothesis about this issue is that because Green Tea is faster and scales better, it's uncovered a different scalability bottleneck, and performance is collapsing on that bottleneck. I'm not yet 100% sold on this as the answer, but one piece of evidence to support this is the fact that there's less time spent in GC assists with Green Tea (which makes sense, if the GC completes faster, goroutines don't need to assist as much). This is supported by the gctraces from earlier, but especially so when HUGO_MEMORYLIMIT is effectively disabled (100):

Without Green Tea:

gc 1 @0.005s 1%: 0.052+1.1+0.12 ms clock, 3.3+0.093/0.92/0+7.7 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 2 @0.008s 1%: 0.034+0.76+0.022 ms clock, 2.2+0.10/0.91/0+1.4 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 3 @0.013s 2%: 0.034+0.64+0.12 ms clock, 2.1+0.11/1.2/0.019+8.1 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 4 @0.019s 2%: 0.032+0.68+0.099 ms clock, 2.0+0.11/1.8/0+6.3 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 5 @0.022s 2%: 0.026+0.85+0.10 ms clock, 1.6+0.17/2.9/0.050+6.6 ms cpu, 3->3->2 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 6 @0.030s 2%: 0.038+0.84+0.12 ms clock, 2.4+0.078/4.2/0.009+8.2 ms cpu, 5->5->3 MB, 5 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 7 @0.040s 2%: 0.042+1.6+0.067 ms clock, 2.7+0.092/12/0.002+4.3 ms cpu, 6->6->5 MB, 7 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 8 @0.060s 3%: 0.75+3.3+0.067 ms clock, 48+0.015/18/0+4.3 ms cpu, 10->12->7 MB, 11 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 9 @0.068s 4%: 0.20+3.9+1.2 ms clock, 13+0.072/20/0+77 ms cpu, 14->16->10 MB, 16 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 10 @0.078s 5%: 0.61+5.1+0.23 ms clock, 39+2.6/43/0+15 ms cpu, 18->23->14 MB, 21 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 11 @0.102s 5%: 0.060+3.6+0.030 ms clock, 3.8+0.041/49/33+1.9 ms cpu, 26->26->16 MB, 30 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 12 @0.138s 4%: 0.20+9.0+0.027 ms clock, 13+5.2/62/0+1.7 ms cpu, 29->44->32 MB, 34 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 13 @0.153s 6%: 3.0+9.3+0.041 ms clock, 193+27/44/0+2.6 ms cpu, 60->66->41 MB, 65 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 14 @0.175s 8%: 5.6+5.1+0.37 ms clock, 362+27/55/0.090+24 ms cpu, 72->89->57 MB, 85 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 15 @0.189s 9%: 1.0+6.0+0.086 ms clock, 68+33/50/0.030+5.5 ms cpu, 114->149->104 MB, 128 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 16 @0.200s 9%: 1.5+7.6+0.25 ms clock, 99+27/59/0.21+16 ms cpu, 179->258->156 MB, 211 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 17 @0.213s 10%: 1.3+9.2+0.71 ms clock, 87+29/70/0.74+45 ms cpu, 269->371->221 MB, 314 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 18 @0.230s 10%: 2.0+10+2.1 ms clock, 130+0/63/0+135 ms cpu, 378->472->262 MB, 444 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 19 @0.259s 9%: 2.0+65+2.5 ms clock, 131+0.066/142/0+162 ms cpu, 450->691->488 MB, 527 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 20 @0.378s 10%: 1.3+63+3.1 ms clock, 88+0.079/885/0.62+200 ms cpu, 835->1175->1034 MB, 978 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 21 @0.573s 11%: 4.3+154+2.7 ms clock, 278+347/1997/0.86+174 ms cpu, 1765->2622->2140 MB, 2070 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 22 @0.950s 12%: 1.6+255+3.0 ms clock, 105+995/3283/0.042+193 ms cpu, 3646->5147->3892 MB, 4282 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 23 @1.589s 14%: 3.0+388+1.2 ms clock, 197+2139/6066/0+80 ms cpu, 6630->8981->6769 MB, 7785 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 24 @2.678s 15%: 2.5+667+2.1 ms clock, 163+4783/10560/0+137 ms cpu, 11531->15466->11846 MB, 13539 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 25 @4.575s 16%: 2.4+1175+3.4 ms clock, 153+8633/18393/0+218 ms cpu, 20177->26772->20447 MB, 23694 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 26 @7.822s 16%: 3.1+2216+1.6 ms clock, 200+12146/35272/0+108 ms cpu, 34825->45998->35265 MB, 40895 MB goal, 0 MB stacks, 0 MB globals, 64 P

With Green Tea:

gc 1 @0.006s 1%: 0.10+1.0+0.095 ms clock, 6.4+0.10/1.2/0+6.1 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 2 @0.009s 3%: 0.031+0.69+0.23 ms clock, 2.0+0.16/1.1/0+15 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 3 @0.013s 3%: 0.038+0.85+0.12 ms clock, 2.4+0.13/0.98/0+8.2 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 4 @0.018s 4%: 0.038+0.72+0.23 ms clock, 2.4+0.15/1.1/0+15 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 5 @0.024s 3%: 0.051+0.83+0.097 ms clock, 3.3+0.15/1.5/0+6.2 ms cpu, 4->4->2 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 6 @0.031s 3%: 0.039+1.0+0.055 ms clock, 2.5+0.75/4.1/0.066+3.5 ms cpu, 5->5->3 MB, 5 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 7 @0.042s 2%: 0.048+1.6+0.079 ms clock, 3.1+0.15/8.6/0.30+5.0 ms cpu, 6->7->5 MB, 7 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 8 @0.065s 2%: 0.40+2.4+0.13 ms clock, 26+0.58/12/0+8.4 ms cpu, 10->11->7 MB, 10 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 9 @0.073s 4%: 0.75+2.2+0.52 ms clock, 48+0.16/13/0+33 ms cpu, 13->14->9 MB, 14 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 10 @0.081s 5%: 0.51+3.6+0.89 ms clock, 32+0.21/17/0+57 ms cpu, 17->20->12 MB, 19 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 11 @0.094s 5%: 0.73+4.0+0.47 ms clock, 47+0.069/25/0.29+30 ms cpu, 24->25->16 MB, 26 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 12 @0.141s 4%: 0.054+3.8+0.10 ms clock, 3.4+0.15/30/10+6.9 ms cpu, 31->32->24 MB, 34 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 13 @0.156s 4%: 1.9+17+0.31 ms clock, 125+8.4/35/0+19 ms cpu, 44->76->57 MB, 49 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 14 @0.183s 6%: 3.9+10+0.26 ms clock, 255+16/46/1.3+16 ms cpu, 109->140->82 MB, 116 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 15 @0.202s 6%: 1.1+8.4+0.36 ms clock, 73+10/37/0+23 ms cpu, 157->233->153 MB, 167 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 16 @0.217s 6%: 1.2+10+0.54 ms clock, 77+0.14/46/0+34 ms cpu, 265->354->196 MB, 308 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 17 @0.234s 7%: 1.4+6.1+1.4 ms clock, 94+1.6/39/0+94 ms cpu, 335->397->215 MB, 394 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 18 @0.249s 7%: 2.6+29+3.8 ms clock, 167+0.044/48/0+246 ms cpu, 370->434->267 MB, 432 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 19 @0.305s 7%: 2.8+12+2.0 ms clock, 184+0.046/100/0+133 ms cpu, 457->471->304 MB, 536 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 20 @0.386s 6%: 0.65+52+1.3 ms clock, 41+0/247/13+86 ms cpu, 519->582->532 MB, 610 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 21 @0.667s 5%: 1.7+100+1.2 ms clock, 115+0.031/633/0+80 ms cpu, 908->1029->940 MB, 1066 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 22 @1.076s 5%: 7.1+231+8.0 ms clock, 454+28/1096/1.6+512 ms cpu, 1603->2184->1839 MB, 1882 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 23 @1.866s 5%: 1.2+234+4.6 ms clock, 78+0.072/2039/0+295 ms cpu, 3134->3539->2541 MB, 3680 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 24 @2.645s 5%: 1.8+324+5.5 ms clock, 119+0.040/3268/19+355 ms cpu, 4330->4815->3615 MB, 5084 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 25 @3.912s 5%: 5.0+475+7.2 ms clock, 320+0.031/4847/0+463 ms cpu, 6160->7026->5453 MB, 7233 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 26 @5.752s 5%: 5.4+611+4.0 ms clock, 348+12/6917/0+257 ms cpu, 9290->10407->7943 MB, 10907 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 27 @8.341s 5%: 4.6+776+5.5 ms clock, 295+10/10238/0+355 ms cpu, 13531->14891->11409 MB, 15887 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 28 @12.533s 5%: 2.9+1116+6.9 ms clock, 187+54/15377/0+441 ms cpu, 20773->22817->17236 MB, 22820 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 29 @18.833s 5%: 5.3+1594+7.7 ms clock, 341+56/23711/15+493 ms cpu, 31523->34216->25726 MB, 34474 MB goal, 0 MB stacks, 0 MB globals, 64 P
gc 30 @27.104s 5%: 5.5+2052+0.72 ms clock, 356+136/32755/5565+46 ms cpu, 46963->50694->38246 MB, 51454 MB goal, 0 MB stacks, 0 MB globals, 64 P

(Assist CPU time appears in the cpu column of each line, whose format is <sweep term STW>+<assist>/<dedicated>/<idle>+<mark term STW> ms cpu.)

(Note: the GC CPU fraction reported by gctrace is likely lower for Green Tea than it should be because the rest of the program is slower, but actual CPU time and CPU cycle numbers confirm about a 20% reduction, repeatably.)
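To make that concrete with the final cycle of each trace: without Green Tea, gc 26 reports 200+12146/35272/0+108 ms cpu, i.e. roughly 12.1 CPU-seconds of assist work; with Green Tea, gc 30 reports 356+136/32755/5565+46 ms cpu, i.e. roughly 0.14 CPU-seconds of assists, at a broadly similar live heap size.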

Assists put back-pressure on allocation, putting space and time between page faults and reducing contention. And with added contention, a spin lock in particular is likely to see its performance collapse completely.
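For intuition on why contention hits a spin lock so hard, here's a deliberately naive sketch (a generic test-and-set lock in Go, not the kernel's queued spinlock, and the goroutine/iteration counts are arbitrary): every failed acquisition attempt bounces the lock's cache line between cores, so each critical section gets more expensive as more threads pile on, instead of staying flat.

package main

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// spinLock is a naive test-and-set lock, purely to illustrate the failure
// mode: each failed CompareAndSwap bounces the lock's cache line between
// cores, so acquisitions get slower the more threads are contending.
type spinLock struct{ state atomic.Uint32 }

func (l *spinLock) Lock() {
	for !l.state.CompareAndSwap(0, 1) {
		runtime.Gosched() // a real spin lock would spin or pause here
	}
}

func (l *spinLock) Unlock() { l.state.Store(0) }

func main() {
	var (
		l  spinLock
		wg sync.WaitGroup
	)
	counter := 0
	for i := 0; i < 64; i++ { // 64 goroutines hammering one lock
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				l.Lock()
				counter++ // stand-in for the PTE update done under the kernel's lock
				l.Unlock()
			}
		}()
	}
	wg.Wait()
	println(counter)
}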

Another minor piece of evidence supporting this hypothesis is that perf does show some samples in the spin lock without Green Tea too, just fewer.

Like I said, I'm not 100% sold yet on this idea. There's a lot I'm still unsure of when it comes to where the page fault contention is really coming from. I think there are more experiments I can run to try to make sure Green Tea isn't introducing any new artifacts here.

mknyszek avatar Jun 17 '25 02:06 mknyszek

In case it wasn't already clear, I'm fairly certain this case in do_wp_page is doing COW of a read-only mapping due to a write. I'm not 100% sure, but ptep_clear_flush is called from wp_page_copy (which is called from do_wp_page, presumably inlined). Replacing a RO page with a RW page seems like the most obvious reason a page fault would need to do a TLB shootdown in the first place.

I'd guess that the RO page being COW'd is the zero page?

Regarding the spin lock, the most likely candidate to me seems to be the page table lock. IIRC, this lock protects entries at the PMD level (that is, page table entries for 2MB of address space). It differs for huge pages, but I'm not sure huge pages use this code path.
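(For reference, the PTE table under one PMD entry holds 512 entries mapping 4 KiB each: 512 × 4 KiB = 2 MiB.)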

Contention here to me implies that many threads are hitting faults within the same 2MB region (either the exact same page, or nearby).

Perhaps you could collect the exceptions:page_fault_user tracepoint (perf record -e exceptions:page_fault_user) to see if the fault addresses are close together. Though with this much CPU in the fault handler, tracing will likely emit a lot of events!

Edit: I'm also curious about a profile without Green Tea. Is the page fault overhead completely new with Green Tea? Or is it just the lock contention that is new?

prattmic avatar Jun 17 '25 06:06 prattmic

In case it wasn't already clear, I'm fairly certain this case in do_wp_page is doing COW of a read-only mapping due to a write. I'm not 100% sure, but ptep_clear_flush is called from wp_page_copy (which is called from do_wp_page, presumably inlined). Replacing a RO page with a RW page seems like the most obvious reason a page fault would need to do a TLB shootdown in the first place.

I hadn't dug into the sources yet, this is super helpful.

I'd guess that the RO page being COW'd is the zero page?

Oh... are you saying that memclrNoHeapPointers is writing the zero page? Some of the arithmetic in there seemed funky, and it occurs to me that that might be the trick -- if the size wraps back around we end up writing to the zero page (or something, I haven't read it too closely).

Regarding the spin lock, the most likely candidate to me seems to be the page table lock. IIRC, this lock protects entries at the PMD level (that is, page table entries for 2MB of address space). It differs for huge pages, but I'm not sure huge pages use this code path.

As an aside, I did notice that performance substantially improved both on my personal work VM and in the perf gomotes when I disabled huge pages with GODEBUG=disablethp=1. (This just calls MADV_NOHUGEPAGE on all heap memory.) Could be unrelated.

Contention here to me implies that many threads are hitting faults within the same 2MB region (either the exact same page, or nearby).

Perhaps you could collect the exceptions:page_fault_user tracepoint (perf record -e exceptions:page_fault_user) to see if the fault addresses are close together. Though with this much CPU in the fault handler, tracing will likely emit a lot of events!

Thanks! I'll give it a try.

Edit: I'm also curious about a profile without Green Tea. Is the page fault overhead completely new with Green Tea? Or is it just the lock contention that is new?

There are a lot of page faults both before and after, and the difference depends on what GOMAXPROCS is, but I can't really intuitively say why. (GOMAXPROCS=64 is barely different, GOMAXPROCS=16 has the version with Green Tea hit 1.5-2x the page fault count, but it's inconsistent.)

perf does attribute quite a few samples without Green Tea, but it just seems to blow up when Green Tea is enabled. Instead of 29.84% in asm_exc_page_fault it's like 10%? But I've found these numbers to be noisy -- I have to use task-clock because I don't have cycles on my personal work VM, but I do have kernel symbolization. Now that I know what it is though, I can probably just interpret the kernel address as asm_exc_page_fault in the perf gomote. 😅

mknyszek avatar Jun 17 '25 15:06 mknyszek

I'd guess that the RO page being COW'd is the zero page?

Oh... are you saying that memclrNoHeapPointers is writing the zero page? Some of the arithmetic in there seemed funky, and it occurs to me that that might be the trick -- if the size wraps back around we end up writing to the zero page (or something, I haven't read it too closely).

Apologies. To be more clear, my theory is:

  1. Go mmaps a new MAP_ANONYMOUS page. Linux does not change the page tables at all at this point.
  2. Go reads from the page. This triggers a page fault, as there is no mapping. Linux sees this is a read of a brand new anonymous page. Rather than actually allocating a new page at this point, Linux adds a RO mapping of its zero page to the page table. A read of a new mapping should read as zero, so this is semantically correct.
  3. Go later writes to the page. This triggers a page fault because the mapping is not writable. Now Linux must allocate a new page to replace the RO zero page mapping with a RW mapping to the new page.

I am reasonably confident that this is how Linux handles mapping and touching new pages (or at least how it used to), but I'm less confident if this access pattern is what's actually happening in the application.
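A minimal, Linux-only standalone sketch of that access pattern in isolation (hypothetical demo code, not anything from the runtime), just to spell out where the two faults land:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// (1) Map one anonymous page. Linux does not touch the page tables yet.
	buf, err := syscall.Mmap(-1, 0, 4096,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(buf)

	// (2) The first read faults; the kernel can satisfy it by mapping its
	// shared zero page read-only.
	v := buf[0]

	// (3) The first write faults again (a write to a read-only mapping), so
	// the kernel must allocate a real page, replace the PTE, and flush the
	// TLB -- the do_wp_page/ptep_clear_flush path in the profile above.
	buf[0] = v + 1

	fmt.Println(buf[0])
}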

prattmic avatar Jun 17 '25 17:06 prattmic

A small tangent that I just wanted to capture in this discussion: I was confused earlier about how memclrNoHeapPointers could be writing to fresh memory. After all, we have an optimization to avoid zeroing fresh memory. (We do zero part of it via initHeapBits if it could contain small objects with pointers, which explains some of it.) But what @prattmic brought to my attention is that a larger allocation might have pages inside of it that never get written to and so stay zeroed. Once we recycle that memory, we lose the fact that it was freshly allocated, so the next time it's allocated we may end up clearing it with memclrNoHeapPointers anyway.
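To illustrate the shape of that optimization, here is a heavily simplified sketch -- the needzero flag mirrors the runtime's, but everything else here is made up for illustration:

package main

import "fmt"

// span is a toy stand-in for a runtime span: it remembers whether its
// memory might have been dirtied since it last came zeroed from the OS.
type span struct {
	mem      []byte
	needzero bool // false only while the memory is known to still be zero
}

// alloc hands out the first n bytes, clearing them only when the span's
// memory has been recycled. Fresh-from-OS memory skips the clear, so its
// first fault happens on the application's first real write instead.
func (s *span) alloc(n int) []byte {
	obj := s.mem[:n]
	if s.needzero {
		// This is the path that shows up as memclrNoHeapPointers in
		// the profiles above.
		for i := range obj {
			obj[i] = 0
		}
	}
	return obj
}

func main() {
	fresh := &span{mem: make([]byte, 64)}                  // came zeroed from the OS
	reused := &span{mem: make([]byte, 64), needzero: true} // recycled, must be cleared
	fmt.Println(len(fresh.alloc(16)), len(reused.alloc(16)))
}

In this toy model, the Hugo pages in question are the reused case: they may still be physically zero, but the allocator can no longer prove it, so it clears them and takes the write faults inside memclrNoHeapPointers.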

I mentioned that hugo seems to have a monotonically growing heap if its bespoke memory limit is set high enough, but it's not like the entire heap is live at every cycle -- just most of it. It does still generate garbage that gets collected, and that memory is going to get reused.

mknyszek avatar Jun 17 '25 18:06 mknyszek