
Fast 64bit SFT map

Open qinsoon opened this issue 2 years ago • 3 comments

This pull request adds an SFT map implementation for 64-bit systems, which maps one MMTk space to one entry in the SFT map. This closes https://github.com/mmtk/mmtk-core/issues/219 and closes https://github.com/mmtk/mmtk-core/issues/531.

This PR fixes the large memory usage and slow boot time caused by the sparse SFT map on 64-bit systems.
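For intuition on why a flat per-chunk SFT table is costly on 64 bits, here is a back-of-envelope sketch. The address-space width, chunk size, and entry size below are illustrative assumptions, not MMTk's exact constants:

```rust
// A flat per-chunk SFT table must have an entry for every chunk in the
// usable address space, whether or not MMTk ever touches that chunk.
// All three constants are assumptions for illustration.
const LOG_ADDRESS_SPACE: usize = 47; // assumed usable address bits
const LOG_CHUNK: usize = 22;         // assumed 4 MiB chunks
const ENTRY_BYTES: usize = 8;        // assumed pointer-sized entry per chunk

fn sparse_table_bytes() -> usize {
    (1usize << (LOG_ADDRESS_SPACE - LOG_CHUNK)) * ENTRY_BYTES
}

fn main() {
    // 2^25 chunks * 8 bytes = 256 MiB just for the table, which is why
    // an entry per *space* (a handful of entries) is so much smaller.
    println!("{} MiB", sparse_table_bytes() >> 20);
}
```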

  • move SFT and SFT map code from the space module to separate sft and sft_map modules.
  • rename the current struct SFTMap to struct SFTSparseChunkMap
  • add struct SFTSpaceMap
  • add struct SFTDenseChunkMap
  • add trait SFTMap. SFTSpaceMap, SFTSparseChunkMap and SFTDenseChunkMap all implement trait SFTMap
  • add feature malloc_mark_sweep
  • which SFTMap implementation is used depends on the build:
    • for 32 bits, use SFTSparseChunkMap.
    • for 64 bits with malloc_mark_sweep, use SFTDenseChunkMap.
    • for 64 bits without malloc_mark_sweep, use SFTSpaceMap.
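As a rough sketch of the space-map idea: if each space occupies a fixed, aligned virtual-address range, the high bits of any address identify its space, so the table needs only one entry per space. The constants below are illustrative, and a `&str` stands in for the real per-space SFT entry:

```rust
// Sketch of SFTSpaceMap (illustrative, not MMTk's actual layout):
// one table entry per space, indexed by the high bits of the address.
const LOG_SPACE_EXTENT: usize = 41; // assumed per-space address extent
const MAX_SPACES: usize = 16;       // assumed limit; must be a power of two here

struct SFTSpaceMap {
    table: [Option<&'static str>; MAX_SPACES],
}

impl SFTSpaceMap {
    fn new() -> Self {
        SFTSpaceMap { table: [None; MAX_SPACES] }
    }

    // One shift and one mask map an address to its table entry.
    fn index_of(addr: usize) -> usize {
        (addr >> LOG_SPACE_EXTENT) & (MAX_SPACES - 1)
    }

    // A space registers its SFT entry once, at its range's start.
    fn notify_space_creation(&mut self, space_start: usize, sft: &'static str) {
        self.table[Self::index_of(space_start)] = Some(sft);
    }

    fn get(&self, addr: usize) -> Option<&'static str> {
        self.table[Self::index_of(addr)]
    }
}

fn main() {
    let mut map = SFTSpaceMap::new();
    let immix_start = 3usize << LOG_SPACE_EXTENT;
    map.notify_space_creation(immix_start, "ImmixSpace");
    // Any address inside the space's range resolves to the same entry.
    assert_eq!(map.get(immix_start + 0xdead_beef), Some("ImmixSpace"));
    assert_eq!(map.get(5usize << LOG_SPACE_EXTENT), None);
}
```

The point of this layout is that the table is tiny (one entry per space) and lookup is branch-free, at the cost of requiring the 64-bit address-space discipline that 32-bit builds cannot provide.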

qinsoon · Jul 28 '22 02:07

binding-refs OPENJDK_BINDING_REPO=qinsoon/mmtk-openjdk OPENJDK_BINDING_REF=update-pr-632 JIKESRVM_BINDING_REPO=qinsoon/mmtk-jikesrvm JIKESRVM_BINDING_REF=update-pr-632

qinsoon · Jul 28 '22 23:07

We talked with Steve today, and he suggested adding a 3rd SFT map implementation in the PR: the SFT map still uses a small table, and uses side metadata to store the table index for each chunk. This SFT map implementation would work well for malloc mark sweep on 64 bits.
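A minimal sketch of that third design, under the same illustrative assumptions as before (a `Vec<u8>` stands in for real mmap-ed side metadata, `&str` for SFT entries, and all constants are made up for the example):

```rust
// Sketch of a dense chunk map: the SFT table stays small (one entry per
// participating space), and a byte of side metadata per chunk stores
// that chunk's index into the table. Illustrative only.
const LOG_CHUNK: usize = 22;    // assumed 4 MiB chunks
const MAX_CHUNKS: usize = 1024; // illustrative; real side metadata covers the heap range

struct SFTDenseChunkMap {
    index_of_chunk: Vec<u8>,  // side metadata: one index byte per chunk
    table: Vec<&'static str>, // small table: one entry per participating space
}

impl SFTDenseChunkMap {
    fn new() -> Self {
        // u8::MAX marks "no space": it is out of range of the small table.
        SFTDenseChunkMap { index_of_chunk: vec![u8::MAX; MAX_CHUNKS], table: Vec::new() }
    }

    // A space registers once and receives a small index...
    fn add_space(&mut self, sft: &'static str) -> u8 {
        self.table.push(sft);
        (self.table.len() - 1) as u8
    }

    // ...which is written into the side metadata of each chunk it acquires.
    fn set_chunks(&mut self, start: usize, bytes: usize, index: u8) {
        for chunk in (start >> LOG_CHUNK)..((start + bytes) >> LOG_CHUNK) {
            self.index_of_chunk[chunk] = index;
        }
    }

    fn get(&self, addr: usize) -> Option<&'static str> {
        let index = self.index_of_chunk[addr >> LOG_CHUNK];
        self.table.get(index as usize).copied()
    }
}

fn main() {
    let mut map = SFTDenseChunkMap::new();
    let ms = map.add_space("MallocSpace");
    map.set_chunks(4 << LOG_CHUNK, 2 << LOG_CHUNK, ms);
    assert_eq!(map.get((4 << LOG_CHUNK) + 123), Some("MallocSpace"));
    assert_eq!(map.get(7 << LOG_CHUNK), None);
}
```

This suits malloc mark sweep because malloc-ed chunks can land anywhere in the address space, so a per-chunk index is needed, but the table itself can stay small.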

So this PR is not yet ready for review. I will add the implementation, and let you know when it is ready. @wks

qinsoon · Jul 29 '22 03:07

The following are the performance results for the PR (running Immix with 2x min heap).

[Figure: benchmark time (sft-bmtime)]

[Figure: GC time (sft-gctime)]

Plotty link https://squirrel.anu.edu.au/plotty/yilin/mmtk/#0|skunk-2022-09-21-Wed-101533&benchmark^build^invocation^iteration&bmtime^time.other^time.stw&|10&iteration^1^2&benchmark^2^batik&benchmark^2^jme|20&1^invocation|30&1&benchmark&build;jdk-mmtk-master|70&build^Build|41&Histogram%20(with%20CI)^build^benchmark&

For most benchmarks, there is no measurable difference in either benchmark time or GC time. However, the results for a few benchmarks are worth noting:

  • h2o: the mutator time for the dense chunk map has a ~15% overhead. The overhead possibly comes from two sources: 1) this SFTMap implementation needs to map side metadata, and 2) we hold the space acquire lock longer than the other two implementations do, as we have to finish mapping side metadata before releasing the lock. In certain workloads this could be an issue. However, the dense chunk map should only be used with malloc mark sweep on 64 bits, so performance should not be a main concern in that scenario.
  • biojava: the GC time for biojava is strange. In this measurement, the sparse chunk map and the space map from the PR show a stable 8% slowdown. In my previous measurement (let's call it A), which was based on a different master commit, the two builds showed a 15% slowdown. When I investigated measurement A, the slowdown could be reproduced reliably, but any change to the build changed the results significantly, e.g. building with perf or work_packet_stat. For example, when I built with the perf feature on top of A but ran without any perf event, the results were significantly different: master was slower than all the other builds. Without profiling, I am unable to debug the issue further.

qinsoon · Sep 23 '22 01:09

Any benchmark results about the boot time improvement or the memory footprint reduction after switching to the sparse map?

Can pretty much just do what I (or Kunshan) did here to measure boot time.

k-sareen · Sep 26 '22 02:09

Any benchmark results about the boot time improvement or the memory footprint reduction after switching to the sparse map?

Can pretty much just do what I (or Kunshan) did here to measure boot time.

I will get some results for the boot time, and memory usage.

qinsoon · Sep 26 '22 02:09

These are some results from booting OpenJDK. I was using /usr/bin/time --verbose build/jdk-mmtk-dense-chunk-map/jdk/bin/java -XX:+UseThirdPartyHeap for each build, and ran it once. I also tried with what Kunal proposed (running build/jdk-mmtk-dense-chunk-map/jdk/bin/java -XX:+UseThirdPartyHeap -Xms4M -Xmx4M -verbose:gc -Xlog:::time,level,tags -Xlog:gc*), and looked at the log time for the second thread attach.

| Build | Elapsed wall clock time (s) | Maximum resident set size (kbytes) | Thread attach log time |
| --- | --- | --- | --- |
| Stock OpenJDK | 0.17 | 60696 | 0.021s |
| MMTk - sparse chunk map (master) | 0.52 | 839596 | 0.349s |
| MMTk - sparse chunk map (PR) | 0.53 | 838964 | 0.349s |
| MMTk - dense chunk map (PR) | 0.31 | 317160 | 0.163s |
| MMTk - space map (PR) | 0.30 | 314604 | 0.151s |

This seems to suggest we are still using much more memory at boot time. We may want to look further into this.

qinsoon · Sep 26 '22 04:09

These are some results from booting OpenJDK. […] This seems to suggest we are still using much more memory at boot time.

So we still aren't as fast as stock OpenJDK either. I presume these are results from running an empty Java file/class?

k-sareen · Sep 26 '22 05:09

These are some results from booting OpenJDK. […]

So we still aren't as fast as stock OpenJDK either. I presume these are results from running an empty Java file/class?

I ran without a class (so it will fail). And I just tried with an empty class. The results are the same.

qinsoon · Sep 26 '22 05:09

A totally empty class for me using the PR has the same performance?

Command being timed: "./build/linux-x86_64-normal-server-release/images/jdk/bin/java Empty.java"
        User time (seconds): 0.65
        System time (seconds): 0.03
        Maximum resident set size (kbytes): 80440

[...]

Command being timed: "./build/linux-x86_64-normal-server-release/images/jdk/bin/java -XX:+UseThirdPartyHeap Empty.java"
        User time (seconds): 0.66
        System time (seconds): 0.09
        Maximum resident set size (kbytes): 344428

$  cat Empty.java
class Empty {
    public static void main(String[] args) {
    }
}

Still has 4x memory usage though.

k-sareen · Sep 26 '22 06:09

Even attach time of both threads is the same in my build. Just checking -- your builds were built using make images or make product-bundles, right?

k-sareen · Sep 26 '22 06:09

Even attach time of both threads is the same in my build. Just checking -- your builds were built using make images or make product-bundles, right?

I was using the exploded image. The attach time of the first thread is always 0.001s, but the attach time for the second thread depends on MMTk's initialization speed.

qinsoon · Sep 26 '22 08:09

A totally empty class for me using the PR has the same performance? […] Still has 4x memory usage though.

Yeah. But the memory overhead might be from somewhere else. This PR should minimize the memory overhead of SFT map.

qinsoon · Sep 26 '22 08:09

I was using exploded image. The attach time of the first thread is always 0.001s, but the attach time for the second thread depends on the MMTk initialization speed.

Yeah, as we've found before, benchmarking using the exploded image is a bit questionable. At this point it seems like the slow start-up issue is solved (on the product images), but there is some other issue that is increasing the memory usage.

k-sareen · Sep 27 '22 01:09

These are some results from booting OpenJDK. […] This seems to suggest we are still using much more memory at boot time.

@k-sareen I tried /usr/bin/time with an image from make images on a different machine. So the numbers are different. But it is still clear that we are using more memory and have a slower boot time.

| Build | Elapsed wall clock time (s) | Maximum resident set size (kbytes) |
| --- | --- | --- |
| Stock OpenJDK | 0.07 | 35924 |
| MMTk - space map (PR) | 0.15 | 297384 |

So our resident set size is consistent (~300M), but with make images, the stock GC's resident set size is smaller (down from 60M to 35M). Our boot time overhead is roughly 80ms (VMMap::new() takes ~60ms on that machine).

qinsoon · Sep 28 '22 03:09