mmtk-core
Fast 64bit SFT map
This pull request adds an SFT map implementation for 64-bit systems, which maps one MMTk space to one entry in the SFT map. This closes https://github.com/mmtk/mmtk-core/issues/219 and closes https://github.com/mmtk/mmtk-core/issues/531.
This PR fixes the large memory usage and slow boot time caused by the sparse SFT map on 64 bits.
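On 64 bits, each MMTk space normally sits in a fixed address range, which is what makes a per-space table possible. The following is a minimal sketch of such a lookup; the placeholder `SFT` trait, the constants, and the index calculation are illustrative assumptions, not the PR's actual code.

```rust
// Sketch only: a per-space SFT table indexed by the high bits of an address.
trait SFT {} // stand-in for mmtk-core's per-space function table trait

const MAX_SPACES: usize = 16; // assumed upper bound on the number of spaces
const LOG_SPACE_EXTENT: usize = 41; // assumed log2 size of each space's range

struct SFTSpaceMap {
    /// One entry per space: a tiny fixed-size table, instead of one entry
    /// for every chunk of the whole address space.
    table: [Option<&'static dyn SFT>; MAX_SPACES],
}

impl SFTSpaceMap {
    fn get(&self, addr: usize) -> Option<&'static dyn SFT> {
        // Because every space occupies a fixed range, the space index can be
        // computed directly from the high bits of the address.
        let index = (addr >> LOG_SPACE_EXTENT) % MAX_SPACES;
        self.table[index]
    }
}
```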
- move the SFT and SFT map code from the `space` module to separate `sft` and `sft_map` modules.
- rename the current `struct SFTMap` to `struct SFTSparseChunkMap`.
- add `struct SFTSpaceMap`.
- add `struct SFTDenseChunkMap`.
- add `trait SFTMap`. Both `SFTSpaceMap` and `SFTChunkMap` implement `trait SFTMap`.
- add the feature `malloc_mark_sweep`.
- which `SFTMap` implementation is used depends on the build (a rough sketch of the selection follows this list):
  - for 32 bits, use `SFTSparseChunkMap`.
  - for 64 bits with `malloc_mark_sweep`, use `SFTDenseChunkMap`.
  - for 64 bits without `malloc_mark_sweep`, use `SFTSpaceMap`.
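A rough sketch of the per-build selection, assuming a `create_sft_map`-style constructor (the actual function name, signatures, and constructors in the PR may differ):

```rust
// Sketch only: pick the SFTMap implementation at compile time.
// `SFTMap`, `SFTSparseChunkMap`, `SFTDenseChunkMap` and `SFTSpaceMap` are the
// types listed above; `create_sft_map` and the `new()` calls are assumed.
#[cfg(target_pointer_width = "32")]
pub fn create_sft_map() -> Box<dyn SFTMap> {
    // 32 bits: a per-chunk table over the 4GB address space is cheap enough.
    Box::new(SFTSparseChunkMap::new())
}

#[cfg(all(target_pointer_width = "64", feature = "malloc_mark_sweep"))]
pub fn create_sft_map() -> Box<dyn SFTMap> {
    // 64 bits with malloc mark sweep: malloc'd memory can show up anywhere,
    // so use side metadata to record a small table index for each chunk.
    Box::new(SFTDenseChunkMap::new())
}

#[cfg(all(target_pointer_width = "64", not(feature = "malloc_mark_sweep")))]
pub fn create_sft_map() -> Box<dyn SFTMap> {
    // 64 bits without malloc mark sweep: spaces have fixed address ranges,
    // so one entry per space is enough.
    Box::new(SFTSpaceMap::new())
}
```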
binding-refs
```
OPENJDK_BINDING_REPO=qinsoon/mmtk-openjdk
OPENJDK_BINDING_REF=update-pr-632
JIKESRVM_BINDING_REPO=qinsoon/mmtk-jikesrvm
JIKESRVM_BINDING_REF=update-pr-632
```
We talked with Steve today, and he suggested adding a 3rd SFT map implementation in the PR: the SFT map still uses a small table, and uses side metadata to store the table index for each chunk. This SFT map implementation would work well for malloc mark sweep on 64 bits.
So this PR is not yet ready for review. I will add the implementation, and let you know when it is ready. @wks
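For illustration, here is a rough sketch of what such a dense chunk map lookup could look like; `read_side_metadata_u8`, the chunk size constant, and the struct layout are assumptions, not the eventual implementation:

```rust
// Sketch only: a small SFT table plus a per-chunk index stored in side metadata.
trait SFT {} // stand-in for the per-space function table trait

const LOG_BYTES_IN_CHUNK: usize = 22; // 4MB chunks, as MMTk uses

struct SFTDenseChunkMap {
    /// A small table with one slot per space created so far.
    table: Vec<&'static dyn SFT>,
}

impl SFTDenseChunkMap {
    fn get(&self, addr: usize) -> &'static dyn SFT {
        let chunk = addr >> LOG_BYTES_IN_CHUNK;
        // Side metadata records, for every chunk, the index of the space that
        // owns it; that index selects an entry in the small table.
        let index = read_side_metadata_u8(chunk) as usize;
        self.table[index]
    }
}

// Placeholder: read the per-chunk index byte from side metadata.
fn read_side_metadata_u8(_chunk: usize) -> u8 {
    unimplemented!()
}
```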
The following are the performance results for the PR (running Immix with 2x min heap):

- benchmark time
- GC time
- Plotty link: https://squirrel.anu.edu.au/plotty/yilin/mmtk/#0|skunk-2022-09-21-Wed-101533&benchmark^build^invocation^iteration&bmtime^time.other^time.stw&|10&iteration^1^2&benchmark^2^batik&benchmark^2^jme|20&1^invocation|30&1&benchmark&build;jdk-mmtk-master|70&build^Build|41&Histogram%20(with%20CI)^build^benchmark&

For most benchmarks, there is no measurable difference in either benchmark time or GC time. But there are some notable things in the results for some benchmarks:
- `h2o`: the mutator time for the dense chunk map has a ~15% overhead. The overhead possibly comes from two sources: 1) we need to map side metadata for this SFT map implementation, and 2) we hold the space acquire lock longer than the other two implementations, as we have to finish mapping side metadata before we release the lock (see the sketch after this list). In certain workloads, this could be an issue. However, the dense chunk map should only be used with malloc mark sweep on 64 bits, so performance should not be a main concern in that scenario.
- `biojava`: the GC time for `biojava` is strange. In this measurement, the sparse chunk map and the space map in the PR show a stable 8% slowdown. In my previous measurement (let's call it `A`), which was based on a different `master` commit, the two builds showed a 15% slowdown. In my investigation of the previous measurement, the slowdown could be reproduced reliably, but any change to the build may change the results significantly, e.g. building with `perf` or `work_packet_stat`. For example, I built with the feature `perf` on top of `A` but ran without any perf event, and the results were significantly different: `master` was slower than all the other builds. Without profiling, I am not able to debug the issue further.
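To illustrate the lock issue mentioned for `h2o`, here is a simplified sketch of an acquire path under the assumption that the dense chunk map must map its side metadata before the lock is released; the names and control flow are illustrative, not MMTk's actual code:

```rust
// Sketch only: why the dense chunk map holds the space acquire lock longer.
use std::sync::Mutex;

struct Space {
    acquire_lock: Mutex<()>,
}

impl Space {
    fn acquire(&self, start: usize, bytes: usize) {
        let _guard = self.acquire_lock.lock().unwrap();
        // Grow the space to cover [start, start + bytes) (elided).
        // Dense chunk map only: the per-chunk index metadata for the new range
        // has to be mmapped and initialized while the lock is still held, so
        // concurrent SFT lookups never touch unmapped metadata.
        self.map_sft_side_metadata(start, bytes);
        // Record this space in the SFT map for the new range (elided).
        // The lock is released here when `_guard` is dropped.
    }

    fn map_sft_side_metadata(&self, _start: usize, _bytes: usize) {
        // Placeholder: with the sparse chunk map or the space map there is no
        // extra side metadata to map at this point.
    }
}
```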
Any benchmark results about the boot time improvement or the memory footprint reduction after switching away from the sparse map?
Can pretty much just do what I (or Kunshan) did here to measure boot time.
> Any benchmark results about the boot time improvement or the memory footprint reduction after switching away from the sparse map?
>
> Can pretty much just do what I (or Kunshan) did here to measure boot time.
I will get some results for the boot time and memory usage.
These are some results from booting OpenJDK. I was using `/usr/bin/time --verbose build/jdk-mmtk-dense-chunk-map/jdk/bin/java -XX:+UseThirdPartyHeap` for each build, and ran it once. I also tried what Kunal proposed (running `build/jdk-mmtk-dense-chunk-map/jdk/bin/java -XX:+UseThirdPartyHeap -Xms4M -Xmx4M -verbose:gc -Xlog:::time,level,tags -Xlog:gc*`), and looked at the log time for the second thread attach.
| | Elapsed wall clock time (s) | Maximum resident set size (kbytes) | Thread attach log time |
|---|---|---|---|
| Stock OpenJDK | 0.17 | 60696 | 0.021s |
| MMTk - sparse chunk map (master) | 0.52 | 839596 | 0.349s |
| MMTk - sparse chunk map (PR) | 0.53 | 838964 | 0.349s |
| MMTk - dense chunk map (PR) | 0.31 | 317160 | 0.163s |
| MMTk - space map (PR) | 0.30 | 314604 | 0.151s |
This seems to suggest we are still using much more memory at boot time. We may want to look into this further.
So we still aren't as fast as stock OpenJDK either. I presume these are results from running an empty Java file/class?
I ran without a class (so it fails). And I just tried with an empty class. The results are the same.
A totally empty class for me using the PR has the same performance:
```
Command being timed: "./build/linux-x86_64-normal-server-release/images/jdk/bin/java Empty.java"
User time (seconds): 0.65
System time (seconds): 0.03
Maximum resident set size (kbytes): 80440
[...]
Command being timed: "./build/linux-x86_64-normal-server-release/images/jdk/bin/java -XX:+UseThirdPartyHeap Empty.java"
User time (seconds): 0.66
System time (seconds): 0.09
Maximum resident set size (kbytes): 344428
```

```
$ cat Empty.java
class Empty {
    public static void main(String[] args) {
    }
}
```
Still has 4x memory usage though.
Even the attach time of both threads is the same in my build. Just checking -- your builds were built using `make images` or `make product-bundles`, right?
I was using the exploded image. The attach time of the first thread is always `0.001s`, but the attach time for the second thread depends on the MMTk initialization speed.
> Still has 4x memory usage though.
Yeah. But the memory overhead might be from somewhere else. This PR should minimize the memory overhead of the SFT map.
> I was using the exploded image. The attach time of the first thread is always 0.001s, but the attach time for the second thread depends on the MMTk initialization speed.
Yeah, as we've found before, benchmarking using the exploded image is a bit questionable. At this point it seems like the slow start-up issue is solved (on the product images), but there is some other issue that is increasing the memory usage.
@k-sareen I tried `/usr/bin/time` with an image from `make images` on a different machine. So the numbers are different. But it is still clear that we are using more memory and have a slower boot time.
| | Elapsed wall clock time (s) | Maximum resident set size (kbytes) |
|---|---|---|
| Stock OpenJDK | 0.07 | 35924 |
| MMTk - space map (PR) | 0.15 | 297384 |
So our resident set size is consistent (~300M), but with `make images`, the stock GC's resident set size is smaller (from 60M to 35M). Our boot-time overhead is roughly 80ms (`VMMap::new()` takes ~60ms on that machine).