NGen
SIMD Intrinsics in the JVM
Artifact description
Submission and reviewing guidelines and methodology: http://cTuning.org/ae/submission-20160509.html
Abstract
To reproduce the results presented in our work, we provide an artifact that consists of two parts:
- lms-intrinsics: a precompiled jar library that includes all Intel-based SIMD intrinsics functions, implemented as Scala eDSLs in LMS.
- NGen: a runtime implemented in Scala and Java that enables the use of lms-intrinsics in the JVM and includes the experiments discussed in our work.
The SIMD based eDSLs follow the modular design of the LMS framework and
are implemented as an external LMS library, separated from the JVM
runtime. This allows a stand-alone use of lms-intrinsics, enabling LMS
to generate x86 vectorized code outside the context of the JVM. The
JVM runtime (NGen) demonstrates the use of the lms-intrinsics by
providing the compiler pipeline to generate, compile, link and execute
the LMS-generated SIMD code and has a strong dependency on this library.
The experiments included in the artifact come in the form of microbenchmarks. While the most convenient deployment for this artifact would have been a Docker image through Collective Knowledge, we decided to eliminate the overhead imposed by containers and provide a bare-metal deployment that aims at results that are as precise as possible for our tests. To achieve that, we use SBT (Simple Build Tool) to build and execute our experiments.
Description
Check-list (artifact meta information)
- Algorithm: Using SIMD intrinsics in the JVM. Experiments include dot-product on quantized arrays, BLAS routines: SAXPY and Matrix-Matrix-Multiplication.
- Compilation: lms-intrinsics is a precompiled library, compiled with Scala 2.11 and available as a jar bundle, accessible through Maven. NGen requires Scala 2.11 and Java 1.8 for compilation. Both NGen and lms-intrinsics generate C code that is compiled with GCC, ICC or LLVM.
- Transformations: To make SIMD instructions available in the JVM, NGen uses LMS as a staging framework. The user writes vectorized code as an eDSL in Scala and NGen stages the code through multiple compile phases before execution.
- Binary: lms-intrinsics is a jar bundle. NGen includes binaries for SBT v0.13.6, as well as a small library for CPUID inspection and Sigar v1.6.5_01 (System Information Gatherer And Reporter https://github.com/hyperic/sigar) binaries. NGen has various dependencies on precompiled libraries that include BridJ, Apache Commons, ScalaMeter, Scala Virtualized, LMS and finally lms-intrinsics. SBT automatically pulls all dependencies and their corresponding versions.
- Data set: Our experiments operate with random data, requiring no data set.
- Run-time environment: lms-intrinsics can run on any JVM that supports LMS and any operating system supported by the same JVM. Similarly, NGen could work in any JVM that supports LMS, reflection and native code invocation; however, our focus has been on the HotSpot JVM only, supporting Windows, Linux and Mac OS X. Our results are most conveniently replicated on a Unix environment.
- Hardware: The code generated by NGen and lms-intrinsics can run on any x86 and x86-64 architecture that supports at least one subset of the Intel intrinsics functions. We recommend a Haswell machine for verifying the results presented in the paper to obtain comparable results.
- Run-time state: We perform our tests using a warm-cache scenario, warming the code and data cache many times before measurements begin. We advise replicating our experiments with minimal interference from other applications running on the system, and with technologies for frequency scaling and resource sharing disabled.
- Output: NGen generates a performance profile of each algorithm presented in this paper.
- Experiment workflow: We use SBT not only to compile the code, but also to run the experiments.
- Experiment customization: Customisation is certainly possible and can be easily achieved by implementing any vectorized code as a Scala eDSL.
- Publicly available: Yes
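For orientation, the two BLAS routines named in the check-list can be written as plain scalar loops. The following is a minimal Java sketch of what SAXPY and Matrix-Matrix-Multiplication compute; it is not taken from the NGen sources, and the class name, method names, and row-major layout are assumptions made for illustration.

```java
// Plain-Java reference kernels (illustrative; not taken from the NGen sources).
final class ReferenceKernels {

    // SAXPY: y[i] = a * x[i] + y[i], updating y in place.
    static void saxpy(float a, float[] x, float[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        for (int i = 0; i < x.length; i++) {
            y[i] = a * x[i] + y[i];
        }
    }

    // MMM: returns C = A * B for square n x n matrices in row-major order (assumed layout).
    static float[] mmm(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                float aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f};
        float[] y = {10f, 20f, 30f};
        saxpy(2f, x, y);
        System.out.println(java.util.Arrays.toString(y));
    }
}
```

Scalar loops of this kind are a useful correctness reference when checking the output of a vectorized implementation on small inputs.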
How delivered
The precompiled SIMD eDSLs library, as well as our JVM runtime, including the supporting experiments are publicly available through GitHub, on the following links:
Note that lms-intrinsics is also available through Maven, and can be
used through SBT directly:
libraryDependencies += "ch.ethz.acl" %% "lms-intrinsics" % "0.0.3-SNAPSHOT"
Hardware dependencies
lms-intrinsics as well as NGen are able to generate C code that
can run on x86 and x86-64 architectures supporting Intel ISAs.
However, the full set of our experiments requires at least a Haswell
machine. Namely:
- SAXPY and MMM algorithms are implemented using AVX and FMA ISAs, and therefore require at least a Haswell processor. Broadwell, Skylake, Kaby Lake or later would also work.
- The dot product of the quantized arrays relies on AVX2 and FMA flags, but also uses the hardware random number generator, requiring the RDRAND ISA, as well as FP16C to deal with half-precision floats.
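On a Linux host, the ISA requirements listed above can be cross-checked against the `flags` line of `/proc/cpuinfo` before running anything. The sketch below is an illustration only: it assumes a Linux machine, uses the flag spellings that the Linux kernel reports (`avx`, `avx2`, `fma`, `rdrand`, `f16c`), and the class and method names are hypothetical. It does not replace the CPUID inspection that NGen performs itself.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class IsaFlags {
    // Flag names as spelled in Linux /proc/cpuinfo (assumption: Linux host).
    static final List<String> SAXPY_MMM = Arrays.asList("avx", "fma");
    static final List<String> QUANTIZED_DOT = Arrays.asList("avx2", "fma", "rdrand", "f16c");

    // Returns the required flags that are absent from a cpuinfo "flags" line.
    static List<String> missing(String flagsLine, List<String> required) {
        int colon = flagsLine.indexOf(':');
        String body = colon >= 0 ? flagsLine.substring(colon + 1) : flagsLine;
        Set<String> present = new HashSet<>(Arrays.asList(body.trim().split("\\s+")));
        List<String> absent = new ArrayList<>();
        for (String flag : required) {
            if (!present.contains(flag)) {
                absent.add(flag);
            }
        }
        return absent;
    }

    public static void main(String[] args) throws Exception {
        Path cpuinfo = Paths.get("/proc/cpuinfo");
        if (!Files.isReadable(cpuinfo)) {
            System.out.println("/proc/cpuinfo is not available on this system");
            return;
        }
        for (String line : Files.readAllLines(cpuinfo)) {
            if (line.startsWith("flags")) {
                System.out.println("missing for SAXPY/MMM: " + missing(line, SAXPY_MMM));
                System.out.println("missing for quantized dot product: " + missing(line, QUANTIZED_DOT));
                return;
            }
        }
    }
}
```

An empty list for both groups suggests that the machine can run the full set of experiments.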
We recommend disabling Intel Turbo Boost and Hyper-Threading technologies to avoid the effects of frequency scaling and resource sharing on the measurements. Note that these technologies can be easily disabled in the BIOS settings of machines that have BIOS firmware. Many Apple machines, such as the MacBook, do not have user-accessible BIOS firmware and can only disable Turbo Boost using external kernel modules such as Turbo Boost Switcher (https://github.com/rugarciap/Turbo-Boost-Switcher).
Software dependencies
lms-intrinsics is a self-contained precompiled library and all of its
software dependencies are handled automatically through Maven tools such
as SBT. To build and run NGen, the following dependencies must be met:
- Git client, used by SBT to resolve dependencies.
- Java Development Kit (JDK) 1.8 or later.
- C compiler such as GCC, ICC or LLVM.
After installing the dependencies, it is quite important to have the
binary executables available in the $PATH. This way the SBT tool will
be able to process all compilation phases as well as to execute the
experiments. Make sure that the following commands work on your
terminal:
git --version
gcc --version
java -version
javac -version
It is also important to ensure that the installed JVM has an architecture
that GCC can compile for. This is particularly important for Windows
users: a 32-bit MinGW port of GCC will fail to compile code for a 64-bit
JVM.
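One way to see which architecture the installed JVM reports is to print the standard `os.arch` system property and compare it with the target that the C compiler reports (for GCC, `gcc -dumpmachine`). The sketch below is a small helper under stated assumptions: the class name is hypothetical, and the list of values treated as 64-bit covers common cases only and is not exhaustive.

```java
final class JvmArch {
    // Classifies an os.arch value; the accepted names are common values, not an exhaustive list.
    static boolean is64Bit(String osArch) {
        String arch = osArch.toLowerCase();
        return arch.equals("amd64") || arch.equals("x86_64") || arch.equals("aarch64");
    }

    public static void main(String[] args) {
        String arch = System.getProperty("os.arch");
        String verdict = is64Bit(arch) ? "64-bit" : "not recognized as 64-bit";
        System.out.println("JVM architecture: " + arch + " (" + verdict + ")");
    }
}
```

If the JVM reports a 64-bit architecture while the compiler targets a 32-bit one, the generated native code is unlikely to load.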
Installation
The artifact can be cloned from the GitHub repository:
git clone https://github.com/astojanov/NGen
The artifact already includes a precompiled version of SBT. Therefore, to start the SBT console, we run:
cd NGen
# For Unix users:
./bin/sbt/bin/sbt
# For Windows users
bin\sbt\bin\sbt.bat
Once started, we can compile the code using:
> compile
Once invoked, SBT will automatically pull lms-intrinsics as well as
all other dependencies and start the compilation.
Experiment workflow
Once SBT compiles the code, we can proceed with evaluating our
experiments. We do this through the SBT console. To inspect the testing
machine through NGen runtime we use:
> test-only cgo.TestPlatform
The runtime will be able to inspect the CPU, identify available ISAs and compilers and inspect the current JDK. If the test platform is successfully identified, we can continue with the experiments.
Generating SIMD eDSLs.
The lms-intrinsics bundle includes the automatic generator of SIMD
eDSLs, invoked by:
> test-only cgo.GenerateIntrinsics
The Scala eDSLs (coupled with statistics) will be generated in the
Generated_SIMD_Intrinsics folder.
Explicit vectorization in the JVM.
To run the experiments depicted in our work, we use:
> test-only cgo.TestSaxpy
> test-only cgo.TestMMM
> test-only cgo.TestPrecision
For the SAXPY algorithm, we also provide an architecture-independent implementation that can be used if the testing machine is not Haswell-based:
> test-only cgo.TestMultiSaxpy
Each result shows the size of our microbenchmarks, and the obtained performance in flops/cycle.
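To make the flops/cycle unit concrete, the sketch below converts a problem size and a measured run time into flops/cycle. It is an illustration and not NGen's measurement code: it assumes the usual flop counts of 2n for SAXPY and 2n^3 for n x n matrix multiplication, and it estimates cycles as elapsed time multiplied by a fixed clock frequency, which is only meaningful when frequency scaling is disabled. All names and the numbers in `main` are hypothetical.

```java
final class FlopsPerCycle {
    // SAXPY on n elements: one multiply and one add per element.
    static long saxpyFlops(long n) {
        return 2L * n;
    }

    // n x n matrix multiplication: n^3 multiply-add pairs.
    static long mmmFlops(long n) {
        return 2L * n * n * n;
    }

    // Cycles are estimated as elapsed seconds times a fixed clock frequency in Hz (assumption).
    static double flopsPerCycle(long flops, double seconds, double frequencyHz) {
        return flops / (seconds * frequencyHz);
    }

    public static void main(String[] args) {
        long n = 1L << 20;          // hypothetical SAXPY size
        double seconds = 250e-6;    // hypothetical measured run time
        double frequencyHz = 3.0e9; // hypothetical fixed clock frequency
        System.out.printf("%.3f flops/cycle%n",
                flopsPerCycle(saxpyFlops(n), seconds, frequencyHz));
    }
}
```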
Evaluation and expected result
In the evaluation of the experiment workflow, we expect LMS to produce
correct vectorized code using lms-intrinsics. Furthermore, we expect
our performance results to be consistent with the results
shown in this work, outperforming the JVM on the microarchitectures that
support our experiments. Finally, we expect the automatic generation of
eDSLs to be easily adjustable to subsequent updates of the Intel
Intrinsics specifications.
Experiment customization
There are many opportunities for customization. We can use NGen to
easily develop vectorized code, and we can use ScalaMeter to adjust the
current benchmarks.
Developing SIMD code.
The NSaxpy.scala class, available in src/ch/ethz/acl/ngen/saxpy/,
provides detailed guidelines for the usage of SIMD in Scala. Following
the comments in the file, as well as the structural flow of the program,
one can easily modify the skeleton to perform other types of vectorized
computations.
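When adapting a kernel to a new computation, it can help to keep the usual loop shape of SIMD code in mind: a main loop over full vector-width blocks, followed by a scalar loop for the remaining elements. The sketch below shows that shape in plain Java for SAXPY. It uses no lms-intrinsics API, it is not the code in NSaxpy.scala, and whether NSaxpy.scala is organized this way is an assumption; the block width of 8 corresponds to the number of single-precision floats in a 256-bit AVX register.

```java
final class BlockedSaxpy {
    static final int LANES = 8; // 8 single-precision floats fit in one 256-bit AVX register

    // Computes y[i] = a * x[i] + y[i] in blocks of LANES elements plus a scalar tail.
    static void saxpy(float a, float[] x, float[] y) {
        int n = Math.min(x.length, y.length);
        int blockedEnd = n - (n % LANES);

        // Main loop: each outer iteration stands in for one vector-wide update.
        for (int i = 0; i < blockedEnd; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                y[i + lane] = a * x[i + lane] + y[i + lane];
            }
        }

        // Scalar tail: handles the n % LANES elements that do not fill a block.
        for (int i = blockedEnd; i < n; i++) {
            y[i] = a * x[i] + y[i];
        }
    }

    public static void main(String[] args) {
        float[] x = new float[11];
        float[] y = new float[11];
        java.util.Arrays.fill(x, 1f);
        saxpy(3f, x, y);
        System.out.println(java.util.Arrays.toString(y));
    }
}
```

Forgetting the tail loop is a common source of wrong results for input sizes that are not a multiple of the vector width.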
Customizing Benchmarks.
Each performance experiment uses ScalaMeter and is implemented as a
Scala class. The Matrix-Matrix-Multiplication includes BenchMMM.scala
located in src/ch/ethz/acl/ngen/mmm/. The implementation allows changes
to various aspects of the benchmarks, including the size and the values
of the input data, warm up times, different JVM invocations, etc.
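The role of the warm-up setting can be illustrated without ScalaMeter. The following plain-Java sketch runs a kernel a number of times untimed, so that the JIT compiler and the caches reach a steady state, and only then records timed runs. It is not the ScalaMeter API and not the code in BenchMMM.scala; the class name, method name, and parameters are hypothetical.

```java
import java.util.Arrays;

final class WarmBench {
    // Runs `warmups` untimed iterations, then returns the median (upper median
    // for even counts) of `runs` timed iterations, in nanoseconds.
    static long medianNanos(Runnable kernel, int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmups; i++) {
            kernel.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            kernel.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        float[] x = new float[1 << 16];
        float[] y = new float[1 << 16];
        Runnable saxpy = () -> {
            for (int i = 0; i < x.length; i++) {
                y[i] += 2f * x[i];
            }
        };
        System.out.println("median ns: " + medianNanos(saxpy, 1000, 101));
    }
}
```

Increasing the warm-up count typically reduces the variance of the timed runs, at the cost of a longer benchmark.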