Benchmarks & examples
After a discussion with @penzn and @RossTate, we realized it would be helpful to catalog some resources for auto-vectorizable and hand-vectorized code. These should be useful for benchmarking and staring at as examples to judge the flexibility of the "long vector" proposal.
Probably the canonical benchmark suite for small, vectorizable kernels is MediaBench.
It's old and dusty, though, and the code is just a big bunch of undifferentiated C, so it's not all that helpful at revealing tight, vectorizable inner loops.
Something that might be a bit more helpful is VectorBench, from the MIT folks who have been working on the problem of "revitalizing" vectorized code hand-tuned for old SIMD ISAs so they can be recompiled for new SIMD ISAs.
Specifically, a good place to start would be the code they adapted from this repository, which contains a great deal of hand-vectorized image processing kernels.
Unfortunately, even these simple kernels reveal how maddeningly complex hand-vectorized code can truly be. For example, binarizing an image (taking a grayscale image and rounding to a black-and-white image) should be super simple, right? Here's the core loop for that: https://github.com/ermig1979/Simd/blob/ab23fb98f5bebfeeef170c8abbd1ab9d596b0246/src/Simd/SimdAvx512bwBinarization.cpp#L63-L74
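For contrast, the scalar version of that binarization loop is only a few lines. Here's a sketch (the function name and parameter scheme are my own for illustration, not the Simd library's actual API):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar binarization sketch: map each grayscale pixel to `positive`
 * if it exceeds `value`, else to `negative`. (Illustrative names,
 * not the Simd library's real interface.) */
void binarize_scalar(const uint8_t *src, size_t n, uint8_t value,
                     uint8_t positive, uint8_t negative, uint8_t *dst)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] > value ? positive : negative;
}
```

The vectorized AVX-512 version linked above turns that one-line ternary into masked compares, blends, and edge handling, which is exactly the complexity gap worth measuring.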
And of course it gets way more gnarly if you're doing something less perfectly elementwise, like a stencil/filter type of thing: https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdAvx2Sobel.cpp
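For reference, the scalar form of a Sobel-style stencil makes the trouble visible: each output pixel reads a 3x3 neighborhood, so a vectorized version needs overlapping, shifted loads from three rows. A sketch (interior pixels only, not the Simd library's code):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar 3x3 Sobel x-gradient sketch over a w-by-h image, interior
 * pixels only. Each output depends on nine inputs, which is what
 * forces the shifted/overlapping loads in the vectorized version. */
void sobel_x_scalar(const uint8_t *src, int w, int h, int16_t *dst)
{
    for (int y = 1; y < h - 1; ++y) {
        const uint8_t *r0 = src + (size_t)(y - 1) * w;
        const uint8_t *r1 = src + (size_t)y * w;
        const uint8_t *r2 = src + (size_t)(y + 1) * w;
        for (int x = 1; x < w - 1; ++x) {
            dst[y * w + x] = (int16_t)(
                (r0[x + 1] - r0[x - 1]) +
                2 * (r1[x + 1] - r1[x - 1]) +
                (r2[x + 1] - r2[x - 1]));
        }
    }
}
```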
Perhaps a better way to start is to look at code that should be amenable to auto-vectorization but that is not yet hand-tuned with vector intrinsics. For example, NPB is an old and small benchmark suite with this characteristic.
VectorBench includes the OpenMP-annotated NPB suite. You can find such intriguing loop nests as this: https://github.com/revec/VectorBench/blob/77073a06779a1dbd948bd5f5a37660946ae07750/scalar/NPB3.0-omp-C/BT/bt.c#L241-L255
Of course, it's up to the reader's imagination to divine the best way to vectorize these loops. To that end, I think the revec paper itself can be helpful.
For example, take a gander at Figure 1, which shows the original scalar code, a vectorized version for SSE2, and the desired vectorization for two other ISAs (AVX2 and AVX-512).
Also, just because it came up on HN today, here's a nice explanation of a somewhat bonkers optimized fixed-length reduction that's worth pondering.
@sampsyo there is a conversation about code to compile in https://github.com/WebAssembly/flexible-vectors/pull/27#issuecomment-784051647. There are a few interesting things in that PR; I am trying to separate them out a bit. From @lemaitre:
I know that's the tricky bit... Here is a small list of algorithms that definitely need to be implementable:
- gemm (matrix multiplication)
- reductions (not necessarily additions)
- Stencils
- prefix sums (1D)
- Integral images (like prefix sums but in 2D)
- interpolations
- On-the-fly narrowing or widening (i.e., chaining conversions with computation instead of having a separate conversion step)
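Several of these (prefix sums, integral images) share a loop-carried dependence that defeats naive elementwise vectorization; the scalar form makes the dependence obvious. A minimal sketch:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar inclusive prefix sum. The dependence of out[i] on out[i-1]
 * means a vector implementation can't be a plain elementwise map; it
 * needs in-register shifts/shuffles (e.g. the classic log2(width)
 * shift-and-add pattern), which is why this is a good stress test. */
void prefix_sum_scalar(const int32_t *in, size_t n, int32_t *out)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += in[i];
        out[i] = acc;
    }
}
```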
I am also personally interested in more algorithms:
- Run-Length Encoding
- Union-Find (graph processing)
- Background subtraction (like Sigma-Delta)
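On the last item: sigma-delta background estimation nudges each background pixel one step toward the current frame, a branchy elementwise update that maps naturally onto vector compares and selects. A scalar sketch (my own simplification; the full algorithm also maintains a variance image to classify foreground):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar sigma-delta background update sketch: move each background
 * pixel one step toward the current frame. (Simplified: the full
 * algorithm also tracks per-pixel variance for foreground detection.) */
void sigma_delta_update(const uint8_t *frame, uint8_t *bg, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (bg[i] < frame[i])
            bg[i]++;
        else if (bg[i] > frame[i])
            bg[i]--;
    }
}
```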
Interesting discussion; thanks for the ping! This is a good list from @lemaitre. To break things down a bit, I sort of think there are two competing goals here:
- Apples-to-apples comparisons with native (and fixed-width WASM) SIMD, for which "standard" benchmark suites are the right thing because of their availability and portability.
- More realistic niche use cases, which can be oddballs that "stretch" the ISA a bit more to reveal interesting/challenging cases that are likely to arise in the wild.
To make things more complicated, there is also a separate dimension to distinguish "algorithms" (kernels, such as GEMM) from "applications" (such as an H.264 encoder). The former is probably the better focus, at least early on.
I think the thing that makes benchmarking SIMD ISAs/algorithms especially hard (compared to "normal" benchmarks) is that there is a huge range in implementation qualities. A naively auto-vectorized version of a simple 3x3 matrix product, for example, will be far worse than an expert-tuned implementation—and the expert-tuned implementation will differ radically for different ISA targets! Which means that, if you invent a new SIMD ISA (or WASM extension), it can be really hard to say whether it's "better" without doing a bunch of incredibly painful tuning for that architecture.
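To make the implementation-quality point concrete, here is the naive triple loop an auto-vectorizer starts from; an expert-tuned BLAS kernel (tiling, packing, ISA-specific register blocking) looks nothing like it, even though both compute the same thing. A sketch:

```c
#include <stddef.h>

/* Naive GEMM triple loop: C += A * B, all row-major n-by-n.
 * This is the baseline an auto-vectorizer sees. The k-then-j loop
 * order at least gives unit-stride access to B and C, but a
 * hand-tuned kernel restructures this beyond recognition. */
void gemm_naive(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k)
            for (size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```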
The problem is that the straightforward/obvious implementations of a given benchmark could be misleading w/r/t what the "peak" performance could be, given infinite programmer effort. "Infinite programmer effort" may seem like a fantasy, but in practice, the world has shown that it is willing to expend seemingly-infinite effort for high-value libraries like BLAS. That's not feasible to do for early-stage designs, so there will inevitably be some artistic guesswork involved about how much a given ISA feature will matter for industrial-strength vectorized implementations.