Speeding up `intrinsic-test`
The `intrinsic-test` crate runs incredibly slowly, both on CI and locally. I'd like to speed that up.
The code also shows signs of its age, and I think we can do a much better job today. Based on some rough profiling, the main bottleneck appears to be the compilation of 3K+ C++ files into executables. On my machine each file takes roughly 280ms to compile. By using C instead and compiling to an object file, I'm able to get a ~4x speedup.
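The object-file compile step I measured is roughly something like this on the generator side (the `clang` invocation below is illustrative; the exact flags and target options will differ):

```rust
use std::process::Command;

// Sketch: compile one generated C file to an object file (`-c`) instead of
// building a full C++ executable per intrinsic. Target/arch flags omitted.
fn compile_to_object(c_file: &str, obj_file: &str) -> std::io::Result<bool> {
    let status = Command::new("clang")
        .args(["-c", "-O2", c_file, "-o", obj_file])
        .status()?;
    Ok(status.success())
}
```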
My idea is to emit C files like this (we emit many C files because clang won't parallelize its workload by itself):
```c
#include <arm_neon.h>
#include <arm_acle.h>
#include <arm_fp16.h>

const uint32_t a_vals[] = {
    0x0,
    0x800000,
    0x3effffff,
    0x3f000000,
    // ...
};

const uint8_t b_vals[] = {
    0x0,
    0x1,
    0x2,
    0x3,
    // ...
};

uint32_t __crc32b_output[20] = {};

extern uint32_t *c___crc32b_generate(void) {
    for (int i = 0; i < 20; i++) {
        __crc32b_output[i] = __crc32b(a_vals[i], b_vals[i]);
    }
    return __crc32b_output;
}
```
Then for Rust, we can generate all the tests in one binary (not sure if splitting into files is useful there; it might be) and link it against the C object files. The final check of the output can then happen in Rust (calling the Rust and C versions of the test and comparing the results). Crucially, this means we only need to compile the formatting logic once (and in Rust, so it's trivially consistent).
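To sketch what the Rust side could look like (assuming the C object above is linked in; `rust___crc32b_generate` is a stand-in for the generated Rust equivalent):

```rust
// Declaration of the function exported by the generated C object file above.
extern "C" {
    fn c___crc32b_generate() -> *const u32;
}

// Stand-in for the generated Rust test: run __crc32b over the same input tables.
fn rust___crc32b_generate() -> [u32; 20] {
    unimplemented!()
}

fn check___crc32b() -> bool {
    let rust_out = rust___crc32b_generate();
    // SAFETY: the C function returns a pointer to its static 20-element output array.
    let c_out = unsafe { std::slice::from_raw_parts(c___crc32b_generate(), 20) };
    rust_out.as_slice() == c_out
}
```

Presumably we'd then only need to render values at all when a mismatch is found.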
cc @adamgemmell @Jamesbarford if you have thoughts on this idea, or other ways to speed up this program.
I think your comment may have been cut off:

> Looking at it now, we might even wan

?
Ah, yeah, I had another thought that we may want to generate number-of-cores files instead of 3k+ separate files.
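Something like this on the generator side, i.e. chunk the intrinsic list by `std::thread::available_parallelism()` and emit one file per chunk (the helper below is just for illustration):

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Illustrative helper: split the intrinsic list into roughly one chunk per
/// core, so we emit and compile ~number-of-cores files instead of 3k+.
fn chunk_intrinsics(intrinsics: &[String]) -> Vec<&[String]> {
    let n_cores = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    // Round up so every intrinsic lands in some chunk; guard against an empty list.
    let chunk_size = intrinsics.len().div_ceil(n_cores).max(1);
    intrinsics.chunks(chunk_size).collect()
}
```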
Hi, that would be really helpful, especially as the number of intrinsics tested grows. Thanks!
We've had a chat with people in Arm who've worked on the tool and agree there are likely speed gains to be made here. intrinsic-test was originally built as a generator like stdarch-gen-arm and repurposed later on. C++ was originally used just because it was easier to override some behaviour related to printing values to ensure they matched Rust's output; you can see some templates and overrides of the stream operators in every file. Converting that to C while keeping the printing behaviour may be possible, or it may be tricky.
Using C sounds like a good step, but there may be some gnarly merge conflicts with the current x86 intrinsic-test GSoC project (https://github.com/rust-lang/stdarch/pull/1814), so perhaps it's not a good time to be making those changes.
I also thought that some speed could be gained by building fewer, larger binaries but I wasn't sure at what point linking would eat away at those gains. Some experimentation in that area sounds good.
Similarly, avoiding printing and instead linking C/Rust together sounds like a good step. There was an issue recently (https://github.com/rust-lang/stdarch/pull/1813) which I think was caused by the test tool generating bad test files but tolerating empty output from them. I've been meaning to make the tool verify that the output isn't empty as a sanity check, but that might be harder with the array you demonstrate.
Some easy gains could be made by making the aarch64 job run on an aarch64 runner, but the downside of doing this is that x86 devs would no longer be able to run the job locally.
Also, many intrinsic test files don't follow exactly that format and have multiple loops present. Any intrinsic with a Constraint (basically a const value with a range) does this; take a look at vdup_lane_* for example.
In https://github.com/rust-lang/stdarch/pull/1856, CI time for aarch64-unknown-linux-gnu is down from 28 minutes to 8 minutes. We could still go faster, but at that point it is no longer the longest-running CI job. Locally it runs in about 110 seconds for me (x86_64, 16 cores). Decent, but I'm still not happy with that.
The basic approach there is to generate number-of-cores many .cpp/.rs files and feed those to the respective compilers. Both binaries contain a massive switch on the intrinsic name and then run the corresponding test.
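Concretely, the dispatch inside each generated binary is roughly shaped like this (names are illustrative, and I'm assuming here that the intrinsic name arrives as a command-line argument):

```rust
// Illustrative stubs; the real generated functions run the intrinsic over the
// input tables and emit the results.
fn test___crc32b() { /* ... */ }
fn test_vdup_lane_s8() { /* ... */ }

fn main() {
    let name = std::env::args().nth(1).expect("intrinsic name as first argument");
    match name.as_str() {
        "__crc32b" => test___crc32b(),
        "vdup_lane_s8" => test_vdup_lane_s8(),
        // ... one arm per intrinsic in this chunk ...
        other => panic!("unknown intrinsic: {other}"),
    }
}
```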
I think we can achieve much better performance by not spawning 7K processes on those poor CI machines, but that can wait. Merging this will probably take a while; getting the changes into something reviewable is tough.
Anyway, nicer things are possible.