Ability to set CPU configuration at runtime
I see that you can use BLIS_ARCH_DEBUG=1 to see which CPU configuration was selected at runtime, but it would be handy to be able to set the CPU configuration at runtime instead of recompiling. The goal is to act like MKL's MKL_CBWR environment variable, which allows you to specify the instruction set at runtime. This is useful when trying to create reproducible results across different machine types. For instance, a Haswell machine can use AVX2 but not AVX-512. If you wanted a program that ran across a heterogeneous set of Haswell and Skylake machines and produced the same result everywhere, you would need the software built on the Skylake nodes to use the haswell configuration. I would like to be able to specify that at runtime, so the program would only use AVX2 instructions, but leave the software configured for auto. This would allow me to run using AVX2 for some experiments and AVX-512 for others. I went through the documentation and did not find a way to tweak this setting at runtime, so if this already exists, please point me to the proper documentation.
Also https://github.com/flame/blis/pull/351 was merged, so you can update https://github.com/flame/blis/blob/2d8ec164e7ae4f0c461c27309dc1f5d1966eb003/frame/base/bli_arch.c#L78
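For readers wondering why the instruction set would affect results at all: one common reason is that floating-point addition is not associative, so a kernel that accumulates in more vector lanes may round differently from one that accumulates in fewer. The sketch below is only an illustration of that effect in plain Python; the function sum_in_lanes and its lane model are hypothetical and are not how BLIS kernels are actually written.

```python
def sum_in_lanes(values, lanes):
    """Sum `values` using `lanes` independent partial sums, then combine them.

    Hypothetical model of a vectorized accumulation: element i goes to
    partial sum i % lanes. This is an illustration, not BLIS code.
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total


# The same four numbers, summed with one accumulator and with two.
values = [1e16, 1.0, -1e16, 1.0]
narrow = sum_in_lanes(values, 1)  # strictly left-to-right
wide = sum_in_lanes(values, 2)    # even and odd positions summed separately

print(narrow, wide, narrow == wide)
```

The two calls see identical inputs but group the additions differently, which is enough to change the rounded result.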
@decandia50 if you configure using e.g. configure intel then it will compile in all the Intel architectures and select the proper one at runtime. While this isn't exactly the feature you asked for, it sounds like it would solve part of your problem. What you wouldn't get is e.g. using AVX2 on SkylakeX instead of AVX-512, but it's not clear why you would want to do that.
@devinamatthews thanks for the response, but that's exactly what I'm trying to avoid doing. As you note, I can recompile down to a known common denominator of CPU instructions, but what I'm trying to accomplish is to use a specific set of instructions at runtime without the need to recompile/redistribute my software (the code I need is already in there, but I can't select it myself).
For a scenario: let's say I care about numerical reproducibility with respect to floating point, and that I have a large HPC cluster with heterogeneous host/CPU types. Some support AVX2, some support AVX-512. In a common scenario I will have tools like numpy linked against BLIS, and I will make a calculation like np.linalg.norm(A@B) as part of some regression test suite. What I would expect is that, given a known A and B, each host in the cluster would be able to reproduce the same result. However, because BLIS will autodetect the CPU and use AVX2 in some cases and AVX-512 in others, there is no way to specify at runtime that I care more about numerical reproducibility than performance without recompiling and redistributing the code, dependencies, and libraries to all hosts. In a large batch-like HPC system you may see many workloads. Some will desire absolute performance and will want AVX-512; others will require reproducibility and only require the lowest common instruction set. For folks who care about floating-point reproducibility this is fairly important. As was once told to me, "diff is a wonderful debugging tool".
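A minimal sketch of the kind of regression check described above, assuming numpy is installed. The helper name result_fingerprint, the seed, and the matrix size are all made up for illustration; a real suite would more likely load a fixed A and B from a file so the inputs are guaranteed identical on every host.

```python
import numpy as np


def result_fingerprint(seed=0, n=64):
    """Return an exact, diff-friendly fingerprint of np.linalg.norm(A @ B).

    A and B are generated from a fixed seed (an assumption made for this
    sketch). float.hex() keeps every bit of the result, so two hosts
    agree on the string only if they agree on the value bit for bit.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    value = np.linalg.norm(a @ b)
    return float(value).hex()


if __name__ == "__main__":
    # Write this line to a file on each host and compare the files with diff.
    print(result_fingerprint())
```

Comparing the hexadecimal form rather than a rounded decimal avoids hiding a last-bit difference behind print formatting.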
Oh, I did not see that reproducibility is the main issue. I think this feature should be relatively easy to add, but I can't hazard a guess on a timeline. Do note that, for a reproducible answer, you will also need to run with the same number of threads on each machine.
Do note that, for a reproducible answer, you will also need to run with the same number of threads on each machine.
Indeed. The thread count is very important for reproducibility. In many cases where reproducibility is required the BLAS functions are often run single threaded to simplify things; e.g. BLIS_NUM_THREADS=1 BLIS_JC_NT=1 BLIS_IC_NT=1 BLIS_JR_NT=1 BLIS_IR_NT=1
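One way to apply the single-threaded settings above from a Python driver is to build the environment for a child process, so the variables are already in place when the BLIS-linked program starts. This is a sketch: the helper names are hypothetical, and only the five environment variable names come from the comment above.

```python
import os
import subprocess

# Variable names taken from the comment above; each is set to "1".
SINGLE_THREAD_VARS = (
    "BLIS_NUM_THREADS",
    "BLIS_JC_NT",
    "BLIS_IC_NT",
    "BLIS_JR_NT",
    "BLIS_IR_NT",
)


def single_threaded_env(base=None):
    """Return a copy of `base` (default: os.environ) with all five set to "1"."""
    env = dict(os.environ if base is None else base)
    for name in SINGLE_THREAD_VARS:
        env[name] = "1"
    return env


def run_single_threaded(argv):
    """Run `argv` as a child process with the single-threaded environment."""
    return subprocess.run(argv, env=single_threaded_env(), check=True)
```

Copying the environment instead of mutating os.environ keeps the setting local to the child, so performance-oriented jobs launched from the same driver are unaffected.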
@decandia50 I've added support for observing the BLIS_ARCH_TYPE environment variable to manually override the automatic subconfiguration selection mechanism. (See commit 2a0682f.) Please note that it must be set to (1) an arch_t id value within the defined range, 0 to BLIS_NUM_ARCHS-1, as defined in frame/include/bli_type_defs.h (note that these enum values may change in future commits!), and (2) an arch_t id value that corresponds to a subconfiguration that is actually compiled into the library to which the executable was linked. If either condition is not met, BLIS will abort with an error message.
Note that you can still use BLIS_ARCH_DEBUG to confirm the subconfiguration selected, whether it is configure-determined, automatic at runtime, or manually overridden.
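The two conditions on BLIS_ARCH_TYPE can be mirrored in a launcher script so that a bad value is caught before the job starts. The sketch below is an assumption-laden illustration: check_arch_type and arch_override_env are hypothetical helpers, and the caller must supply num_archs and compiled_ids by hand from their own build, since the enum values live in frame/include/bli_type_defs.h and may change between commits.

```python
def check_arch_type(value, num_archs, compiled_ids):
    """Validate a candidate BLIS_ARCH_TYPE value against the two conditions.

    num_archs and compiled_ids are supplied by the caller (assumptions of
    this sketch); nothing here queries the BLIS library itself.
    """
    try:
        arch_id = int(value)
    except (TypeError, ValueError):
        raise ValueError(f"BLIS_ARCH_TYPE must be an integer, got {value!r}")
    # Condition (1): within the defined range 0 .. num_archs - 1.
    if not 0 <= arch_id < num_archs:
        raise ValueError(f"arch id {arch_id} is outside 0..{num_archs - 1}")
    # Condition (2): the subconfiguration is compiled into the library.
    if arch_id not in compiled_ids:
        raise ValueError(f"arch id {arch_id} is not compiled into this build")
    return arch_id


def arch_override_env(arch_id, base=None, debug=True):
    """Return a copy of `base` with BLIS_ARCH_TYPE (and BLIS_ARCH_DEBUG) set."""
    env = dict(base or {})
    env["BLIS_ARCH_TYPE"] = str(arch_id)
    if debug:
        env["BLIS_ARCH_DEBUG"] = "1"  # ask BLIS to report the selection
    return env
```

For example, arch_override_env(check_arch_type("3", 10, {1, 3})) yields an environment dict that can be passed to subprocess.run; the numbers 3, 10, and {1, 3} are placeholders, not real arch_t ids.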
Hopefully this feature, as implemented, is satisfactory for your purposes. Please test it out and pass along your feedback.
I appreciate the detail you included in your initial issue post and follow-up messages. Ultimately, this is what caught my attention and prompted me to spend part of my weekend on this. Happy early Hanukkah/Christmas/Festivus/birthday/whatever. :)
Also #351 was merged, so you can update
https://github.com/flame/blis/blob/2d8ec164e7ae4f0c461c27309dc1f5d1966eb003/frame/base/bli_arch.c#L78
Thanks for this reminder. I folded that change into 2a0682f.