mpir
mpir copied to clipboard
Building for Skylake with no AVX support (e.g G4400 or G4500) fails.
Build for UNIX on Skylake (e.g. G4400 or G4500) fails because of AVX instruction marked as "black magic" presence in add_n.as
and sub_n.as
. Commenting out such an instruction makes no changes in tests passing.
Are there some explanation what kind of "black magic" is this?
Because such a "black magic" block the library from functioning on entire set of CPUs, maybe consider commenting it and moving such assemblies to AVX directory? (I could make a pull-request)
I sincerely doubt that the code is doing the right thing with instructions commented out. In fact, I would say this is absolutely impossible.
Skylake is supposed to have AVX, but Intel continues to insist on selling cut down versions of their processors under the Pentium brand with various features missing.
We certainly can't remove AVX instructions from Skylake code since AVX makes a big difference to performance on proper server chips.
I recommend building your MPIR for a different processor that doesn't support AVX.
If Intel happened to use a different model and family number for this particular chip, we could work around it easily enough. But unless someone wants to write AVX detection code and do all the rearranging necessary to support this properly on a per CPU basis, it really has to stay the way it is.
No, I actually meant commenting out two lines commented as "black magic" (https://github.com/wbhart/mpir/blob/55fe6a9f52ca532a611a89f67186ed915bbf1123/mpn/x86_64/skylake/add_n.as#L56 and https://github.com/wbhart/mpir/blob/55fe6a9f52ca532a611a89f67186ed915bbf1123/mpn/x86_64/skylake/sub_n.as#L56), because they persist in non-AVX assemblies. All the AVX-enabled assemblies are inside the avx directory, which is perfectly fine.
These ones, I mentioned, are just not having any effect in these "black magic" instructions.
Ok, I see what you are saying. As the AVX registers are not used elsewhere in the code, they are just spurious.
When I get a chance I'll look into it carefully. It's ages since I looked at this assembly code, so it might take some time for me to check it. It'll also need testing for hours with the "try" test program.
I don't recognise these but I'd guess they are a very specific bit-length noop or something like that so there would only be a potential speed issue if removed. Maybe a same length single instruction non avx noop can be found?
We had a G4400 user having problems with the Skylake non-AVX build. GC worked fine for him, as did Broadwell non-AVX so we will be switching to using the latter for Skylake non-AVX.
I got these problems as well. Thinking of nop-ing same amount of bits.
On 13 Jun 2020, at 08:39, wjblanke [email protected] wrote:
We had a G4400 user having problems with the Skylake non-AVX build. GC worked fine for him, as did Broadwell non-AVX so we will be switching to using the latter for Skylake non-AVX.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/wbhart/mpir/issues/274#issuecomment-643573688, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAVDIJDHEGAEVIA7SVUM5BDRWMGHJANCNFSM4H6QS73Q.
vpblendd is copying DWORDs from 2nd operand to 1st operand based on which bits are set in 3rd (immediate) operand.
Note that the 2nd operand (so called source operand) is the combination of the two bolded operands below.
vpblendd YMM0, **YMM0**, **YMM0**, 0
So in fact, the instruction does exactly nothing from an ISA standpoint.
It appears that the Jens Nurman who is one of the authors of this file (add_n.s) in the copyright at the top discovered that there is a way to hint to Skylake processors to use port 7 which results in a measurable latency speedup which he documented in this post:
https://www.agner.org/optimize/blog/read.php?i=415#796
Unfortunately my understanding of how this works ends there so it appears that the "black magic" admonition is well deserved.
However, for those looking for a workaround because they are on Skylake but don't have AVX, removing the vpblendd
line probably isn't going to produce any correctness bugs.