mpir icon indicating copy to clipboard operation
mpir copied to clipboard

Building for Skylake with no AVX support (e.g G4400 or G4500) fails.

Open nemothenoone opened this issue 5 years ago • 7 comments

Build for UNIX on Skylake (e.g. G4400 or G4500) fails because of AVX instruction marked as "black magic" presence in add_n.as and sub_n.as. Commenting out such an instruction makes no changes in tests passing.

Are there some explanation what kind of "black magic" is this?

Because such a "black magic" block the library from functioning on entire set of CPUs, maybe consider commenting it and moving such assemblies to AVX directory? (I could make a pull-request)

nemothenoone avatar Jul 05 '19 23:07 nemothenoone

I sincerely doubt that the code is doing the right thing with instructions commented out. In fact, I would say this is absolutely impossible.

Skylake is supposed to have AVX, but Intel continues to insist on selling cut down versions of their processors under the Pentium brand with various features missing.

We certainly can't remove AVX instructions from Skylake code since AVX makes a big difference to performance on proper server chips.

I recommend building your MPIR for a different processor that doesn't support AVX.

If Intel happened to use a different model and family number for this particular chip, we could work around it easily enough. But unless someone wants to write AVX detection code and do all the rearranging necessary to support this properly on a per CPU basis, it really has to stay the way it is.

wbhart avatar Jul 06 '19 00:07 wbhart

No, I actually meant commenting out two lines commented as "black magic" (https://github.com/wbhart/mpir/blob/55fe6a9f52ca532a611a89f67186ed915bbf1123/mpn/x86_64/skylake/add_n.as#L56 and https://github.com/wbhart/mpir/blob/55fe6a9f52ca532a611a89f67186ed915bbf1123/mpn/x86_64/skylake/sub_n.as#L56), because they persist in non-AVX assemblies. All the AVX-enabled assemblies are inside the avx directory, which is perfectly fine.

These ones, I mentioned, are just not having any effect in these "black magic" instructions.

nemothenoone avatar Jul 06 '19 00:07 nemothenoone

Ok, I see what you are saying. As the AVX registers are not used elsewhere in the code, they are just spurious.

When I get a chance I'll look into it carefully. It's ages since I looked at this assembly code, so it might take some time for me to check it. It'll also need testing for hours with the "try" test program.

wbhart avatar Jul 06 '19 01:07 wbhart

I don't recognise these but I'd guess they are a very specific bit-length noop or something like that so there would only be a potential speed issue if removed. Maybe a same length single instruction non avx noop can be found?

alexjbest avatar Jul 09 '19 05:07 alexjbest

We had a G4400 user having problems with the Skylake non-AVX build. GC worked fine for him, as did Broadwell non-AVX so we will be switching to using the latter for Skylake non-AVX.

wjblanke avatar Jun 13 '20 05:06 wjblanke

I got these problems as well. Thinking of nop-ing same amount of bits.

On 13 Jun 2020, at 08:39, wjblanke [email protected] wrote:

We had a G4400 user having problems with the Skylake non-AVX build. GC worked fine for him, as did Broadwell non-AVX so we will be switching to using the latter for Skylake non-AVX.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/wbhart/mpir/issues/274#issuecomment-643573688, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAVDIJDHEGAEVIA7SVUM5BDRWMGHJANCNFSM4H6QS73Q.

nemothenoone avatar Jun 13 '20 06:06 nemothenoone

vpblendd is copying DWORDs from 2nd operand to 1st operand based on which bits are set in 3rd (immediate) operand.

Note that the 2nd operand (so called source operand) is the combination of the two bolded operands below.

vpblendd YMM0, **YMM0**, **YMM0**, 0

So in fact, the instruction does exactly nothing from an ISA standpoint.

It appears that the Jens Nurman who is one of the authors of this file (add_n.s) in the copyright at the top discovered that there is a way to hint to Skylake processors to use port 7 which results in a measurable latency speedup which he documented in this post:

https://www.agner.org/optimize/blog/read.php?i=415#796

Unfortunately my understanding of how this works ends there so it appears that the "black magic" admonition is well deserved.

However, for those looking for a workaround because they are on Skylake but don't have AVX, removing the vpblendd line probably isn't going to produce any correctness bugs.

hocheung20 avatar Aug 23 '20 20:08 hocheung20