simde icon indicating copy to clipboard operation
simde copied to clipboard

Elbrus (e2k) architecture support

Open makise-homura opened this issue 3 years ago • 18 comments

This pull request introduces support of Elbrus hardware platform (which is based on Russian Elbrus CPU family) with its native lcc (eLbrus Compiler Collection) compiler.

This is linked to the corresponding PR for obs-studio.

ninja test has been run after building, all tests passed, both on x86_64 and e2k.

There is no ARM instructions support for now, because it looks too hard to make current implementations compile normally on e2k; but it may be introduced in following PRs, if needed.

makise-homura avatar Feb 01 '21 18:02 makise-homura

This is awesome, thank you! I'm not familiar with the architecture, but I'd like to support it as best we can. I'm willing to merge more or less as-is, but maintenance is going to be a bit tricky…

Is it be possible to add a CI build, even if all we can do is cross-compile the tests from x86? If lcc can be installed on an x86 Linux machine it should be possible to add something to the development container as well. For what it's worth I'm happy to help turn generic installation instructions into a CI job.

Even better, of course, would be the ability to actually run the tests. Since no CI providers I'm aware of support e2k, this would likely mean an emulator; I don't see anything about qemu support for e2k, but maybe there is an out-of-tree patch we could use or something? Again, I'm happy to help get this integrated into our CI if possible.

nemequ avatar Feb 01 '21 19:02 nemequ

Thanks! It's nice to see that project maintainer is interested in such a PR.

Unfortunately, there are some problems that may cause trouble for CI. First, there is no adequate e2k emulator (it is WIP, but may take a lot of time until it will be ready). There is instruction-precise simulator though, but it is not available publicly, since it is considered an engineering tool, and even if it was available, it is too slow to run any CPU-heavy stuff in it, like building or testing (it's about 1000 times slower than an actual hardware). There is also cross compiler, but it is available on request only (lcc compilers, both native and cross ones, have EDG frontend, which is proprietary and can't be freely distributed AFAIK). But instead we have three publicly available E2K machines, which can be accessed by users if they provide desired username and public SSH key (still no root access though). If it is enough to setup CI, then I'd be happy to provide access (so you may set up CI not just for building, but also for running tests).

Regarding access, you may contact me through Discord (makise-homura#8793) or join Telegram group which is a discussion related to mentioned SSH-accessible machines (it is primarily in Russian language, but most of us can speak back in English if someone is asking there for something in English). If either one isn't an option for you, you may suggest another option instead I guess.

makise-homura avatar Feb 03 '21 00:02 makise-homura

I pushed recently some other changes enabling OpenMP (which is implicitly enabled with LCC without any option like -fopenmp-simd) and getting rid of remaining warnings. Still I have a few last ones:

/tmp/lcc_MouN4b.s: Assembler messages:
/tmp/lcc_MouN4b.s:65716: Warning: use of 'psllqh' with count > 15 leads to a zero result
/tmp/lcc_MouN4b.s:66745: Warning: use of 'psrlql' with count > 15 leads to a zero result
/tmp/lcc_60Ibsd.s: Assembler messages:
/tmp/lcc_60Ibsd.s:66815: Warning: use of 'psllqh' with count > 15 leads to a zero result
/tmp/lcc_60Ibsd.s:67846: Warning: use of 'psrlql' with count > 15 leads to a zero result

But they are probably low-level ones raised when assembler code is packed into VLIW word. Don't know what to do with them, and also have no idea if I should take them into account. Looks like kind of overflowed shifts, but I'm not sure.

makise-homura avatar Feb 03 '21 03:02 makise-homura

Thanks! It's nice to see that project maintainer is interested in such a PR.

Unfortunately, there are some problems that may cause trouble for CI. First, there is no adequate e2k emulator (it is WIP, but may take a lot of time until it will be ready). There is instruction-precise simulator though, but it is not available publicly, since it is considered an engineering tool, and even if it was available, it is too slow to run any CPU-heavy stuff in it, like building or testing (it's about 1000 times slower than an actual hardware). There is also cross compiler, but it is available on request only (lcc compilers, both native and cross ones, have EDG frontend, which is proprietary and can't be freely distributed AFAIK). But instead we have three publicly available E2K machines, which can be accessed by users if they provide desired username and public SSH key (still no root access though). If it is enough to setup CI, then I'd be happy to provide access (so you may set up CI not just for building, but also for running tests).

Regarding access, you may contact me through Discord (makise-homura#8793) or join Telegram group which is a discussion related to mentioned SSH-accessible machines (it is primarily in Russian language, but most of us can speak back in English if someone is asking there for something in English). If either one isn't an option for you, you may suggest another option instead I guess.

Okay, it sounds like requesting access via SSH is the right way to go for now; I'll do that in a bit. That way at least I'll be able to do some periodic builds and debugging. I'm definitely interested in adding a CI job, but that would require either setting up CI servers on e2k hardware or a freely distributable emulator + (cross-)compiler. Hopefully one day :)

I pushed recently some other changes enabling OpenMP (which is implicitly enabled with LCC without any option like -fopenmp-simd) and getting rid of remaining warnings. Still I have a few last ones:

That could be bugs in SIMDe. Intel tends to accept any 8-bit value for a lot of functions, so we have to handle things like shifting 16-bit lanes by 16+ bits, but IIRC that is UB. If we need to add some checks to do something like (count < 15) ? (a << count) : 0 in the portable code we can, but it sounds like this may be in the _mm_slli_epi16 path, in which case lcc should really handle that (since it is implementing Intel's API), though we could of course add a special case to work around the issue using a SIMDE_BUG_LCC_... macro.

nemequ avatar Feb 03 '21 18:02 nemequ

If we need to add some checks to do something like (count < 15) ? (a << count) : 0 in the portable code we can, but it sounds like this may be in the _mm_slli_epi16 path, in which case lcc should really handle that (since it is implementing Intel's API), though we could of course add a special case to work around the issue using a SIMDE_BUG_LCC_... macro.

Well, culprits are simde_mm_bslli_si128(a, 19); at line 908, and simde_mm_bsrli_si128(a, 19); at line 955 of test/x86/sse2.c when building sse2-native-c and sse2-native-cpp tests. Finally they end up in a call to _mm_slli_si128 and _mm_srli_si128 correspondingly. There is no way to check it inside a #define, because there is no way to use #if inside a #define. Also conditional expression like ((imm8 > 15) ? _mm_setzero_si128() : _mm_srli_si128(a, imm8)) is never optimized to exclude unreachable branch. So I tried to implement it as an inline function, like:

  #if defined(SIMDE_BUG_LCC_WARNING_ON_SHIFTS)
    inline __attribute__((always_inline)) simde__m128i
    simde_lcc_guard_mm_slli_si128(simde__m128i a, const int imm8)
    SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) {
      if (imm8 > 15) {
        return _mm_setzero_si128();
      } else {
        return _mm_slli_si128(a, imm8);
      }
    }
    #define simde_mm_bslli_si128(a, imm8) simde_lcc_guard_mm_slli_si128(a, imm8)
  #else
    #define simde_mm_bslli_si128(a, imm8) _mm_slli_si128(a, imm8)
  #endif

But, if I use static for simde_lcc_guard_mm_slli_si128, no inlining occurs, and compilation fails due to imm8 being a variable inside a function, and not a constant. If no static is used, then there's a warning like an entity with internal linkage cannot be referenced within an inline function with external linkage, and I presume this is not what we want (but still code compiles and produce no warnings).

Have you any ideas what to do with that? Actually, if our goal is warning-less build, we can just suppress the -Wstatic-reference-in-c99-inline-function warning, but it feels like a very dirty hack here.

makise-homura avatar Feb 08 '21 21:02 makise-homura

I'm merging some of this as 349da2b621f275e5ebc83fa6590235240821779a, 093b2c578cba4a8591de6a611818b3fa48d07430, 24ddeba55cf3bbfb014e79ea961ec201be1223ff. I'll publish a wip/e2k branch in the SIMDe repository with your changes rebased.

nemequ avatar Feb 09 '21 07:02 nemequ

You mean you see it compiled in the output? What compiler?

Yes, it was in LCC's assembler output.

Or is the problem that you still get an error because of the value passed to the imm8 parameter, even though that path isn't taken? If that's the case, what you can do is pass imm8 & 15

Yes! That worked. That was quite a clever hack. Pushed it in 42ecd54.

Also, I suggest something like (imm8 & ~15) instead of (imm8 > 15) because of negatives.

Hm, that may be an idea. Rewritten my 42ecd54 into 8b0b8fa, sorry for that previous commit.

makise-homura avatar Feb 10 '21 20:02 makise-homura

I've been playing around a bit with this, and I have a pretty small test case for the reduced-alignment issue:

#include <stdint.h>
#include <stdio.h>

typedef union {
  int8_t i8 __attribute__((__vector_size__(32)));
} simde__m256i;

simde__m256i
simde_mm256_set1_epi8(int8_t a) {
  simde__m256i r;
  for (size_t i = 0 ; i < sizeof(r.i8) / sizeof(r.i8[0]) ; i++) {
    r.i8[i] = a;
  }
  return r;
}

int main(void) {
  simde__m256i a = simde_mm256_set1_epi8(42);

  return 0;
}

This is interesting because we're not actually requesting a specific alignment anywhere; LCC aligns 256-bit vectors to 32-byte boundaries by default. I would definitely classify that as an LCC bug.

The other time I ran into this diagnostic today was when I looked into implementing a maximum alignment on LCC (not necessary, it turns out; it was just my first guess for the cause of this). In that case, if you specify the __attribute__((__aligned__(N))) with an N less than the default alignment for that type LCC will emit the alignment-reduced diagnostic.

IMHO it's not really appropriate to emit this as GCC's documentation says: "When used on a struct, or struct member, the aligned attribute can only increase the alignment; in order to decrease it, the packed attribute must be specified as well." It's not unreasonable to use it that way, as SIMDe does. However, I can see a reasonable argument for having it as long as it's off-by-default (which it is).

The real problem is that AFAICT there is no way to disable it in code; no diagnostic number is provided so I have no idea what to suppress. Of course, since lcc doesn't offer a way to pop the warning stack we would be disabling the warning in any code which uses SIMDe, too, which is obviously not optimal… if there were a way to pop the stack I'd be a lot more comfortable just disabling the warning for SIMDe.

I also took a look at the XOP performance. This was a bit more work than I'd hoped since google-benchmark doesn't work on LCC… there are a couple of warnings to suppress (IIRC one in google-benchmark and another in google-test), IIRC about unused functions), but then there is another problem I couldn't find a quick work-around for. However, I tried using Hayai it it worked well. So far I've only tested _mm_permute2_ps, but the "native" version is faster. Here is the code:

#include <sys/random.h>

#define SIMDE_NO_NATIVE
#include <simde/x86/xop.h>

#include <x86intrin.h>

class RandomVectorsFixture
    :   public ::hayai::Fixture
{
public:
    virtual void SetUp()
    {   
        getrandom(&a, sizeof(a), GRND_RANDOM);
        getrandom(&b, sizeof(b), GRND_RANDOM);
        getrandom(&c, sizeof(c), GRND_RANDOM);
    }

    virtual void TearDown()
    { }

    union {
      __m128 native;
      simde__m128 simde;
    } a;
    union {
      __m128 native;
      simde__m128 simde;
    } b;
    union {
      __m128i native;
      simde__m128i simde;
    } c;
    union {
      __m128 native;
      simde__m128 simde;
    } r;
};

BENCHMARK_F(RandomVectorsFixture, test_simde_mm_permute2_ps, 1000, 1) {
  r.simde = simde_mm_permute2_ps(a.simde, b.simde, c.simde, 2);
}

#pragma diag_suppress 1444

BENCHMARK_F(RandomVectorsFixture, test_mm_permute2_ps, 1000, 1) {
  r.native = _mm_permute2_ps(a.native, b.native, c.native, 2);
}

And the results:

l++ -o bench bench.cpp -L../hayai/build/src/ -I../hayai/src/ -I../simde -lhayai_main
[==========] Running 2 benchmarks.
[ RUN      ] RandomVectorsFixture.test_simde_mm_permute2_ps (1000 runs, 1 iteration per run)
[     DONE ] RandomVectorsFixture.test_simde_mm_permute2_ps (0.917222 ms)
[   RUNS   ]        Average time: 0.917 us (~0.058 us)
                    Fastest time: 0.564 us (-0.353 us / -38.510 %)
                    Slowest time: 1.316 us (+0.399 us / +43.477 %)
                     Median time: 0.923 us (1st quartile: 0.908 us | 3rd quartile: 0.932 us)

             Average performance: 1090248.59849 runs/s
                Best performance: 1773049.64539 runs/s (+682801.04690 runs/s / +62.62801 %)
               Worst performance: 759878.41945 runs/s (-330370.17903 runs/s / -30.30228 %)
              Median performance: 1083423.61863 runs/s (1st quartile: 1101321.58590 | 3rd quartile: 1072961.37339)

[ITERATIONS]        Average time: 0.917 us (~0.058 us)
                    Fastest time: 0.564 us (-0.353 us / -38.510 %)
                    Slowest time: 1.316 us (+0.399 us / +43.477 %)
                     Median time: 0.923 us (1st quartile: 0.908 us | 3rd quartile: 0.932 us)

             Average performance: 1090248.59849 iterations/s
                Best performance: 1773049.64539 iterations/s (+682801.04690 iterations/s / +62.62801 %)
               Worst performance: 759878.41945 iterations/s (-330370.17903 iterations/s / -30.30228 %)
              Median performance: 1083423.61863 iterations/s (1st quartile: 1101321.58590 | 3rd quartile: 1072961.37339)
[ RUN      ] RandomVectorsFixture.test_mm_permute2_ps (1000 runs, 1 iteration per run)
[     DONE ] RandomVectorsFixture.test_mm_permute2_ps (0.461097 ms)
[   RUNS   ]        Average time: 0.461 us (~0.046 us)
                    Fastest time: 0.170 us (-0.291 us / -63.131 %)
                    Slowest time: 0.876 us (+0.415 us / +89.982 %)
                     Median time: 0.467 us (1st quartile: 0.451 us | 3rd quartile: 0.467 us)

             Average performance: 2168741.06750 runs/s
                Best performance: 5882352.94118 runs/s (+3713611.87368 runs/s / +171.23353 %)
               Worst performance: 1141552.51142 runs/s (-1027188.55608 runs/s / -47.36336 %)
              Median performance: 2141327.62313 runs/s (1st quartile: 2217294.90022 | 3rd quartile: 2141327.62313)

[ITERATIONS]        Average time: 0.461 us (~0.046 us)
                    Fastest time: 0.170 us (-0.291 us / -63.131 %)
                    Slowest time: 0.876 us (+0.415 us / +89.982 %)
                     Median time: 0.467 us (1st quartile: 0.451 us | 3rd quartile: 0.467 us)

             Average performance: 2168741.06750 iterations/s
                Best performance: 5882352.94118 iterations/s (+3713611.87368 iterations/s / +171.23353 %)
               Worst performance: 1141552.51142 iterations/s (-1027188.55608 iterations/s / -47.36336 %)
              Median performance: 2141327.62313 iterations/s (1st quartile: 2217294.90022 | 3rd quartile: 2141327.62313)
[==========] Ran 2 benchmarks.

I'd like to test each function (shouldn't be hard to modify that code), but I suspect it's going to be better to just ignore that warning.

nemequ avatar Feb 16 '21 14:02 nemequ

0366dab69680125218a5e604e8e8d74ed346b0ff, e38fe50f5b1ede9f4a247196d414b899d4ba3a9f, and ad8c7e0723fb92d73324e5dd799ccdd41051251a move this along pretty well. With those patches in place I'm able to get to the point where the compilation fails due to the inefficient implementations. I'll try to get working on testing each of those tomorrow to make sure they are all faster than SIMDe's implementations.

e38fe50f5b1ede9f4a247196d414b899d4ba3a9f is an excellent example of why I don't trust __GNUC__ and friends on non-GCC compilers (clang is the exception, but they only claim compatibility with GCC 4.2.1). I didn't realize that was the problem until I saw the names you chose for the bugs, but now it makes a lot of sense.

nemequ avatar Feb 17 '21 01:02 nemequ

Sorry for long wait, I was a bit busy with work this week, so had any time to deal with SIMDe just today. I've rebased my changes on wip/e2k, seems to be better now.

The only thing remaining to do for now, I guess, is to do something with reduced alignment warnings (remove -Wno-reduced-alignment from build system). I've already filed a bug report to LCC developers, I hope they'll deal somehow with it, at least they'll say how to get it over with it gracefully with current version of LCC. Probably I'll file another bug reports for SIMDE_BUG_LCC_TOO_STRICT_VECTOR_SHIFTS_AND_COMPARES, SIMDE_BUG_LCC_XOP_MISSING, SIMDE_BUG_LCC_WARNING_ON_SHIFTS, SIMDE_BUG_LCC_FMA_WRONG_RESULT, and SIMDE_BUG_LCC_AVX_NO_LOAD_STORE_U2, for we'll be able to avoid these hacks in SIMDe when some future version of LCC is used.

Also I hope you approve the way I avoid deprecation warnings in 9ff51ed (with a tiny fix in 7791129), and give an advice on what to do with OPENMP_SIMD.

makise-homura avatar Feb 21 '21 01:02 makise-homura

No worries, I didn't have internet (except on my phone) for a while due to a move, so I wouldn't have been able to review this anyways.

I think I do like your idea for the deprecated diagnostics. I'm conflicted since I'm worried about someone doing something like

#pragma diag_suppress 1215,1444
// ...
#include "path/to/simde/x86/sse2.h"

call_deprecated_function();

But I think it's the best we're going to do for LCC (and a big improvement over just disabling them and leaving them that way), and unfortunately I don't think LCC is going to support push/pop any time soon.

and give an advice on what to do with OPENMP_SIMD.

You mean whether to just #define SIMDE_ENABLE_OPENMP #if defined(HEDLEY_MCST_LCC_VERSION)? I don't see a reason not to do so, do you? It might be a good idea to replace those !defined(SIMDE_ENABLE_OPENMP) checks in simde-common.h with !defined(SIMDE_ENABLE_OPENMP) && !defined(SIMDE_DISABLE_OPENMP) to give people the option of turning it off, but honestly I don't think that is necessary.

It seems like this is basically ready to merge, right? I see a few minor things I'd like to change, but nothing major; I'll just tweak those when I merge everything.

nemequ avatar Feb 23 '21 03:02 nemequ

I'm worried about someone doing something like

Yes, it is the exact case I'm expecting to be affected by using diag_suppress/diag_default. But I think, one who uses diag_suppress in some way in the start of file, might expect there can be diag_default somewhere, if he still get the suppressed warnings. And obviously it's a bit weird way (albeit legit for a "dirty hack") to suppress possible warnings in a single file rather than the whole build; one should use it for a piece of file that is problematic, not for the whole file. So the way I used it, is the best one of all possible ways in the case of not having push/pop, I guess.

It might be a good idea to replace those !defined(SIMDE_ENABLE_OPENMP) checks in simde-common.h with !defined(SIMDE_ENABLE_OPENMP) && !defined(SIMDE_DISABLE_OPENMP) to give people the option of turning it off

Yes, I did like this. So now OpenMP SIMD is enabled by default, and can be disabled if -DSIMDE_DISABLE_OPENMP is given. I edited README according to this also.

It seems like this is basically ready to merge, right?

Yes, now it's really ready to merge I guess. Today's commits fixed the every single thing remaining until E2K support could be considered implemented, so unless you have any change requests for it, I think these changes can land into master.

BTW, github still shows me that there's one more unresolved change requested, but I can't see it here.

makise-homura avatar Feb 25 '21 02:02 makise-homura

Sorry it took so long, but this is almost done. As of a few days ago everything (416c2437a35c83077eb004d5601cee7d9e3b2840 and 269db2ab456a7cb01ef606d732b718190d0a0f78) works on e2k, but the tests generate a lot of those -Wreduced-alignment warnings. For now you can get rid of them with -Wno-reduced-alignment and everything will work as expected.

I slightly changed how the calls to "deprecated" functions were made, mostly just wrapping them in statement exprs to squash the error, and I moved some stuff around a bit.

nemequ avatar Apr 30 '21 15:04 nemequ

Oops, sorry for long wait, I've totally forgot about this PR, and only few days ago I've been reminded of it. Actually I was like 'it is being merged, so it's all ok with it', and while I've had no further notifications, I forgot about it. Yeah, I guess -Wreduced-alignment is a specific of e2k, that might be fixed in some new versions of compiler (1.25.19 seems to have this already fixed, but I'm not sure of completeness), so it looks not a way to fix in SIMDe itself. Should then I fix something else, or we may proceed to final merge?

makise-homura avatar Dec 16 '21 16:12 makise-homura

bump! cc @nemequ @makise-homura

a1batross avatar Feb 22 '22 05:02 a1batross

Sorry, I've been away from SIMDe for a while but I'm trying to get back into it now. I'll take a look at this over the weekend.

My memory is a bit foggy on the details, but I seem to remember that everything except for one set of changes for one issue (-Wreduced-alignment false positive?) has been merged... if that's not necessary with the latest version of the compiler my inclination is to not work around it in SIMDe and just ask people to use the latest compiler, especially if it's just to silence a warning and not required to get the tests to compile and pass. For a niche architecture / compiler I think that's okay.

nemequ avatar Feb 23 '22 00:02 nemequ

Another bump :)

@nemequ @makise-homura

a1batross avatar Mar 04 '24 13:03 a1batross

Can someone point to place(s) where the supported instructions in the various e2k-v1/2/3/4/... are listed?

  • for Lintel (full emulation of a system, booting into standard x86 or x86-64 operating system like Windows)
  • for RTC (emulation suitable for x86 or x86-64 executables, booting into Elbrus Linux)

What I see in various other tickets and websites is conflicting, e.g. is AVX supported or not, which models support what, etc.

Torinde avatar Apr 09 '24 20:04 Torinde