
slow generic implementation

Open loveshack opened this issue 7 years ago • 115 comments

I was assuming that BLIS is generally better than reference BLAS, so substituting BLIS for the latter in the OS packages I'm working on would always be sensible. However, I found BLIS is more than two times slower for medium-sized dgemm on x86_64/RHEL7 for a "generic" build compared with the system reference blas package (which should be built with -O2 -mtune=generic, not -O3). I can't usefully test an architecture without a tuned implementation, but I don't see any reason to think that would be much different, though I haven't looked into the gcc optimization.

Is that expected, or might it be something worth investigating?

loveshack avatar Sep 28 '18 14:09 loveshack

The generic implementation will have better cache behavior than netlib BLAS, but will also do packing which will slow things down for small and medium-sized matrices. It's not totally clear from your comment whether or not this is the configuration that BLIS is using, please correct me if I am mistaken.

devinamatthews avatar Sep 28 '18 15:09 devinamatthews

@devinamatthews It may also be that Fortran is better than C :trollface:

jeffhammond avatar Sep 28 '18 15:09 jeffhammond

You wrote:

The generic implementation will have better cache behavior than netlib BLAS,

That's what I thought.

but will also do packing which will slow things down for small and medium-sized matrices.

but I hadn't considered that. At what sort of size would that stop hurting (and I wonder if it could usefully be adaptive)? I tried 2000×2000 to run a few goes in a reasonable time. I've just tried 4000 square, which looks about the same.

It's not totally clear from your comment whether or not this is the configuration that BLIS is using, please correct me if I am mistaken.

I built a BLIS dynamic library with default flags and the generic target. I took the openblas dgemm benchmark (which actually linked against openblas), and ran it with either BLIS or reference BLAS LD_PRELOADed. Is that clearer?

I could examine the compilation results and profiles at some stage when I have more time, but thought it was worth asking the experts first -- thanks.

loveshack avatar Sep 28 '18 17:09 loveshack

You wrote:

@devinamatthews It may also be that Fortran is better than C :trollface:

Of course, but a sometime GNU Fortran maintainer knows how :-/.

loveshack avatar Sep 28 '18 17:09 loveshack

OK, I guess I'm not really clear why you care about the performance of the BLIS generic configuration. Even with cache blocking it will never be "high performance".

devinamatthews avatar Sep 28 '18 17:09 devinamatthews

At least it is true that the builds on non-x86_64 architectures are slow due to the slow tests. https://launchpad.net/~lumin0/+archive/ubuntu/ppa/+sourcepub/9451410/+listing-archive-extra Click on the builds and you'll see the time elapsed for the whole compile+test process.

cdluminate avatar Sep 29 '18 01:09 cdluminate

@cdluminate I took a look at some of the build times, as you suggest. It is true that the build time is excessive for your s390x build, for example (50 minutes, if I'm reading the output correctly). Much of that can be attributed to the fact that we do not have optimized kernels for every architecture. s390x is one of those unoptimized architectures. Still, this does feel a bit slow.

(Digression: If you would like to reduce the total build time, I recommend running the "fast" version of the BLIS testsuite, which is almost surely where most of the time is being spent. Right now, make test triggers the BLAS test drivers + the full BLIS testsuite. You can instead use make check, which runs the BLAS test drivers + a shortened version of the BLIS testsuite.)

However, strangely, your amd64 build still requires almost 19 minutes. That is still quite long. I just did a quick test on my 3.6GHz Broadwell. Targeting x86_64 at configure-time, I found that:

  • The library build itself takes only 55 seconds.
  • The full BLIS testsuite (build and run) takes about 3 minutes.
  • The BLAS test drivers (build and run) add another 10 seconds. Note that no multithreading was used during the execution of any of the BLAS test drivers or BLIS testsuite, though all compilation was done with the -j4 argument to make.

Perhaps your build hardware for the amd64 build is old? Or maybe oversubscribed?

An unrelated question: I assume that the name of your amd64 build refers generically to "the build for x86_64 microarchitectures," as it does in the Gentoo Linux world, and not AMD-specific hardware. Am I correct?

fgvanzee avatar Sep 29 '18 21:09 fgvanzee

Debian tries to help upstream spot problems, not to build software as fast as possible. In order to build a reliable Linux distribution it's not a good idea to skip too many tests. Hence the full testsuite is preferred for packaging.

As for the amd64 build, my Intel i5-7440HQ runs the full test quite fast too. It's possible that Ubuntu uses old x86-64 machines in their build farm, but I'm not sure "old hardware" is the cause of the 20-minute build time.

Debian's term amd64 is always equivalent to x86_64, no matter what brand the physical CPU is.

cdluminate avatar Sep 30 '18 01:09 cdluminate

Debian tries to help upstream spot problems, not to build software as fast as possible. In order to build a reliable Linux distribution it's not a good idea to skip too many tests. Hence the full testsuite is preferred for packaging.

That's fine. I often prefer the full testsuite in my own development, too, but I thought I would offer the faster alternative since many people in the past have been happy to skip tests that are nearly identical to each other if it saves them 5-10x in time.

As for the amd64 build, my Intel i5-7440HQ runs the full test quite fast too. It's possible that Ubuntu uses old x86-64 machines in their build farm, but I'm not sure "old hardware" is the cause of the 20-minute build time.

I'm glad you also see more normal build times. I see no need to worry, then, about the 20 minute build time on the Debian build hardware.

Debian's term amd64 is always equivalent to x86_64, no matter what brand the physical CPU is.

Good, that's what I thought/expected. Thanks.

fgvanzee avatar Sep 30 '18 01:09 fgvanzee

I'm glad you also see more normal build times. I see no need to worry, then, about the 20 minute build time on the Debian build hardware.

Just nitpicking: Launchpad, or the PPA, is Ubuntu's infrastructure, supported by the company Canonical. Debian is supported by an independent community that, in theory, doesn't rely on Ubuntu or Canonical.

The pages you see are not powered by Debian's build hardware. What I'm doing there is abusing Ubuntu's free build machines to build stuff on Ubuntu cosmic for testing Debian packages. (Ubuntu cosmic, i.e. Ubuntu 18.10, is very close to Debian unstable, so testing Debian packages on an Ubuntu machine sometimes makes sense.)

cdluminate avatar Sep 30 '18 01:09 cdluminate

Just nitpicking: ...

Unlike most people, I will almost never be bothered by nitpicking! I like and appreciate nuance. :) Thanks for those details.

fgvanzee avatar Sep 30 '18 01:09 fgvanzee

BTW, since I don't use Debian, I have to rely on people like you and @nschloe for your expertise on these topics (understanding how we fit into the Debian/Ubuntu universes). Thanks again.

fgvanzee avatar Sep 30 '18 01:09 fgvanzee

Field:

Next time a vendor offers to donate hardware, you might ask for a big SSD so you can set up a virtual machine for every Linux distro. Just a thought.

jeffhammond avatar Sep 30 '18 01:09 jeffhammond

@jeffhammond In principle, I agree with you. However, this is the sort of thing that is not as practical now that our group is so small. (It also doesn't help that maintaining machines in our department comes with a non-trivial amount of cost and of red tape.) Instead, I'm going to channel you circa 2010 and say, "we look forward to your patch." And by that I mean, "someone doing it for us."

fgvanzee avatar Sep 30 '18 19:09 fgvanzee

@loveshack Returning to the original question: I think one way to make the "generic" implementation faster would be to add a fully-unrolled branch and temporary storage of C to the kernel, e.g.:

...
if (m == MR && n == NR)
{
    // unroll all MR*NR FMAs into temporaries
}
else
{
    // as usual
}
...
// accumulate at the end instead of along the way

and arrange for the reference kernel to be compiled with architecture-appropriate flags. The second issue means that e.g. a configuration without an optimized kernel would possibly run faster because of auto-vectorization, but that the actual generic configuration will probably still be very slow because it gets very conservative compiler flags.
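As a concrete illustration of that branch (a hypothetical sketch, not actual BLIS code; the function name, MR=NR=4, and column-major packing conventions are my assumptions):

```c
#include <stddef.h>

#define MR 4
#define NR 4

/* Hypothetical reference dgemm microkernel sketch.
   A is packed MR x k (column p at a + p*MR), B is packed k x NR
   (row p at b + p*NR); C has row stride rs_c and column stride cs_c. */
static void dgemm_ukr_sketch(size_t m, size_t n, size_t k,
                             double alpha,
                             const double *restrict a,
                             const double *restrict b,
                             double beta,
                             double *restrict c, size_t rs_c, size_t cs_c)
{
    if (m == MR && n == NR)
    {
        /* Accumulate the full MR x NR tile in temporaries; with constant
           loop bounds the compiler can fully unroll the i/j loops, keep
           `ab` in registers, and auto-vectorize. */
        double ab[MR * NR] = { 0.0 };
        for (size_t p = 0; p < k; ++p)
            for (size_t j = 0; j < NR; ++j)
                for (size_t i = 0; i < MR; ++i)
                    ab[i + j * MR] += a[i + p * MR] * b[j + p * NR];

        /* Accumulate into C only at the end, instead of along the way. */
        for (size_t j = 0; j < NR; ++j)
            for (size_t i = 0; i < MR; ++i)
                c[i * rs_c + j * cs_c] =
                    beta * c[i * rs_c + j * cs_c] + alpha * ab[i + j * MR];
    }
    else
    {
        /* Edge case: as usual (elementwise loop over the m x n fringe). */
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < m; ++i)
            {
                double ab = 0.0;
                for (size_t p = 0; p < k; ++p)
                    ab += a[i + p * MR] * b[j + p * NR];
                c[i * rs_c + j * cs_c] =
                    beta * c[i * rs_c + j * cs_c] + alpha * ab;
            }
    }
}
```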

devinamatthews avatar Oct 01 '18 14:10 devinamatthews

You wrote:

Debian tries to help upstream spot problems, not to build software as fast as possible. In order to build a reliable Linux distribution it's not a good idea to skip too many tests. Hence the full testsuite is preferred for packaging.

For what it's worth, that's not what's normally done for Fedora. On the slower build platforms it would likely time out, and can perturb mass rebuilds considerably. I consider the "check" step in rpm builds basically a sanity check, especially as in cases like this you can't test a relevant range of micro-architectures. [The make check target was added for that, but I also test the Fortran interface with gfortran, rather than relying on the f2c'd versions.]

For Fedora, I don't care about build times unless they're pathological, especially as they're very variable on the build VMs.

Debian's term amd64 is always equivalent to x86_64, no matter what brand the physical CPU is.

[And for confusion, Fedora just uses x86_64 (which is probably less correct).]

loveshack avatar Oct 01 '18 15:10 loveshack

You wrote:

OK, I guess I'm not really clear why you care about the performance of the BLIS generic configuration. Even with cache blocking it will never be "high performance".

This is for OS packaging purposes. I assumed I could say that using BLIS would be strictly better than reference BLAS, i.e. that the reference blas package is redundant on any platform not supported by the blis or openblas packages (apart from compatibility tests).

loveshack avatar Oct 01 '18 15:10 loveshack

You wrote:

Field:

Next time a vendor offers to donate hardware, you might ask for a big SSD so you can set up a virtual machine for every Linux distro. Just a thought.

For what it's worth, I frequently spin up VMs with vagrant, which is mostly practical at least up to a cluster of three or so, on an 8GB/HDD laptop.

However, it's reasonable to leave distribution-specific work to packagers, as long as the basic build system doesn't put obstacles in the way, and I think we've already got the relevant hooks like xFLAGS. Thanks.

Also for what it's worth, I've tested rpm packaging for SuSE in the configurations supported by Fedora's copr as well as for the range of supported RHEL/Fedora targets, and my amd64 Debian desktop.

loveshack avatar Oct 01 '18 15:10 loveshack

You wrote:

@loveshack Returning to the original question: I think one way to make the "generic" implementation faster would be to add a fully-unrolled branch and temporary storage of C to the kernel, e.g.:

...
if (m == MR && n == NR)
{
    // unroll all MR*NR FMAs into temporaries
}
else
{
    // as usual
}
...
// accumulate at the end instead of along the way

and arrange for the reference kernel to be compiled with architecture-appropriate flags. The second issue means that e.g. a configuration without an optimized kernel would possibly run faster because of auto-vectorization, but that the actual generic configuration will probably still be very slow because it gets very conservative compiler flags.

I haven't had a chance to investigate further, but I did find that building generic with -march=native -Ofast -funroll-loops doesn't make a dramatic difference -- not that -march=native can be used for packaging anyhow. (Part of the reason I expected BLIS to do better is that the -O3 it uses enables vectorization -- though only SSE2 with generic tuning -- cf. the -O2 used for the reference blas package.) Then again, I've never understood why compilers do so badly on, say, matmul.

loveshack avatar Oct 03 '18 13:10 loveshack

@loveshack What architectures in particular are you having a problem with?

devinamatthews avatar Oct 03 '18 14:10 devinamatthews

and arrange for the reference kernel to be compiled with architecture-appropriate flags. The second issue means that e.g. a configuration without an optimized kernel would possibly run faster because of auto-vectorization, but that the actual generic configuration will probably still be very slow because it gets very conservative compiler flags.

@devinamatthews It's not clear from context if you were under the impression that reference kernels were not already compiled with architecture-specific flags, but indeed they are. (Or maybe you are referring to a different variety of flags than I am.) Either way, make V=1 would confirm.

Or did you mention architecture-specific flags because you knew that @loveshack could not use -march=native and the like for packaging purposes?

fgvanzee avatar Oct 03 '18 17:10 fgvanzee

@fgvanzee I was mostly talking about the actual generic configuration vs. the reference kernel being used in a particular configuration.

devinamatthews avatar Oct 03 '18 18:10 devinamatthews

@devinamatthews Ah, makes sense. Thanks for clarifying. Yeah, generic doesn't do jack except use -O3, which I'm guessing in our world doesn't do much either.

fgvanzee avatar Oct 03 '18 18:10 fgvanzee

It might be interesting to see if simd pragmas cause anything better to happen with the reference kernel. I’ve got a list of all of those, in addition to the obvious OpenMP one.
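For what that might look like (a hypothetical fragment, not the actual BLIS reference kernel): annotating an inner loop with the OpenMP simd pragma, which gcc and clang honor under -fopenmp-simd (and icc under -qopenmp-simd) without pulling in the OpenMP runtime, and which is silently ignored otherwise.

```c
#include <stddef.h>

/* Hypothetical fragment: an axpy-style inner loop of the kind found in
   reference kernels, annotated so the compiler may vectorize it even at
   lower optimization levels. Compile with -fopenmp-simd (gcc/clang);
   without that flag the pragma is ignored and the loop stays scalar. */
static void daxpy_ref(size_t n, double alpha,
                      const double *restrict x, double *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```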

jeffhammond avatar Oct 05 '18 02:10 jeffhammond

You wrote:

@loveshack What architectures in particular are you having a problem with?

The Fedora architectures that BLIS doesn't support are, I think, i686, ppc64, ppc64le, and s390x; there will be more in Debian. (OpenBLAS does all of those apart from ppc64, so we can at least use a free BLAS on most Fedora architectures.)

loveshack avatar Oct 05 '18 14:10 loveshack

You wrote:

@devinamatthews Ah, makes sense. Thanks for clarifying. Yeah, generic doesn't do jack except use -O3, which I'm guessing in our world doesn't do much either.

Yes, it doesn't make much difference experimentally (on x86_64), but you might expect it to help by including vectorization.

loveshack avatar Oct 05 '18 14:10 loveshack

You wrote:

It might be interesting to see if simd pragmas cause anything better to happen with the reference kernel. I’ve got a list of all of those, in addition to the obvious OpenMP one.

Yes, but I guess the first thing to do is to consult a detailed profile and gcc's optimization report. I'll have a look at it eventually, but I don't know whether results from x86_64 would be representative of other architectures I can't currently access. (I'll try to get on aarch64 and power8 at some stage.)

loveshack avatar Oct 05 '18 14:10 loveshack

i686, ppc64, ppc64le, and s390x

@loveshack For which of those architectures can we assume vectorization with the default flags?

devinamatthews avatar Oct 05 '18 17:10 devinamatthews

Yes, it doesn't make much difference experimentally (on x86_64), but you might expect it to help by including vectorization.

I might be willing to add such a flag or flags if you can recommend some that are relatively portable. And ideally, you would tell me the analogues of such flags on clang and icc, if applicable.

fgvanzee avatar Oct 06 '18 18:10 fgvanzee

@fgvanzee I would suggest:

  1. Changing the default MR and NR to 4x16, 4x8, 4x8, 4x4 (sdcz).
  2. Rewriting the reference gemm kernel to:
     a. be row-major,
     b. be fully unrolled in the k loop (this means you wouldn't be able to change MR/NR without writing a custom kernel, but that seems reasonable),
     c. use temporary variables for C, and
     d. use restrict.
  3. Adding configurations for whatever is missing for packaging (s390x, ppc64, etc.) to get at least baseline vectorization flags for the reference kernels.

Rationale: rewriting the reference kernel this way should allow for a reasonable degree of auto-vectorization given the right flags. The larger kernel sizes and row-major layout would allow for 128b and 256b vectorization with higher bandwidth from L1 than from L2. I measured up to a 6x increase in performance for AVX2 in a quick mock test.
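A sketch of what item 2 might look like for dgemm, under my own assumptions (hypothetical function name; MR=4, NR=8 per item 1; row-major C tile held in temporaries; restrict on all pointers; constant MR/NR bounds so the compiler fully unrolls the inner loops inside the k loop):

```c
#include <stddef.h>

enum { MR = 4, NR = 8 };  /* item 1: the suggested double-real tile */

/* Hypothetical row-major reference dgemm microkernel sketch.
   A packed MR x k (column p at a + p*MR), B packed k x NR (row p at
   b + p*NR), C row-major MR x NR with row stride ldc.
   The MR x NR tile lives in temporaries (item 2c); with MR/NR as
   compile-time constants the compiler can fully unroll the two inner
   loops inside the k loop (item 2b) and vectorize along the unit-stride
   NR dimension. */
static void dgemm_ref_rm(size_t k, double alpha,
                         const double *restrict a,
                         const double *restrict b,
                         double beta,
                         double *restrict c, size_t ldc)
{
    double t[MR][NR] = { { 0.0 } };

    for (size_t p = 0; p < k; ++p)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)   /* unit stride in B and t */
                t[i][j] += a[i + p * MR] * b[j + p * NR];

    /* Write C only once, at the end. */
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            c[i * ldc + j] = beta * c[i * ldc + j] + alpha * t[i][j];
}
```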

devinamatthews avatar Oct 06 '18 22:10 devinamatthews