deeplearning4j
AVX2 brings no performance improvement?
Hey guys,
we recently started using the avx2 backend (as opposed to the generic x86-64 one), and to our surprise we saw no performance improvement whatsoever; in fact avx2 seemed to be ever so slightly slower. At first we thought the problem was with our ML algorithms, i.e. that they could not really take advantage of any sort of SIMD. But we ran some tests and found that the issue might not be with our algorithms. Namely, we ran this unit test with and without the avx2 extensions:
```java
@Test
public void testMatrixMul() {
    int dim = 16384;
    INDArray a = Nd4j.createUninitialized(dim, dim);
    INDArray b = Nd4j.createUninitialized(dim, dim);
    Random r = new Random();
    for (int i = 0; i < dim; ++i) {
        for (int j = 0; j < dim; ++j) {
            a.put(i, j, r.nextDouble());
            b.put(i, j, r.nextDouble());
        }
    }
    long millis = System.currentTimeMillis();
    a.mul(b);
    System.out.println(System.currentTimeMillis() - millis);
}
```
Results: the multiplication with avx2 averaged 509 millis, while the generic binaries gave 490 millis. I was expecting a 2-3x speed-up from avx2 for a matrix multiplication of this size, but there is no indication of any improvement, no matter how hard I tried.
Obviously, I was using the same setup for the two tests: same computer, same configuration (except for the different backends). I was using version 1.0.0-M1.1.
Any thoughts on this?
@pza94 I would first do a warmup and use something like JMH for benchmarking to avoid the JIT warmup problem (rough sketch below). I would also advise trying onednn and seeing if that's relevant. Certain smaller problems may not see much of an improvement either.
Beyond that, could you mention the CPU you're running on? You may not see benefits depending on that either.
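Not from the original thread, just for illustration: a minimal JMH harness for the same elementwise multiplication could look roughly like this (class name, iteration counts and the @Param sizes are arbitrary choices, not project defaults):

```java
import java.util.concurrent.TimeUnit;

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
@State(Scope.Benchmark)
public class ElementwiseMulBenchmark {

    @Param({"4096", "16384"})
    int dim;

    INDArray a;
    INDArray b;

    @Setup
    public void setup() {
        // random inputs created once, outside the measured region
        a = Nd4j.rand(dim, dim);
        b = Nd4j.rand(dim, dim);
    }

    @Benchmark
    public INDArray elementwiseMul() {
        // same call as in the original test; returning the result keeps
        // JMH from eliminating the computation as dead code
        return a.mul(b);
    }
}
```

It can be run with the usual JMH tooling (e.g. the jmh maven plugin or `org.openjdk.jmh.Main`), which handles warmup iterations before the measured ones.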
I ran this test on an AMD Ryzen 5 3500U CPU: https://www.amd.com/en/products/apu/amd-ryzen-5-3500u. As for the size of the problem, a matrix multiplication of dim 16k is definitely in the territory where you should see improvements.
Do you guys have some benchmarks for AVX2 vs generic binaries in Nd4j, btw? I totally understand that the above test is insufficient for any benchmarking purposes, but I would still expect a proper speed up. Have you guys ever experienced any sort of performance improvement?
Is it possible that I need further configuration in the OS, hardware, JVM, wherever? I checked the docs and didn't find anything besides setting system props such as `-Dorg.bytedeco.javacpp.platform.extension=-avx2`.
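For reference, the other route (instead of the platform.extension property) is to pull the AVX2 natives via the maven classifier; a rough sketch only, with the classifier name taken from the nd4j CPU docs, so double-check it against the release you are using:

```xml
<!-- sketch: requesting the AVX2 build of the natives explicitly -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>1.0.0-M1.1</version>
    <classifier>linux-x86_64-avx2</classifier>
</dependency>
```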
> Zen supports AVX2 but it requires two clock cycles to complete each AVX2 instruction compared to Intel's one. This difference was corrected in Zen 2.
The 3500U is based on the first-gen Zen architecture, or rather Zen+, which is just a process node improvement, not a new architecture.
This is likely the reason why you don't see a lot of benefit from switching to AVX2.
@treo Interesting, could you please link the source of this info?
My understanding of AVX2 clearly seems limited. Does this basically mean that some CPUs are "just" compatible with AVX2, i.e. they can decode and execute the instructions, but depending on the implementation this brings no improvement in actual execution time?
That particular quote is from https://en.wikipedia.org/wiki/Zen_(first_generation_microarchitecture)#Performance
And yes, there are cases where, due to the way it is implemented, a CPU is just compatible enough to execute those binaries but doesn't really get any benefit from them.
Okay, thanks for the clarification. I would suggest mentioning this topic in the docs. We had only a vague idea of AVX2, but we kept seeing the warning message when loading nd4j that we should upgrade to AVX2, and we were surprised to see no benefit from doing so.
There are probably others who could run into this issue.
Update: I had my colleague run this same test on an AMD Ryzen 7 3700X (that is Zen 2). He also saw no improvement at all.
@pza94 can you try including onednn-avx and see if that helps? The issue might be with openblas pre-picking avx in any of the binaries despite what libnd4j has been compiled with. It would be good to verify some of this. Thanks for the help!
I can confirm that on my Ryzen 9 5900X (Zen 3) it also doesn't seem to make a difference.
@pza94 Thank you for highlighting the issue. We'll try to figure out why exactly that is happening even on systems where there should be a proper speedup.
We tried onednn as well, same result. Note: I just realized I used mul (pointwise) instead of mmul, but that wasn't the problem; we tested everything with mmul as well and had the same outcome.
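Side note, not from the thread: since the mul/mmul mix-up matters for where the work actually runs, here is a minimal illustrative snippet (class name and sizes are arbitrary). mul is the elementwise product handled by libnd4j's own loops, while mmul is the matrix product handed to the BLAS gemm kernels, which is where AVX2 would normally be expected to pay off:

```java
import java.util.Arrays;

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class MulVsMmul {
    public static void main(String[] args) {
        INDArray a = Nd4j.rand(1024, 1024);
        INDArray b = Nd4j.rand(1024, 1024);

        // mul: elementwise (Hadamard) product, computed by libnd4j itself;
        // memory bound, so wider SIMD barely changes wall-clock time
        INDArray pointwise = a.mul(b);

        // mmul: real matrix multiplication, delegated to the BLAS gemm kernels
        // (OpenBLAS / MKL / oneDNN)
        INDArray matmul = a.mmul(b);

        System.out.println(Arrays.toString(pointwise.shape())); // [1024, 1024]
        System.out.println(Arrays.toString(matmul.shape()));    // [1024, 1024]
    }
}
```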
libnd4j always tries to delegate some ops to well-optimized platform kernels. That matrix multiplication should be delegated to third-party/platform BLAS kernels (in this case it's mostly OpenBLAS). Also, check performance with OMP_NUM_THREADS=1. If you see such issues with other ops, then I could investigate whether that op was properly auto-vectorized or not. (We mostly rely on auto-vectorization with avx2; there might be places or builds where the compiler was not able to auto-vectorize properly.)
Just checked the generic case for elementwise multiplication (the example that @pza94 provided), as it is calculated by libnd4j itself. I got vectorized code (~~avx~~ sse) there as well. ~~Therefore no difference between the avx/avx2 one and the generic one while using floating point.~~
My maven config:

```xml
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>1.0.0-M1.1</version>
    <classifier>linux-x86_64</classifier>
</dependency>
```

Run perf to see the instructions:

```
perf record java -jar target/ml-1.0.0-SNAPSHOT-shaded.jar
```
Correction: the generic one is using the SSE mulps instruction, while the avx one is using vmulps.
But GCC adds extra instructions there (probably to work around some faulty chip arch), which might be the reason.
Adding the -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store flags should handle it, but we will only get a little speed-up, as mul is a memory-bound operation and, besides, our generic build is SSE-vectorized as well.
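To put rough numbers on "memory bound" (my own back-of-the-envelope, assuming float32): a 16384 x 16384 matrix is 16384 * 16384 * 4 bytes ≈ 1 GiB, and an elementwise multiply reads two such matrices and writes a third, i.e. roughly 3 GiB of memory traffic for about 268 million multiplications. At that ratio the cores mostly wait on RAM, so making each multiply instruction wider (SSE vs AVX2) hardly moves the wall-clock time.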
Well, adding -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store with 32-byte alignment did not give any noticeable change on an AMD CPU (Threadripper 3970X). As the op is memory bound, that is understandable.
~~But what surprised me is that when using workspaces libnd4j can parallelize better than in normal usage. That is an odd situation.~~
I just upgraded from my 7-year-old Intel 4790K processor to a new Ryzen 5700G, and I see training taking 18x longer, even though in general this CPU is 2.5x faster than the old one. I used the AVX2 classifier and it didn't speed things up.
@daviddbal were you using the same version of dl4j on the intel machine as well?
Yes. 1.0.0-beta7
@daviddbal sorry for asking again: so you were using 1.0.0-beta7 on both and there was a performance degradation, right? And were you using the openblas or the mkl one? intel sometimes does crazy things there.
@daviddbal @pza94 we'll be doing a follow-up release in the next week or so to address some of this. We need to chase down the root cause and clearly document the trade-offs after we do some benchmarking and run on different cpus. Thanks for chasing this down with us!
Yes, 1.0.0-beta7 on both machines. I'm now trying 1.0.0-M1.1, but I'm having trouble with my gradle build file:
I'm getting this error with 1.0.0-M1.1:

```
Caused by: java.lang.UnsatisfiedLinkError: /home/bal/.javacpp/cache/nd4j-native-1.0.0-M1.1-linux-x86_64.jar/org/nd4j/nativeblas/linux-x86_64/libjnind4jcpu.so: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by /home/bal/.javacpp/cache/nd4j-native-1.0.0-M1.1-linux-x86_64.jar/org/nd4j/nativeblas/linux-x86_64/libnd4jcpu.so)
```
Honestly, I didn't specify openblas or mkl on the Intel machine. Unfortunately, I didn't preserve any of the log outputs that would show me what was selected by default. I'm only now learning about those details since I now see such a huge performance degradation. For certain, I didn't set the MKL environment variable. I didn't specify AVX2 as a classifier.
@daviddbal openblas comes with nd4j-native as well as a transitive dependency. You can see the comprehensive list of dependencies here: https://github.com/eclipse/deeplearning4j/blob/master/nd4j/nd4j-backends/nd4j-backend-impls/nd4j-native-platform/pom.xml
If you have issues with using openblas or mkl, you can also include those as dependencies, the same way it's done in nd4j-native-platform (a rough sketch follows below).
As mentioned though, openblas is already there. You can see that here: https://github.com/eclipse/deeplearning4j/blob/master/nd4j/nd4j-backends/nd4j-backend-impls/nd4j-native/pom.xml#L60
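As an illustration only (artifact coordinates as published by the javacpp presets; the versions are placeholders, so use whatever your nd4j release actually depends on):

```xml
<!-- sketch: forcing a particular BLAS implementation by depending on it directly -->
<dependency>
    <groupId>org.bytedeco</groupId>
    <artifactId>openblas-platform</artifactId>
    <!-- placeholder: match the version your nd4j release uses -->
    <version>${openblas.version}</version>
</dependency>
<!-- or, alternatively, MKL -->
<dependency>
    <groupId>org.bytedeco</groupId>
    <artifactId>mkl-platform</artifactId>
    <version>${mkl.version}</version>
</dependency>
```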
Adding link to other thread for posterity just in case this comes up in future search results: https://community.konduit.ai/t/amd-ryzen-5000-cpu-poor-performance/1554/13
Just noting also for folks who aren't familiar with this problem, see an example fix by matlab: https://www.extremetech.com/computing/308501-crippled-no-longer-matlab-2020a-runs-amd-cpus-at-full-speed https://www.reddit.com/r/matlab/comments/dxn38s/howto_force_matlab_to_use_a_fast_codepath_on_amd/
In essence, the solution could be just adding:

```
export MKL_DEBUG_CPU_TYPE=5
```

as a workaround. We will try to figure out a transparent fix for this so the user doesn't have to use a workaround.
I set the environment variable on my AMD Ryzen computer (export MKL_DEBUG_CPU_TYPE=5). No effect. Still ~20x slower in training than my 7-year-old Intel CPU.
@pza94 @daviddbal
@quickwritereader and I dove into this and found that onednn was not being invoked properly. We did find some minor speedups on AMD, but they didn't amount to much yet. Instead, I focused on properly updating the c++ codebase to use a more recent (2.3.x) version of onednn that doesn't have any of the performance issues with AMD. You can see that pull request here: https://github.com/eclipse/deeplearning4j/pull/9423
Once this is merged, I would suggest taking snapshots for a spin. You can see a sample article here for AMD processors: https://www.phoronix.com/scan.php?page=news_item&px=Intel-oneDNN-2.2
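For anyone who wants to take the snapshots for a spin once that's merged, the usual maven setup is roughly the following (a sketch; double-check the dl4j snapshots documentation for the current repository URL and version string):

```xml
<repositories>
    <repository>
        <id>sonatype-snapshots</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>
<!-- then reference the snapshot version of the backend, e.g. -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>1.0.0-SNAPSHOT</version>
</dependency>
```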
When using GEMM with AMD processors, remember to look for USE_ONEDNN as one of the headers in addition to the other signals.
@pza94 @daviddbal we'll be doing a release soon and have just merged a pull request that addresses the visibility of the helpers to ensure they are being used. We will be producing follow up documentation and resolutions to this shortly. We've done the same for cuda and arm as well to ensure that any platform helpers can have clear confirmation they are being executed.
Seems like this has been fixed in subsequent versions.