deeplearning4j
AVX2 brings no performance improvement?
Hey guys,
we recently started using the avx2 backend (as opposed to the generic x86-64 one), and to our surprise we saw no performance improvement whatsoever; in fact avx2 seemed to be ever so slightly slower. At first we thought the problem was with our ML algorithms, i.e. that they could not really take advantage of any sort of SIMD. But we ran some tests and found that the issue might not be with our algorithms. Namely, we ran this unit test with and without the avx2 extensions:
```java
@Test
public void testMatrixMul() {
    int dim = 16384;
    INDArray a = Nd4j.createUninitialized(dim, dim);
    INDArray b = Nd4j.createUninitialized(dim, dim);
    Random r = new Random();
    for (int i = 0; i < dim; ++i) {
        for (int j = 0; j < dim; ++j) {
            a.put(i, j, r.nextDouble());
            b.put(i, j, r.nextDouble());
        }
    }
    long millis = System.currentTimeMillis();
    a.mul(b);
    System.out.println(System.currentTimeMillis() - millis);
}
```
Results: the multiplication with avx2 averaged 509 millis, while the generic binaries gave 490 millis. I was expecting a 2-3x speed-up from avx2 for a matrix multiplication of this size, but there is no indication of any improvement, no matter how hard I tried.
Obviously, I was using the same setup for the two tests: same computer, same configuration (except for the different backends). I was using version 1.0.0-M1.1.
Any thoughts on this?
@pza94 I would first do a warmup and use something like JMH for benchmarking to avoid the JIT warmup problem (rough sketch below). I would also advise trying onednn and seeing if that's relevant. Certain smaller problems may not see much of an improvement either.
Beyond that, could you mention the CPU you're running on? You may not see benefits depending on that either.
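Not from the original thread, just for illustration: a minimal JMH harness for the same elementwise multiplication could look roughly like this (class name, iteration counts and the @Param sizes are arbitrary choices, not project defaults):

```java
import java.util.concurrent.TimeUnit;

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
@State(Scope.Benchmark)
public class ElementwiseMulBenchmark {

    @Param({"4096", "16384"})
    int dim;

    INDArray a;
    INDArray b;

    @Setup
    public void setup() {
        // random inputs created once, outside the measured region
        a = Nd4j.rand(dim, dim);
        b = Nd4j.rand(dim, dim);
    }

    @Benchmark
    public INDArray elementwiseMul() {
        // same call as in the original test; returning the result keeps
        // JMH from eliminating the computation as dead code
        return a.mul(b);
    }
}
```

It can be run with the usual JMH tooling (e.g. the jmh maven plugin or `org.openjdk.jmh.Main`), which handles warmup iterations before the measured ones.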
I ran this test on an AMD Ryzen 5 3500U CPU: https://www.amd.com/en/products/apu/amd-ryzen-5-3500u. As for the size of the problem, a matrix multiplication of dim 16k is definitely in the territory where you should see improvements.
Do you guys have some benchmarks for AVX2 vs generic binaries in Nd4j, btw? I totally understand that the above test is insufficient for any benchmarking purposes, but I would still expect a proper speed up. Have you guys ever experienced any sort of performance improvement?
Is it possible that I need further configuration in the OS, hardware, JVM, wherever? I checked the docs and didn't find anything besides setting system props such as `-Dorg.bytedeco.javacpp.platform.extension=-avx2`.
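For reference, the other route (instead of the platform.extension property) is to pull the AVX2 natives via the maven classifier; a rough sketch only, with the classifier name taken from the nd4j CPU docs, so double-check it against the release you are using:

```xml
<!-- sketch: requesting the AVX2 build of the natives explicitly -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>1.0.0-M1.1</version>
    <classifier>linux-x86_64-avx2</classifier>
</dependency>
```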
> Zen supports AVX2 but it requires two clock cycles to complete each AVX2 instruction compared to Intel's one. This difference was corrected in Zen 2.
The 3500U is based on the first-gen Zen architecture, or rather Zen+, which is just a process node improvement, not a new architecture.
This is likely the reason why you don't see a lot of benefit from switching to AVX2.
@treo Interesting, could you please link the source of this info?
My understanding of AVX2 clearly seems limited. Does this basically mean that some CPUs are "just" compatible with AVX2, i.e. they can decode and execute the instructions, but depending on the implementation this brings no improvement in actual execution time?
That particular quote is from https://en.wikipedia.org/wiki/Zen_(first_generation_microarchitecture)#Performance
And yes, there are cases where, due to the way it is implemented, a CPU is just compatible enough to execute those binaries but doesn't really get any benefit from them.
Okay, thanks for the clarification. I would suggest mentioning this topic in the docs. We had only a vague idea of AVX2, but we kept seeing the warning message when loading nd4j that we should upgrade to AVX2, and we were surprised to see no benefit from doing so.
There are probably others who could run into this issue.
Update: I had my colleague run this same test on an AMD Ryzen 7 3700X (that is Zen 2). He also saw no improvement at all.
@pza94 can you try including onednn-avx and see if that helps? The issue might be with openblas pre-picking avx in any of the binaries despite what libnd4j has been compiled with. It would be good to verify some of this. Thanks for the help!
I can confirm that on my Ryzen 9 5900X (Zen 3) it also doesn't seem to make a difference.
@pza94 Thank you for highlighting the issue. We'll try to figure out why exactly that is happening even on systems where there should be a proper speedup.
We tried onednn as well, same result. Note: I just realized I used mul (pointwise) instead of mmul, but that wasn't the problem; we tested everything with mmul as well and had the same outcome.
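Side note, not from the thread: since the mul/mmul mix-up matters for where the work actually runs, here is a minimal illustrative snippet (class name and sizes are arbitrary). mul is the elementwise product handled by libnd4j's own loops, while mmul is the matrix product handed to the BLAS gemm kernels, which is where AVX2 would normally be expected to pay off:

```java
import java.util.Arrays;

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class MulVsMmul {
    public static void main(String[] args) {
        INDArray a = Nd4j.rand(1024, 1024);
        INDArray b = Nd4j.rand(1024, 1024);

        // mul: elementwise (Hadamard) product, computed by libnd4j itself;
        // memory bound, so wider SIMD barely changes wall-clock time
        INDArray pointwise = a.mul(b);

        // mmul: real matrix multiplication, delegated to the BLAS gemm kernels
        // (OpenBLAS / MKL / oneDNN)
        INDArray matmul = a.mmul(b);

        System.out.println(Arrays.toString(pointwise.shape())); // [1024, 1024]
        System.out.println(Arrays.toString(matmul.shape()));    // [1024, 1024]
    }
}
```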
libnd4j always tries to delegate some ops to well-optimized platform kernels. That matrix multiplication should be delegated to third-party/platform BLAS kernels (in this case it's mostly OpenBLAS). Also, check performance with OMP_NUM_THREADS=1. If you see such issues with other ops, then I could investigate whether that op was properly auto-vectorized or not. (We mostly rely on auto-vectorization with avx2; there might be places or builds where the compiler was not able to auto-vectorize properly.)
Just checked the generic case for elementwise multiplication (the example that @pza94 provided), as it is calculated by libnd4j itself. I got vectorized code (~~avx~~ sse) there as well. ~~Therefore no difference between the avx/avx2 one and the generic one while using floating point.~~
My maven config:

```xml
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>1.0.0-M1.1</version>
    <classifier>linux-x86_64</classifier>
</dependency>
```

Run perf to see the instructions:

```
perf record java -jar target/ml-1.0.0-SNAPSHOT-shaded.jar
```
Correction: the generic one is using the SSE mulps instruction, while the avx one is using vmulps.
But GCC adds extra instructions there (probably to work around some faulty chip arch), which might be the reason.
Adding the -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store flags should handle it, but we will only get a little speed-up, as mul is a memory-bound operation and, besides, our generic build is SSE-vectorized as well.
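To put rough numbers on "memory bound" (my own back-of-the-envelope, assuming float32): a 16384 x 16384 matrix is 16384 * 16384 * 4 bytes ≈ 1 GiB, and an elementwise multiply reads two such matrices and writes a third, i.e. roughly 3 GiB of memory traffic for about 268 million multiplications. At that ratio the cores mostly wait on RAM, so making each multiply instruction wider (SSE vs AVX2) hardly moves the wall-clock time.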
Well, adding -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store with 32-byte alignment did not give any noticeable change on an AMD CPU (Threadripper 3970X). As the op is memory bound, that is understandable.
~~But what surprised me is that when using workspaces libnd4j can parallelize better than in normal usage. That is an odd situation.~~
I just upgraded from my 7-year-old Intel 4790K processor to a new Ryzen 5700G, and I see training taking 18x longer, even though in general this CPU is 2.5x faster than the old one. I used the AVX2 classifier and it didn't speed things up.
@daviddbal were you using the same version of dl4j on the intel machine as well?
Yes. 1.0.0-beta7
@daviddbal sorry for asking again: so you were using 1.0.0-beta7 on both and there was a performance degradation, right? And were you using the openblas or the mkl one? intel sometimes does crazy things there.
@daviddbal @pza94 we'll be doing a follow-up release in the next week or so to address some of this. We need to chase down the root cause and clearly document the trade-offs after we do some benchmarking and run on different cpus. Thanks for chasing this down with us!
Yes, 1.0.0-beta7 on both machines. I'm now trying 1.0.0-M1.1, but I'm having trouble with my gradle build file:
I'm getting this error with 1.0.0-M1.1:

```
Caused by: java.lang.UnsatisfiedLinkError: /home/bal/.javacpp/cache/nd4j-native-1.0.0-M1.1-linux-x86_64.jar/org/nd4j/nativeblas/linux-x86_64/libjnind4jcpu.so: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by /home/bal/.javacpp/cache/nd4j-native-1.0.0-M1.1-linux-x86_64.jar/org/nd4j/nativeblas/linux-x86_64/libnd4jcpu.so)
```
Honestly, I didn't specify openblas or mkl on the Intel machine. Unfortunately, I didn't preserve any of the log outputs that would show me what was selected by default. I'm only now learning about those details since I now see such a huge performance degradation. For certain, I didn't set the MKL environment variable. I didn't specify AVX2 as a classifier.
@daviddbal openblas comes with nd4j-native as well as a transitive dependency. You can see the comprehensive list of dependencies here: https://github.com/eclipse/deeplearning4j/blob/master/nd4j/nd4j-backends/nd4j-backend-impls/nd4j-native-platform/pom.xml
If you have issues with using openblas or mkl, you can also include those as dependencies, the same way it's done in nd4j-native-platform (a rough sketch follows below).
As mentioned though, openblas is already there. You can see that here: https://github.com/eclipse/deeplearning4j/blob/master/nd4j/nd4j-backends/nd4j-backend-impls/nd4j-native/pom.xml#L60
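As an illustration only (artifact coordinates as published by the javacpp presets; the versions are placeholders, so use whatever your nd4j release actually depends on):

```xml
<!-- sketch: forcing a particular BLAS implementation by depending on it directly -->
<dependency>
    <groupId>org.bytedeco</groupId>
    <artifactId>openblas-platform</artifactId>
    <!-- placeholder: match the version your nd4j release uses -->
    <version>${openblas.version}</version>
</dependency>
<!-- or, alternatively, MKL -->
<dependency>
    <groupId>org.bytedeco</groupId>
    <artifactId>mkl-platform</artifactId>
    <version>${mkl.version}</version>
</dependency>
```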
Adding link to other thread for posterity just in case this comes up in future search results: https://community.konduit.ai/t/amd-ryzen-5000-cpu-poor-performance/1554/13
Just noting also for folks who aren't familiar with this problem, see an example fix by matlab: https://www.extremetech.com/computing/308501-crippled-no-longer-matlab-2020a-runs-amd-cpus-at-full-speed https://www.reddit.com/r/matlab/comments/dxn38s/howto_force_matlab_to_use_a_fast_codepath_on_amd/
In essence, the solution could be just adding:

```
export MKL_DEBUG_CPU_TYPE=5
```

as a workaround. We will try to figure out a transparent fix for this so the user doesn't have to use a workaround.
I set the environment variable on my AMD Ryzen computer (export MKL_DEBUG_CPU_TYPE=5). No effect. Still ~20x slower in training than my 7-year-old Intel CPU.
@pza94 @daviddbal
@quickwritereader and I dove into this and found that onednn was not being invoked properly. We did find some minor speedups on AMD, but they didn't amount to much yet. Instead, I focused on properly updating the c++ codebase to use a more recent (2.3.x) version of onednn that doesn't have any of the performance issues with AMD. You can see that pull request here: https://github.com/eclipse/deeplearning4j/pull/9423
Once this is merged, I would suggest taking snapshots for a spin. You can see a sample article here for AMD processors: https://www.phoronix.com/scan.php?page=news_item&px=Intel-oneDNN-2.2
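For anyone who wants to take the snapshots for a spin once that's merged, the usual maven setup is roughly the following (a sketch; double-check the dl4j snapshots documentation for the current repository URL and version string):

```xml
<repositories>
    <repository>
        <id>sonatype-snapshots</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>
<!-- then reference the snapshot version of the backend, e.g. -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>1.0.0-SNAPSHOT</version>
</dependency>
```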
When using GEMM with AMD processors, remember to look for USE_ONEDNN as one of the headers in addition to the other signals.
@pza94 @daviddbal we'll be doing a release soon and have just merged a pull request that addresses the visibility of the helpers to ensure they are being used. We will be producing follow up documentation and resolutions to this shortly. We've done the same for cuda and arm as well to ensure that any platform helpers can have clear confirmation they are being executed.
Seems like this has been fixed in subsequent versions.