Performance issue on Ookami

Open QingleiCao opened this issue 3 years ago • 12 comments

I tested BLIS dgemm on Ookami, but single-core performance stays below 50 Gflop/s for matrices up to 5000 * 5000. Are there any tips? I followed the steps on the website, which shows that about 70 Gflop/s per core can be achieved.

QingleiCao avatar Mar 01 '22 15:03 QingleiCao

@xrq-phys any tips?

devinamatthews avatar Mar 01 '22 16:03 devinamatthews

Sorry for the late reply.

SC Fugaku runs at 2.2GHz while A64FX chip itself supports 1.6, 1.8, 2.0 and 2.2GHz frequencies. Perhaps that's the reason?

xrq-phys avatar Mar 09 '22 17:03 xrq-phys

This says 1.8 GHz.

devinamatthews avatar Mar 09 '22 17:03 devinamatthews

Also, @QingleiCao have you tried SSL2 for comparison?

devinamatthews avatar Mar 09 '22 17:03 devinamatthews

Thanks. It's still a bit strange then. 1.8GHz should yield ~53 GFLOPS instead of a bit lower than 50. And M=N=K=5000 should not hit any upper size limit.
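For reference, the clock-scaling arithmetic behind that estimate can be sketched as follows. This is a minimal sketch, not something posted in the thread: it assumes each A64FX core has two 512-bit SVE FMA pipelines, and the function names are made up for illustration. The 65 Gflop/s input is only an inferred value that reproduces the ~53 figure when scaled from 2.2 GHz to 1.8 GHz; it is not a measured number from this issue.

```python
# Assumption: two 512-bit SVE FMA pipelines per A64FX core.
FMA_PIPES = 2
DOUBLES_PER_VECTOR = 512 // 64   # 8 doubles per 512-bit vector
FLOPS_PER_FMA = 2                # one multiply plus one add


def peak_gflops(ghz):
    """Theoretical double-precision peak of one core at the given clock."""
    return FMA_PIPES * DOUBLES_PER_VECTOR * FLOPS_PER_FMA * ghz


def scale_by_clock(gflops, from_ghz, to_ghz):
    """Scale a measured rate linearly with clock frequency."""
    return gflops * to_ghz / from_ghz


if __name__ == "__main__":
    for ghz in (1.8, 2.2):
        print(f"{ghz} GHz peak: {peak_gflops(ghz):.1f} Gflop/s per core")
    # Hypothetical 2.2 GHz rate, chosen only to illustrate the scaling.
    print(f"65 Gflop/s at 2.2 GHz scales to "
          f"{scale_by_clock(65.0, 2.2, 1.8):.1f} Gflop/s at 1.8 GHz")
```

Under these assumptions the per-core peak at 1.8 GHz is 57.6 Gflop/s, so a result just under 50 is about 85% of peak.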

xrq-phys avatar Mar 09 '22 17:03 xrq-phys

Thanks for your explanation. Yes, the performance of SSL2 is similar to what this website shows: about 64 Gflop/s on a single core and about 60 Gflop/s per core on 48 cores using OpenMP.

Actually, the problem we are facing with SSL2 arises when using pthreads instead of OpenMP, which requires disabling the sector cache and leads to a 20% to 30% decrease in performance.

Right now, there are some issues with our Fugaku account. I will test BLIS on Fugaku later and keep you posted.

BTW, do you have any performance results for BLIS with pthreads? I tested single-node DGEMM on Ookami from DPLASMA linked against BLIS, but it only reached about 1450 Gflop/s per node (matrix sizes up to 25,000 with tile sizes of 320, 360, 400, 600 and 800).

QingleiCao avatar Mar 11 '22 20:03 QingleiCao

> BTW, do you have any performance results for BLIS with pthreads? I tested single-node DGEMM on Ookami from DPLASMA linked against BLIS, but it only reached about 1450 Gflop/s per node (matrix sizes up to 25,000 with tile sizes of 320, 360, 400, 600 and 800).

BLIS performance with pthreads is usually not very good except for very large matrices since we don't maintain a thread pool. If DPLASMA is handling parallelism then you could just use single-threaded BLIS.

devinamatthews avatar Mar 11 '22 20:03 devinamatthews

Thanks for your quick reply. Yes, that's the configuration in DPLASMA: single-threaded BLIS kernels driven by pthreads. Is it OK to call many single-threaded BLIS kernels, e.g., dgemm, simultaneously?

QingleiCao avatar Mar 11 '22 20:03 QingleiCao

There shouldn't be any specific problems with that. If you link SSL2 does it get better performance?

devinamatthews avatar Mar 11 '22 20:03 devinamatthews

Not sure whether SSL2 is available on Ookami, but on Fugaku it's not as good as with OpenMP because of the sector cache issue. I'm not sure of the details, but maybe it's related to cache reuse. I will keep you posted once I get the BLIS results on Fugaku.

QingleiCao avatar Mar 11 '22 21:03 QingleiCao

I see. SSL2 is said to have a very restrictive licensing policy.

Does "less than 50GFlops single core" mean "less than 50GFlops per core under a 48-thread pthread run"?

xrq-phys avatar Mar 13 '22 01:03 xrq-phys

Nope, it's just run on a single core. The test is simple: (1) initialize A, B, and C; (2) start the timer; (3) call DGEMM from the single-threaded version of BLIS; (4) stop the timer. Here are the detailed performance numbers.

    [qincao@fj172 ~]$ ./a.out 5000
    5000 5000 5000 : 5.134509 seconds 48.690147 Gflop/s
    5000 5000 5000 : 5.097357 seconds 49.045025 Gflop/s
    5000 5000 5000 : 5.429460 seconds 46.045095 Gflop/s
    [qincao@fj172 ~]$ ./a.out 4000
    4000 4000 4000 : 2.578140 seconds 49.648196 Gflop/s
    4000 4000 4000 : 2.712698 seconds 47.185496 Gflop/s
    4000 4000 4000 : 2.558646 seconds 50.026459 Gflop/s
    [qincao@fj172 ~]$ ./a.out 3000
    3000 3000 3000 : 1.132928 seconds 47.664106 Gflop/s
    3000 3000 3000 : 1.120047 seconds 48.212263 Gflop/s
    3000 3000 3000 : 1.120782 seconds 48.180645 Gflop/s
    [qincao@fj172 ~]$ ./a.out 2000
    2000 2000 2000 : 0.337060 seconds 47.469293 Gflop/s
    2000 2000 2000 : 0.330443 seconds 48.419849 Gflop/s
    2000 2000 2000 : 0.330329 seconds 48.436559 Gflop/s
    [qincao@fj172 ~]$ ./a.out 1000
    1000 1000 1000 : 0.046236 seconds 43.256337 Gflop/s
    1000 1000 1000 : 0.044159 seconds 45.290881 Gflop/s
    1000 1000 1000 : 0.044110 seconds 45.341192 Gflop/s
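The Gflop/s column above is consistent with the usual 2*M*N*K flop count for DGEMM divided by the elapsed time. The sketch below recomputes it from the first timing of each size; it is an illustration only, the benchmark source was not posted, and `dgemm_gflops` and `PEAK_1P8_GHZ` are names made up here. The 57.6 Gflop/s peak assumes 32 double-precision flops per cycle per core at 1.8 GHz.

```python
# Assumed per-core peak at 1.8 GHz (32 DP flops/cycle); not from the thread.
PEAK_1P8_GHZ = 57.6


def dgemm_gflops(m, n, k, seconds):
    """Rate of C := A*B + C counted as 2*m*n*k floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e9


# First timing reported above for each matrix size: (size, seconds).
first_runs = [
    (5000, 5.134509),
    (4000, 2.578140),
    (3000, 1.132928),
    (2000, 0.337060),
    (1000, 0.046236),
]

if __name__ == "__main__":
    for size, seconds in first_runs:
        rate = dgemm_gflops(size, size, size, seconds)
        share = 100.0 * rate / PEAK_1P8_GHZ
        print(f"{size:5d}: {rate:6.2f} Gflop/s ({share:.0f}% of assumed peak)")
```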

QingleiCao avatar Mar 14 '22 15:03 QingleiCao