
How to link BLIS with OpenMPI

banskt opened this issue 4 years ago · 23 comments

I am trying to run the HPL benchmark using different BLAS libraries (OpenBLAS, MKL, and BLIS). The problem is that I cannot get BLIS to work satisfactorily with OpenMPI. The numbers (GFlops) in the following table suggest that the BLIS library works with OpenMP (albeit worse than the other libraries) but cannot take advantage of OpenMPI. What am I doing wrong in the benchmark? Is it the compilation of BLIS, or something else? Basically, I do not know how to troubleshoot from here.

Table for GFlops

BLAS Library    OpenMP    OpenMPI
BLIS             313.6       91.7
OpenBLAS         361.6      343.3
MKL              327.9      314.6

Hardware

  • AMD Ryzen 7 3700X
  • Motherboard MSI MAG B550M Mortar
  • Memory 2x DDR4-3200 16GB (32GB total)
  • 1TB Samsung 970 EVO Plus NVMe M.2
  • NVIDIA GTX 1060 3GB GPU

Software

  • Ubuntu 20.04
  • GCC 10.2.0
  • OpenMPI 4.0.5
  • OpenBLAS 0.3.10
  • MKL 2020.4.304
  • BLIS

Here is my installation script for BLIS:

module load gcc/10.2.0
git clone https://github.com/flame/blis.git
cd blis
./configure -p /opt/amd/amd-blis-2.2-4 --enable-cblas --enable-threading=openmp CFLAGS="-march=znver2 -Ofast" zen2
make -j8
sudo make install

The full makefile can be seen here. Here are the relevant lines:

LAdir        = /opt/amd/amd-blis-2.2-4/lib
LAlib        = $(LAdir)/libblis.a
CC           = gcc
CCFLAGS      = $(HPL_DEFS) -std=c99 -march=znver2 -fomit-frame-pointer -Ofast -funroll-loops -W -Wall -fopenmp -Wno-misleading-indentation
LINKER       = gcc
LINKFLAGS    = $(CCFLAGS)

Script for running HPL

module load gcc/10.2.0 openmpi/gcc/4.0.5 amd/blis/2.2-4
export OMP_NUM_THREADS=1
mpirun -np 8 --map-by core --mca btl self,vader xhpl

banskt avatar Dec 04 '20 21:12 banskt

For the MPI run have you set either BLIS_NUM_THREADS or OMP_NUM_THREADS? The other thing to check would be the process affinity. Are all cores actually being utilized or are multiple processes piling up on one core?
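For example, a quick way to check the binding (a sketch, reusing the run command from above together with Open MPI's binding report):

export OMP_NUM_THREADS=1
mpirun -np 8 --map-by core --bind-to core --report-bindings ./xhpl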

devinamatthews avatar Dec 04 '20 21:12 devinamatthews

BTW, you should not set CFLAGS when configuring BLIS, although I don't think this is the problem.

devinamatthews avatar Dec 04 '20 21:12 devinamatthews

Thanks, I have set OMP_NUM_THREADS=1 (updated the main text). All the cores are being utilized. Here's a screenshot:

[screenshot: all cores being utilized]

Thanks for the compilation note on CFLAGS.

banskt avatar Dec 04 '20 22:12 banskt

Huh. My next thought would be to build BLIS without threading and try the MPI again.

devinamatthews avatar Dec 04 '20 22:12 devinamatthews

I just compiled BLIS without threading and tried with MPI again. It doesn't improve the benchmark.

./configure -p /opt/amd/amd-blis-2.2-4 --enable-cblas --disable-threading zen2

banskt avatar Dec 04 '20 23:12 banskt

@kvaragan do you have any experience with HPL on AMD? I'm not sure if this is an HPL-related issue, an AMD-related issue, or something else altogether.

devinamatthews avatar Dec 05 '20 15:12 devinamatthews

Update: I found that the ~92 GFlops obtained with BLIS + OpenMPI is approximately the same as the value obtained with single-core OpenMP.

banskt avatar Dec 06 '20 02:12 banskt

@banskt did you link OpenBLAS and MKL the exact same way as BLIS? You could even link generically to libblas.so and set it as a symlink to either library at runtime.

Also, are you using plain gcc to link the MPI version instead of mpicc?
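For example (a rough sketch; the paths are placeholders for wherever each library is installed):

sudo ln -sfn /opt/amd/amd-blis-2.2-4/lib/libblis.so /usr/local/lib/libblas.so
sudo ldconfig
# relink HPL against -L/usr/local/lib -lblas, then repoint the symlink to test another library
sudo ln -sfn /path/to/libopenblas.so /usr/local/lib/libblas.so
sudo ldconfig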

devinamatthews avatar Dec 06 '20 02:12 devinamatthews

@banskt When running pure MPI-based HPL using MKL or OpenBLAS, did you set OMP_NUM_THREADS=1? These libraries run multi-threaded by default, and when you set OMP_NUM_THREADS=1, the pure MPI-based performance of these libraries will be similar to BLIS. Can you confirm this?

Non-scaling issues are generally related to MPI binding settings. Maybe try running HPL as numactl -C 0-7 mpirun -np 8 ./xhpl

kvaragan avatar Dec 06 '20 04:12 kvaragan

@banskt I think you are using gcc to compile and link (CC= gcc and LINKER=gcc). Can you try with mpicc?
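For example, the relevant lines in the HPL Make.<arch> file would become (a sketch):

CC           = mpicc
LINKER       = mpicc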

sandiprt avatar Dec 07 '20 03:12 sandiprt

@devinamatthews

did you link OpenBLAS and MKL the exact same way as BLIS? You could even link generically to libblas.so and set it as a symlink to either library at runtime.

I used OpenBLAS and MKL the exact same way as BLIS. I did not use a symlink, but I explicitly set the path to libblas.so while compiling HPL.

Also, are you using plain gcc to link the MPI version instead of mpicc?

I have tried both gcc and mpicc.

@kvaragan

When running pure MPI-based HPL using MKL or OpenBLAS, did you set OMP_NUM_THREADS=1?

Yes, I set OMP_NUM_THREADS=1 while running MPI-based HPL with MKL and OpenBLAS. I have set:

export OMP_PROC_BIND=TRUE
export OMP_PLACES=cores
export OMP_NUM_THREADS=1
export BLIS_IR_NT=1
export BLIS_JR_NT=1
export BLIS_IC_NT=1
export BLIS_JC_NT=1

I am using these environment variables as default on all MPI tests, unless otherwise stated. For OpenMP tests, I am using BLIS_IC_NT=8 and OMP_NUM_THREADS=8.
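As an aside, I understand BLIS also reads a single aggregate threading variable, so the same intent could be expressed more simply (a sketch):

# MPI runs: one BLIS thread per rank
export BLIS_NUM_THREADS=1
# OpenMP run: let BLIS partition 8 threads across its loops
# export BLIS_NUM_THREADS=8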

These libraries run multi-threaded by default, and when you set OMP_NUM_THREADS=1, the pure MPI-based performance of these libraries will be similar to BLIS. Can you confirm this?

The performance of all the libraries (BLIS, OpenBLAS, and MKL) is the same when I set OMP_NUM_THREADS=1 and run using mpirun -np 1 ./xhpl; I get similar output for all of them (60-90 GFlops).

Note: the pure MPI performance of OpenBLAS and MKL is better than that of BLIS -- that is the issue we are dealing with here. All methods perform similarly with multithreading. However, as I understand your comment, you doubt whether the improvement in OpenBLAS and MKL is due to multithreading or to pure MPI. Hence, I used a single core to show that the thread settings are being applied correctly. Is there any other way to verify that the performance of OpenBLAS and MKL is due to pure MPI and not to multithreading?

Non-scaling issues are generally related to MPI binding settings. Maybe try running HPL as numactl -C 0-7 mpirun -np 8 ./xhpl

When I ran it as numactl -C 0-7 mpirun -np 8 ./xhpl, I found that every core gets ~25% usage but the overall performance is the same, producing ~15 GFlops in the benchmark. I used mpicc for compiling HPL and OMP_NUM_THREADS=1.

[screenshot: every core at ~25% usage]

@sandiprt

I think you are using gcc to compile and link (CC= gcc and LINKER=gcc). Can you try with mpicc?

I have tried using mpicc and the results are the same.

banskt avatar Dec 07 '20 17:12 banskt

The MPI behavior is puzzling, especially when you have configured BLIS without threading.

For OpenMP, though, do not set any BLIS_XX_NT variables. Performance will be better with just OMP_NUM_THREADS.
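That is, something like (a sketch):

unset BLIS_IR_NT BLIS_JR_NT BLIS_IC_NT BLIS_JC_NT
export OMP_NUM_THREADS=8
./xhpl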

devinamatthews avatar Dec 07 '20 17:12 devinamatthews

@banskt do you have a master Makefile that can build each of the 6 different configurations ({BLIS,OpenBLAS,MKL}x{OpenMP,MPI})? If so please share it.

devinamatthews avatar Dec 07 '20 17:12 devinamatthews

I am sorry for the late update @devinamatthews

The Makefiles for HPL are attached. Each of them is run separately. I know it's redundant, but for the sake of completeness, here is an example of running BLIS with OpenMP and with MPI:

ARCH="Linux_ZEN2_BLIS"
BUILDDIR="~/Documents/hpl-benchmark/hpl-build/${BUILDDIR}"
tar -zxf hpl-2.3.tar.gz -C ${BUILDDIR}
cd ${BUILDDIR}
cp ~/Documents/hpl-benchmark/hpl-makefiles/Make.${ARCH} .
make -j8 arch=${ARCH}
cp ~/Documents/hpl-benchmark/hpl-makefiles/HPL-mpi1-omp8.dat bin/${ARCH}/HPL.dat
cd bin/${ARCH}
export OMP_NUM_THREADS=8
./xhpl
unset OMP_NUM_THREADS

ARCH="Linux_ZEN2_BLIS_MPI"
BUILDDIR="~/Documents/hpl-benchmark/hpl-build/${BUILDDIR}"
tar -zxf hpl-2.3.tar.gz -C ${BUILDDIR}
cd ${BUILDDIR}
cp ~/Documents/hpl-benchmark/hpl-makefiles/Make.${ARCH} .
make -j8 arch=${ARCH}
cp ~/Documents/hpl-benchmark/hpl-makefiles/HPL-mpi8-omp1.dat bin/${ARCH}/HPL.dat
cd bin/${ARCH}
export OMP_PROC_BIND=TRUE
export OMP_PLACES=cores
export OMP_NUM_THREADS=1
mpirun -np 8 --map-by core --mca btl self,vader xhpl
unset OMP_NUM_THREADS
unset OMP_PLACES
unset OMP_PROC_BIND

You have to change the Makefiles to reflect the correct file paths on your system for LAdir and LAlib.
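A quick sanity check of the paths before building (a sketch; the file names are the ones used in the example above):

ls /opt/amd/amd-blis-2.2-4/lib/libblis.a
grep -E '^(LAdir|LAlib)' Make.Linux_ZEN2_BLIS Make.Linux_ZEN2_BLIS_MPI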

hpl-makefiles.tar.gz

banskt avatar Dec 07 '20 17:12 banskt

@devinamatthews did you manage to run the HPL benchmark?

banskt avatar Jan 04 '21 20:01 banskt

@banskt I am stumped and I do not have time to actually try to run and play around with it. It would be helpful if you could produce ONE single tar.gz that contains everything needed (including BLIS/HPL sources, or the script could download the right version) and which contains ONE single script that will build and run everything. @kvaragan can you help?

devinamatthews avatar Jan 04 '21 21:01 devinamatthews

@banskt I am stumped and I do not have time to actually try to run and play around with it. It would be helpful if you could produce ONE single tar.gz that contains everything needed (including BLIS/HPL sources, or the script could download the right version) and which contains ONE single script that will build and run everything. @kvaragan can you help?

Ah, I see. I am sorry to bother you. I will prepare a single command to run everything.

banskt avatar Jan 04 '21 21:01 banskt

@banskt - can you write a mail to [email protected]? They will help you.

kvaragan avatar Jan 05 '21 03:01 kvaragan

I met the same problem when I wanted to run my TensorFlow application with BLIS, but I found that directly replacing MKL with blis.so did not work. I suspect the reason may be differences between the MKL API and the BLIS API.

References: the BLIS typed API documentation (docs/BLISTypedAPI.md in flame/blis) and the Intel oneAPI Math Kernel Library developer reference (PBLAS).

@banskt @kvaragan @sandiprt If there is a solution, can you update this thread? Thanks.

seuwins avatar May 05 '21 03:05 seuwins

BLIS and MKL support the standard BLAS interface, so the interface is definitely not the reason. Let me check with my team; they tried it in the past. @shrutiramesh1988 can you help?
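For example (a sketch; the install prefixes and extra link flags are placeholders and depend on how each library was built), the same object file links against any of the three through the standard BLAS symbols:

gcc -c app.c -o app.o   # app.c: any code calling standard BLAS routines such as dgemm_
gcc app.o -L/opt/amd/amd-blis-2.2-4/lib -lblis -fopenmp -lm -o app_blis
gcc app.o -L/path/to/openblas/lib -lopenblas -o app_openblas
gcc app.o -L${MKLROOT}/lib/intel64 -lmkl_rt -o app_mkl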

kvaragan avatar May 05 '21 05:05 kvaragan

BLIS and MKL support the standard BLAS interface, so the interface is definitely not the reason. Let me check with my team; they tried it in the past. @shrutiramesh1988 can you help?

If you can explain how the TensorFlow CPU version can link against the standard libblis.so, I would appreciate it. I am currently developing on an AMD Rome CPU platform.

TensorFlow 2.4.0, Python 3.6.9, pip 9.0.1, Bazel 3.1.0, GCC 10.2.0, OS: CentOS 7

seuwins avatar May 05 '21 09:05 seuwins

I came to know that it is difficult to link BLIS directly with TensorFlow, since the MKL-DNN path uses MKL by default. It was suggested to use ZenDNN with TF in order to link BLIS; ZenDNN links BLIS by default. If you want BLIS to work with TF, one way is to get the TF+ZenDNN build available at: https://developer.amd.com/zendnn/

kvaragan avatar May 05 '21 12:05 kvaragan

I came to know that it is difficult to link BLIS directly with TensorFlow, since the MKL-DNN path uses MKL by default. It was suggested to use ZenDNN with TF in order to link BLIS; ZenDNN links BLIS by default. If you want BLIS to work with TF, one way is to get the TF+ZenDNN build available at: https://developer.amd.com/zendnn/

Thanks for your advice. I built TF_v1.15_ZenDNN_v3.0 successfully, but I see only the Python library in /usr/local/lib/python3.7/site-packages/tensorflow_core. My application code is C++, so I cannot compile or bazel-build a C++ app without tensorflow_cc.so.

Below are the files after pip3 install tensorflow-1.15.5-cp37-cp37m-linux_x86_64.whl:

[screenshot of the installed files]

seuwins avatar May 06 '21 14:05 seuwins