Performance problem of HMC with GPU
Hello,
the performance of HMC (Moebius fermion with stout smearing) with GPU is much slower than expected. I would expect it to be roughly two times faster than on 1 KNL node, but it is about 5 times slower. Could anyone give me any suggestions or comments?
I suspect two things: my configuration is improper, or some parts of HMC on the host CPU might be the bottleneck. Attached files: grid.configure.summary, logs from benchmarks/Benchmark_*
grid: feature/gpu-port 12eb2a6a34
gpu: Tesla V100-PCIE-32GB
host: Skylake Gold 6242 (1 CPU + 1 GPU)
cuda 10.2
g++ 6.5.0

my configuration:
export MPICXX=g++
export CXX=nvcc
export CXXFLAGS="-ccbin g++ -arch=sm_70 -gencode=arch=compute_70,code=compute_70 -std=c++11"
export LDFLAGS=
../configure --enable-precision=double --enable-simd=GPU --enable-comms=none --enable-gen-simd-width=64 --enable-openmp
Hi Isaaku, can you identify where in the HMC it is running slow? It prints continual timestamps, so it should be easy to compare the logs. Peter
Hi Peter,
thank you very much for taking your time. The force and smearing are slow; the solvers are fine. Here is an example comparison with 6 KNL nodes (with exactly the same parameters, including random numbers).
rational:
force 10676 ms (smearing 487.492 ms, solver 1396ms) [6 KNL]
--> 467294 ms (smearing 13491 ms, solver 2253ms) [gpu]
quotient:
force 1702.03 ms (smearing 415.55 ms, solver 929ms)
--> 34568.5ms (smearing 13530 ms, solver 1993ms)
gauge:
force 119.954 ms (no smearing)
--> 3463.35 ms
The corresponding logs from the GPU run:
(rational)
Grid : Message : 70.376951 s : CGMultiShift: shift[15] true residual 4.76531e-12
Grid : Message : 70.376956 s : Time Breakdown
Grid : Message : 70.376957 s : Elapsed 2.252818 s
Grid : Message : 70.376958 s : AXPY 0.631815 s
Grid : Message : 70.376959 s : Marix 1.414250 s
Grid : Message : 70.376960 s : Shift 0.000000 s
Grid : Integrator : 513.192394 s : Smearing (on/off): true
Grid : Message : 526.683984 s : Smearing force in 13491.6 ms
Grid : Integrator : 526.703793 s : [0][0] Force average: 0.0785065
Grid : Message : 526.735575 s : [0][0] P update elapsed time: 467617 ms (force: 467294 ms)
(quotient)
Grid : Message : 564.467300 s : ConjugateGradient Converged on iteration 75
Grid : Message : 564.467312 s : Computed residual 8.80529e-11
Grid : Message : 564.467322 s : True residual 8.80529e-11
Grid : Message : 564.467323 s : Target 1e-10
Grid : Message : 564.467324 s : Time breakdown
Grid : Message : 564.467325 s : Elapsed 1.993376 s
Grid : Message : 564.467326 s : Matrix 1.776804 s
Grid : Message : 564.467327 s : Linalg 0.216468 s
Grid : Message : 564.467328 s : Inner 0.565750 s
Grid : Message : 564.467330 s : AxpyNorm 0.623110 s
Grid : Message : 564.467331 s : LinearComb 0.974960 s
Grid : Integrator : 582.867661 s : Smearing (on/off): true
Grid : Message : 596.397717 s : Smearing force in 13530 ms
Grid : Integrator : 596.418235 s : [1][0] Force average: 0.175014
Grid : Message : 596.420355 s : [1][0] P update elapsed time: 34881.6 ms (force: 34568.5 ms)
(gauge)
Grid : Integrator : 599.870055 s : Smearing (on/off): false
Grid : Integrator : 599.882313 s : [2][0] Force average: 7.09722
Grid : Message : 599.884545 s : [2][0] P update elapsed time: 3463.35 ms (force: 3151.85 ms)
OK, I suspect thrashing between host and GPU in the Cshift, which I haven't optimised. Will try to optimise that and see if it helps.
Cshift is confirmed to be slow indeed; I have been working on optimising it. I have a much faster version in most cases, but it is not in develop yet. Will get back to you.
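For context, a rough CUDA sketch (not Grid code, names here are purely illustrative) of the kind of host/GPU thrashing suspected: a managed buffer that is touched alternately by the host and the device forces page migrations in both directions on every pass.

#include <cstdio>
#include <cuda_runtime.h>

// Trivial device kernel that touches every element of the buffer.
__global__ void scale(double *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= 2.0;
}

int main() {
  const int n = 1 << 20;
  double *x = nullptr;
  // Managed (UVM) allocation: pages migrate to whichever side touches them.
  cudaMallocManaged(&x, n * sizeof(double));
  for (int i = 0; i < n; i++) x[i] = 1.0;

  for (int iter = 0; iter < 100; iter++) {
    // Device touch: pages migrate host -> device.
    scale<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();
    // Host touch: pages migrate device -> host again.
    double sum = 0;
    for (int i = 0; i < n; i++) sum += x[i];
    if (iter == 99) printf("sum = %g\n", sum);
  }
  cudaFree(x);
  return 0;
}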
Thanks so much, I will wait.
Hi Isaaku, can you try the branch "sycl" - I have sped up the Cshift 12x on a V100. Not sure if this will be enough, but you are also running a pretty light quark mass at 75 iterations.
I have sped up the Cshift 12x (on a 16^3 volume) on V100. Since the drop was 50x this is perhaps not a big enough factor yet, but your solver takes only 75 iterations, which suggests it is not a very light quark mass yet.
Sorry - I broke the code, hold off on that
Looks OK; I was doing something wrong. This branch configures with modified commands:
../configure --enable-accelerator=cuda --enable-comms=mpi --enable-simd=GPU CXX=nvcc CXXFLAGS="-ccbin mpicxx -gencode arch=compute_70,code=sm_70 -std=c++11"
Hi Peter, thank you very much, I will try it. Issaku
It runs much faster than before (great!), but the result is wrong: the Hamiltonian is no longer conserved (dH=0.00583017617464066 --> 2125889.73222768). I will pin down and report which part causes the difference.
The stout smearing also has a problem in compiling, for which I needed to modify the code (it seems the latest change in develop, https://github.com/paboyle/Grid/commit/2e652431e5248461d4fecb75b37b0b00fc40e405, has the same problem; I will post an independent issue for this).
Speed up for reference:
rational:
force 467294 ms (smearing 13491 ms, solver 2253 ms) [before]
--> force: 69859.2 ms (smearing 5208.6 ms, solver 2501.9 ms) [new]
quotient:
force 34568.5 ms (smearing 13530 ms, solver 1993 ms) [before]
--> 9281.7 ms (smearing 5079.6 ms, solver ????) [new]
gauge:
force 3463.35 ms [before]
--> 510.315 ms [new]
Hi,
it seems the forces for domainwall/moebius fermions have a problem. I tried tests/core, tests/hmc, and tests/forces, and found that some tests fail (see below). Since I have not configured FFT, most of the failed tests in core are expected.
Attached are the logs for the failed tests; I believe the filenames are self-explanatory.
Here is the list of failed tests. Those which require an input file/configuration are omitted, as they trivially fail due to the lack of input.
tests/forces (due to the assertion)
Test_contfrac_force
Test_mobius_force
Test_partfrac_force
Test_zmobius_force

tests/core (due to the assertion; also fails in develop)
Test_cf_coarsen_support
Test_fft
Test_fft_gfix
Test_fftf
Test_poisson_fft
Test_qed

tests/hmc (due to the assertion; also fails in develop)
Test_hmc_EOWilsonCloverFermionGauge

assert OK, but too large dH, which does not occur in the develop branch (logs are omitted as they are too large)
Test_hmc_EODWFRatio
Test_hmc_EODWFRatio_Gparity
Test_hmc_Mobius2p1
test_core.tar.gz test_force.tar.gz test_hmc.tar.gz grid.configure.summary.txt
Hmm... Summit, V100. SYCL branch.
./Test_contfrac_force
Grid : Message : ================================================
Grid : Message : MPI is initialised and logging filters activated
Grid : Message : ================================================
Grid : Message : Requested 1073741824 byte stencil comms buffers
Grid : Message : MemoryManager Cache 13529146982 bytes
Grid : Message : 0.197209 s : Grid is setup to use 128 threads
Grid : Message : 1.865663 s : s = 0 Beta 4.77389 Aee 10.5026 See 10.5026
Grid : Message : 1.865727 s : s = 1 Beta 2.72314 Aee -5.99092 See -6.08613
Grid : Message : 1.865736 s : s = 2 Beta 1.87546 Aee 4.12601 See 4.29032
Grid : Message : 1.865744 s : s = 3 Beta 1.35352 Aee -2.97773 See -3.21082
Grid : Message : 1.865750 s : s = 4 Beta 0.969697 Aee 2.13333 See 2.44478
Grid : Message : 1.865756 s : s = 5 Beta 0.65625 Aee -1.44375 See -1.85278
Grid : Message : 1.865762 s : s = 6 Beta 0.380952 Aee 0.838095 See 1.37782
Grid : Message : 1.865768 s : s = 7 Beta 0.125 Aee -0.275 See -1.00078
Grid : Message : 1.865775 s : s = 8 Beta 0 Aee 1.0202 See 2.01942
Grid : Message : 4.564774 s : S (1.4102e+07,1.37069e-13)
Grid : Message : 4.564837 s : Sprime (1.4102e+07,5.53464e-13)
Grid : Message : 4.566000 s : dS (1.77475,4.16394e-13)
Grid : Message : 4.566020 s : predict dS (1.77478,-4.05564e-17)
Grid : Message : 4.566035 s : Done
./Test_mobius_force
Grid : Message : ================================================
Grid : Message : MPI is initialised and logging filters activated
Grid : Message : ================================================
Grid : Message : Requested 1073741824 byte stencil comms buffers
Grid : Message : MemoryManager Cache 13529146982 bytes
Grid : Message : 0.197566 s : Grid is setup to use 128 threads
Grid : Message : 1.937984 s : MobiusFermion (b=0.5,c=0.5) with Ls= 8 Tanh approx
Grid : Message : 5.242100 s : -- S (4.94101e+06,-1.31552e-13)
Grid : Message : 5.246500 s : -- Sprime (4.94101e+06,-9.39394e-14)
Grid : Message : 5.247400 s : dS (1.15257,3.76123e-14)
Grid : Message : 5.248200 s : predict dS (1.15246,5.79883e-18)
Grid : Message : 5.248900 s : Done
./Test_partfrac_force
Grid : Message : ================================================
Grid : Message : MPI is initialised and logging filters activated
Grid : Message : ================================================
Grid : Message : Requested 1073741824 byte stencil comms buffers
Grid : Message : MemoryManager Cache 13529146982 bytes
Grid : Message : 0.205088 s : Grid is setup to use 128 threads
Grid : Message : 4.994149 s : S (5.64436e+07,3.09951e-12)
Grid : Message : 4.994240 s : Sprime (5.64436e+07,6.55769e-13)
Grid : Message : 4.994251 s : dS (12.8772,-2.44374e-12)
Grid : Message : 4.994260 s : predict dS (12.8777,-9.49241e-17)
Grid : Message : 4.994269 s : Done
./Test_zmobius_force
Grid : Message : ================================================
Grid : Message : MPI is initialised and logging filters activated
Grid : Message : ================================================
Grid : Message : Requested 1073741824 byte stencil comms buffers
Grid : Message : MemoryManager Cache 13529146982 bytes
Grid : Message : 0.197321 s : Grid is setup to use 128 threads
Grid : Message : 1.831676 s : ZMobiusFermion (b=0.5,c=0.5) with Ls= 8 gamma passed in
Grid : Message : 4.565847 s : S (7.8636e+07,4.5666e-12)
Grid : Message : 4.565940 s : Sprime (7.8636e+07,-8.09063e-14)
Grid : Message : 4.565971 s : dS (47.8762,-4.6475e-12)
Grid : Message : 4.566000 s : predict dS (47.25,-7.27218e-15)
Grid : Message : 4.566030 s : Done
I'll go back and check with more standard compile options because I've been playing with experimental code.
----- GIT VERSION -------------------------------------
commit: 8285e41
branch: sycl
date  : 2020-05-21
----- PLATFORM ----------------------------------------
architecture (build)        : powerpc64le
os (build)                  : linux-gnu
architecture (target)       : powerpc64le
os (target)                 : linux-gnu
compiler vendor             : gnu
compiler version            :
----- BUILD OPTIONS -----------------------------------
SIMD                        : GPU (width= 64)
Threading                   : yes
Acceleration                : cuda
Unified virtual memory      : no
Communications type         : none
Shared memory allocator     : shmopen
Shared memory mmap path     : /var/lib/hugetlbfs/global/pagesize-2MB/
Default precision           : double
Software FP16 conversion    : yes
RNG choice                  : sitmo
GMP                         : yes
LAPACK                      : no
FFTW                        : no
LIME (ILDG support)         : yes
HDF5                        : no
build DOXYGEN documentation : no
----- BUILD FLAGS -------------------------------------
CXXFLAGS: -I/autofs/nccs-svm1_home1/paboyle/sycl/Grid -O3 -ccbin mpicxx -gencode arch=compute_70,code=sm_70 -I/ccs/home/paboyle/prefix/include/ -std=c++11 -Xcompiler -fno-strict-aliasing --expt-extended-lambda --expt-relaxed-constexpr -Xcompiler -fopenmp
LDFLAGS: -L/autofs/nccs-svm1_home1/paboyle/sycl/Grid/build-cuda/Grid -L/ccs/home/paboyle/prefix/lib/
LIBS: -lz -lcrypto -llime -lmpfr -lgmp -lstdc++ -lm -lz
module load gcc/6.4.0
module load cuda/10.1.168
Test_hmc_IwasakiGauge
Grid : Message : 182.640130 s : Total H after trajectory = 118697.576176948 dH = -0.262491766450694
Grid : Message : 182.640230 s : --------------------------------------------------
Grid : Message : 182.640270 s : exp(-dH) = 1.30017 Random = 0.854012
Grid : Message : 182.640330 s : Acc. Probability = 1
Grid : Message : 182.640390 s : Metropolis_test -- ACCEPTED
Will look at the Fermion HMC's now.
Interestingly, this is being tested by me in a NEW mode for Grid WITHOUT relying on unified virtual memory, so it's a big change (--enable-unified=yes|no).
Having HMC run in a mode where there is explicit data motion between device and host memory is a huge step.
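Roughly, the difference between the two modes, as a minimal CUDA sketch of the general idea (not the Grid allocator itself; the axpy_uvm/axpy_explicit names are just for illustration): with UVM a single pointer is valid on both host and device and the driver migrates pages on demand, whereas without UVM there are separate host and device buffers and every hand-off is an explicit copy that the code has to issue in the right place.

#include <cuda_runtime.h>

__global__ void axpy(double a, const double *x, double *y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] += a * x[i];
}

// UVM style: one managed pointer, no explicit copies; the driver moves pages.
void axpy_uvm(double a, double *x, double *y, int n) {
  axpy<<<(n + 255) / 256, 256>>>(a, x, y, n);
  cudaDeviceSynchronize();   // result visible to the host afterwards
}

// Explicit-motion style: host owns hx/hy, device copies are managed by hand.
void axpy_explicit(double a, const double *hx, double *hy, int n) {
  double *dx, *dy;
  cudaMalloc(&dx, n * sizeof(double));
  cudaMalloc(&dy, n * sizeof(double));
  cudaMemcpy(dx, hx, n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(dy, hy, n * sizeof(double), cudaMemcpyHostToDevice);
  axpy<<<(n + 255) / 256, 256>>>(a, dx, dy, n);
  cudaMemcpy(hy, dy, n * sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(dx);
  cudaFree(dy);
}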
For me the hmc_EODWFRatio is working... puzzled. Will look at your logs now.
Hmm..
Test_hmc_EOWilsonCloverFermionGauge: /nfshome/kanamori/work/Grid_tmp/Grid_sycl/Grid/qcd/action/pseudofermion/TwoFlavourEvenOdd.h:171: void Grid::TwoFlavourEvenOddPseudoFermionAction<Impl>::deriv(const GaugeField&, Grid::TwoFlavourEvenOddPseudoFermionAction<Impl>::GaugeField&) [with Impl = Grid::WilsonImpl<Grid::Grid_simd<thrust::complex, Grid::GpuVector<4, Grid::GpuComplex > >, Grid::FundamentalRep<3>, Grid::CoeffReal>; Grid::TwoFlavourEvenOddPseudoFermionAction<Impl>::GaugeField = Grid::Lattice<Grid::iVector<Grid::iScalar<Grid::iMatrix<Grid::Grid_simd<thrust::complex, Grid::GpuVector<4, Grid::GpuComplex > >, 3> >, 4> >]: Assertion `FermOp.ConstEE() == 1' failed.
This is a dumb-ass logic bomb. It should never happen - Clover doesn't have a constant EE portion, so this combination could never have worked. Guido added the Clover; I did Wilson and all the Overlap/DWF/Chiral types. Azusa did the Staggered types.
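For reference, a conceptual sketch of why an assert like that exists (not the actual Grid source; HypotheticalFermOp and two_flavour_eo_force are made-up names): presumably the even-odd two-flavour force derivation assumes the even-even block is gauge-field independent, so the code guards that precondition rather than silently producing a wrong force.

#include <cassert>

// Hypothetical fermion operator exposing a ConstEE() query like the one
// named in the assertion message above.
struct HypotheticalFermOp {
  bool constEE;
  int ConstEE() const { return constEE ? 1 : 0; }
};

// The force routine below only makes sense when the even-even block does not
// depend on the gauge field, so it asserts that instead of returning garbage.
void two_flavour_eo_force(const HypotheticalFermOp &FermOp /*, GaugeField &dSdU */) {
  assert(FermOp.ConstEE() == 1);
  // ... force computation that ignores any d(M_ee)/dU contribution ...
}

int main() {
  HypotheticalFermOp wilson{true};   // constant EE term: fine
  two_flavour_eo_force(wilson);
  HypotheticalFermOp clover{false};  // field-dependent EE term: assert fires
  two_flavour_eo_force(clover);
  return 0;
}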
Do you have the fails on DWF/Mobius ?
Might make sense for us to focus on the Force unit tests first - these pass for me and fail for you
Final dH from ./Test_hmc_EODWFRatio:
Grid : Message : 1852.775293 s : Total H after trajectory = 521482.769108906 dH = -0.00109183281892911
NB... this was on the --enable-unified=no compile - will take a look now at --enable-unified=yes.
I don't expect any differences - UVM is a LOT easier to get right than explicit motion, but I've been changing the code a lot.
Interesting -- on Summit I have a fail of Test_hmc_EODWFRatio under UVM compile, while it worked under non-UVM compile.
Curious.
With an earlier version, https://github.com/paboyle/Grid/commit/07c0c02f8c1d58605150c4729f7d1b3b32416045 (the latest version takes too long to compile, by the way; some object files take almost two hours and the build has not finished yet), this failure only happened with Test_hmc_EOWilsonCloverFermionGauge. With DWF/Mobius, what I encountered are:

failed due to the lack of input files:
Test_hmc_EOMobiusRatio
Test_hmc_EOMobiusRatioManyFlavour

no assertion, but too large dH (much larger than the develop version):

Test_hmc_Mobius2p1f (1st dH)
Grid : Message : 65.428026 s : Total H after trajectory = 661674.243433422 dH = 6824.99434900098

Test_hmc_EODWFRatio (1st dH)
Grid : Message : 38.631159 s : Total H after trajectory = 498863.052664993 dH = 13044.891522149

Test_hmc_EODWFRatio_Gparity (1st dH)
Grid : Message : 218.917450 s : Total H after trajectory = 889003.489302024 dH = 9395.01561373216
Hi Isaaku,
yes - I have a fail with Test_hmc_EODWFRatio. The dH is O(130) in the bad run, and O(0.2) in the good run.
It's clearly associated with the DWF pseudofermion part, in that if I run with the pseudofermion removed, so it drops back to quenched, the dH is 10^-2.
The initial H differs in the 6th decimal place, so it is not necessarily restricted to the force.
They should have been bitwise identical, and I'm setting them up side by side and looking for the first difference. I'm surprised that the non-UVM works and it's the UVM that broke.... quite hard to do!
Hi Peter, thank you so much, nice to hear that you have reproduced the problem.
Good news Isaaku - I have a clear and simple smoking gun now in a force test.
Hi, I now know what the problem was and how to fix it. It will take me the rest of today to get it fixed, committed, and tested. More info later. Peter
Hi Isaaku, I think it is fixed now - please try again. Peter
I had made the hopping term kernel call for a single direction be non-blocking. This affected only the GPU, and specifically only UVM, because the non-UVM path did a blocking transfer before use.... Long war story, but I think it is fixed.
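For what it's worth, a minimal CUDA sketch (not the actual hopping-term code; the fill kernel is just for illustration) of the failure mode described: a kernel launch is asynchronous, so if managed memory is consumed on the host before synchronising, stale data can be observed, whereas an explicit blocking copy, as in the non-UVM path, hides the bug because the copy itself waits for the kernel.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(double *x, int n, double v) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = v;
}

int main() {
  const int n = 1 << 20;
  double *x;
  cudaMallocManaged(&x, n * sizeof(double));
  for (int i = 0; i < n; i++) x[i] = 0.0;

  // Kernel launches are asynchronous: control returns to the host immediately.
  fill<<<(n + 255) / 256, 256>>>(x, n, 1.0);

  // BUG (race): reading managed memory on the host here may still see the old
  // values, because nothing guarantees the kernel has finished.
  // printf("x[0] = %g\n", x[0]);

  // FIX: block until the kernel is done before any host-side use.
  cudaDeviceSynchronize();
  printf("x[0] = %g (after sync)\n", x[0]);

  cudaFree(x);
  return 0;
}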
Test_hmc_EODWFRatio
Total H after trajectory = 485818.345329456 dH = 0.184186611615587
Total H after trajectory = 498539.948841221 dH = 0.0624700901098549
Total H after trajectory = 504627.963376912 dH = -0.0723185333190486
Total H after trajectory = 509220.340263757 dH = 0.0109063376439735
Total H after trajectory = 511986.658981007 dH = 0.0274079806404188
Hi Peter, I confirmed that the forces are all fine with "Unified virtual memory : yes". I will continue with checking HMC. Thank you very much!
I confirmed that the hmc tests are fine, and my original HMC (2+1 flavor Mobius with stout smearing) also gives the correct result. Thanks so much!
The performance (with 16 threads on the host) is now between the very first version (very slow) and the middle one (faster but wrong result), and it is actually 80% slower than using the host alone. I suspect that there might be some leftover debugging code. I would appreciate any comments/suggestions for further acceleration.
rational:
force 467294 ms (smearing 13491 ms, solver 2226+3891+2253 ms) [initial]
--> force: 69859.2 ms (smearing 5208.6 ms, solver 2138+3508+2502 ms) [before]
--> force: 75441.3 ms (smearing 6202.95 ms, solver 3175+5647+3251 ms) [new]
[in the previous post, I quoted only the third one for the solvers]

quotient:
force 34568.5 ms (smearing 13530 ms, solver 1993 ms) [initial]
--> 9281.7 ms (smearing 5079.6 ms, solver ????) [before]
--> 12474 ms (smearing 6019.26 ms, solver ????) [new]

gauge:
force 3463.35 ms [initial]
--> 510.315 ms [before]
--> 1125.46 ms [new]

grid.configure.summary.txt