Analytic reconstruction of eigenvalues when using persistent deflation subspace methods
When shifting the mass/mu in the inverter routines that employ deflation, one can reuse the eigenvectors, but the eigenvalues are shifted. At present these new eigenvalues are recomputed explicitly, even though they could be reconstructed analytically because the shift is purely diagonal.
When working with very large O(1024) deflation spaces but only a few solves per mass, this explicit recomputation can be a significant performance loss. We should implement routines that exploit the diagonal shift and recompute the eigenvalues analytically.
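For concreteness, here is the relation I have in mind, sketched for the twisted-mass case where the deflation space is built from the normal operator (the even-odd preconditioned operator presumably needs a little more care, since there mu also enters through the inverse of the clover term):

Using \gamma_5-hermiticity, D_W^\dagger = \gamma_5 D_W \gamma_5, the normal operator of M(\mu) = D_W + i\mu\gamma_5 satisfies

    M(\mu)^\dagger M(\mu) = D_W^\dagger D_W + \mu^2,

so its eigenvectors do not depend on \mu at all and only the eigenvalues pick up the diagonal shift:

    \lambda_i(\mu') = \lambda_i(\mu) + \mu'^2 - \mu^2.

More generally, for any purely diagonal shift A \to A + \sigma\,\mathbb{1} the eigenvectors are unchanged and \lambda_i \to \lambda_i + \sigma, which is all that the analytic reconstruction needs to exploit.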
I can confirm that when running on PizDaint, for example, the MG setup update with coarse grid deflation is very costly. So much so that for certain lattice sizes, "normal" MG (with coarse grid mu factor for twisted mass) is cheaper overall, despite the higher cost per RHS.
I am a bit surprised by this: when you introduced the persistent deflation subspace, my tests indicated that the cost of the update had increased by no more than a factor of two or so. I was, however, testing without inter-node communication on our local cluster here in Bonn. At the time our contingent on PizDaint was depleted and I could not perform any tests there, otherwise I would have seen this back then.
In fact, I have performed production runs on our cluster (48c96 lattice, 8xP100 on a single node) and the setup update (using a 1024-vector deflation subspace) takes 7 seconds, which is totally acceptable given that the solver is about a factor of 3 faster as a result.
On the other hand, when running on PizDaint (32 nodes, 64c128 lattice), I observe MG update times of around 400 seconds with the same subspace size. Could it be that part of the update is simply very strongly limited by some collectives which, for some reason, run particularly poorly? I've done tests with various develop HEAD commits: the test on our cluster was with 28287a7, while my recent tests on PizDaint were with 53e85c52 (although I saw similar behaviour with commits from December).
Can you check if this issue persists when using the unit test multigrid_evolve_test? @cpviolator can give you the relevant parameters. If this reproduces the issue, then we can take a look directly without needing your exact workflow.
Trying right now to see if CMake manages to download all the USQCD dependencies to build the tests for me. I'll test with multigrid_evolve_test on PizDaint. I have the following sample command from when the feature was introduced:
./multigrid_evolve_test --solve-type direct-pc --sdim 8 --tdim 16 --mg-eig 1 true --mg-eig-nEv 1 24 --mg-eig-nKr 1 48 --mg-nvec 1 24 --verbosity verbose --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-eig-preserve-deflation true --mg-levels 2 --mg-block-size 0 2 2 2 2 --mass -0.1 --dslash-type twisted-clover --mu 0.1
which I'll try to adjust to a more realistic setup (I seem to remember that I've done this before, at least locally).
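Schematically, I'm aiming for something like the following for the 64c128 case on 32 ranks. This is untested: it assumes --sdim/--tdim are per-rank extents and that --gridsize sets the process grid, as I remember from the other tests, and the exact level bookkeeping for --mg-nvec / --mg-eig-nEv on the coarsest level still needs to be checked, so treat the numbers as placeholders:

./multigrid_evolve_test --solve-type direct-pc --dslash-type twisted-clover \
  --gridsize 2 2 2 4 --sdim 32 --tdim 32 \
  --mass -0.413881 --mu 0.00072 \
  --mg-levels 3 --mg-block-size 0 4 4 4 4 --mg-block-size 1 2 2 2 2 \
  --mg-eig 2 true --mg-eig-nEv 2 1024 --mg-eig-nKr 2 2048 --mg-nvec 2 1024 \
  --mg-eig-preserve-deflation true \
  --verbosity verbose --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose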
@kostrzewa I find that USQCD download on summit can be finicky. Here's a script to do it automatically:
#!/bin/bash
INSTALL_DIR=/ccs/home/howarth/IO_STACK/install
mkdir -p $INSTALL_DIR
#Clone the QMP repo
git clone https://github.com/usqcd-software/qmp.git
#configure, make, install
(cd qmp; aclocal; autoreconf -i -f -v; ./configure --with-qmp-comms-type=MPI CFLAGS="-O2 -std=c99 -funroll-all-loops -fopenmp -D_REENTRANT" CC=mpicc CXX=mpicxx --prefix=$INSTALL_DIR; make -j 8; make install)
#Clone the QIO repo
git clone https://github.com/usqcd-software/qio.git
#configure, submodules, make, install
(cd qio; git submodule update --init --recursive; aclocal; autoreconf -i -f -v; ./configure --with-qmp=$INSTALL_DIR --enable-largefile CC=mpicc CXX=mpicxx CFLAGS="-O2 -std=c99 -funroll-all-loops -fopenmp -D_REENTRANT" --prefix=$INSTALL_DIR; make -j 8; make install)
I'll also cook up a command that sees this large delay at L=64 scale.
Thanks, CMake seems to have managed, though. It seems like I need to enable QUDA_GAUGE_ALG, however. So off to another compile cycle.
@kostrzewa I think that QUDA_GAUGE_ALG issue is fixed in latest develop (with many more compilation improvements).
Definitely latest develop head commit:
$ git pull origin develop
From https://github.com/lattice/quda
* branch develop -> FETCH_HEAD
Already up to date.
QUDA_GAUGE_ALG is also an explicit prerequisite for building multigrid_evolve_test (understandably so) if tests/CMakeLists.txt is to be believed.
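For reference, the configure step I'm converging on looks roughly like the following (option names as I read them off the top-level CMakeLists.txt, so double-check against your checkout; sm_60 and the source path are of course machine-specific):

cmake /path/to/quda \
  -DQUDA_GPU_ARCH=sm_60 \
  -DQUDA_DIRAC_TWISTED_CLOVER=ON \
  -DQUDA_MULTIGRID=ON \
  -DQUDA_GAUGE_ALG=ON \
  -DQUDA_QMP=ON -DQUDA_QIO=ON \
  -DQUDA_DOWNLOAD_USQCD=ON \
  -DQUDA_BUILD_ALL_TESTS=ON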
Got the tests spun up, kudos to @mathiaswagner and @maddyscientist for the CMake magic that makes these things so comfortable now. Will test a 64c128 tomorrow.
Just an update: I would have reported back already, but things have turned out to be less plain sailing than they initially looked. I'm hitting issues linking against cuFFT even though the stub is present... I'll see what I can do.
I don't know what set of circumstances led to the tests linking properly when I said that I had gotten them compiled in https://github.com/lattice/quda/issues/941#issuecomment-575360064. It seems that I need to do something as proposed in #957 to get the gauge-related tests to link properly on PizDaint.
Alright, with the compilation issues worked around, the next problem is NaNs in the plaquette and topological charge (whether I start with a random gauge field or read in the ETMC 64c128 physical-point conf.0500 which you've used for testing in the past):
======================================================
Running MG gauge evolution test at constant quark mass
======================================================
step=0 plaquette = -nan topological charge = -nan, mass = -0.413881 kappa = 0.139427, mu = 0.00072
Creating new clover field
Source: CPU = 2.68444e+08, CUDA copy = 2.68444e+08
Solution: CPU = 0, CUDA copy = 0
Prepared source = -nan
Prepared solution = 0
Prepared source post mass rescale = -nan
Creating a BICGSTABL solver
BiCGstab-4: 0 iterations, <r,r> = -nan, |r|/|b| = -nan
This happens also on a single node with a 16c32 lattice, so I'm at a bit of a loss where to go from here...
Note that multigrid_invert_test works fine with the same parameters (both with and without gauge I/O).
Disabling GPU-Direct RDMA access
Disabling peer-to-peer access
Rank order is row major (x running fastest)
running the following test:
prec sloppy_prec link_recon sloppy_link_recon S_dimension T_dimension Ls_dimension
double single 12 8 64/32/32 16 16
MG parameters
- number of levels 3
- level 1 number of null-space vectors 24
- level 1 number of pre-smoother applications 0
- level 1 number of post-smoother applications 4
- level 2 number of null-space vectors 24
- level 2 number of pre-smoother applications 0
- level 2 number of post-smoother applications 4
Outer solver paramers
- pipeline = 8
Eigensolver parameters
Grid partition info: X Y Z T
0 1 1 1
set_layout layout set for 32 nodes
open_test_input: QIO_open_read done.
open_test_input: User file info is "<?xml version="1.0" encoding="UTF-8"?>
<NonSciDACFile/>
"
read_gauge_field: reading su3 field
read_field: QIO_read_record_data returns status 0
DML_partition_in times: read 11.98 send 0.00 total 13.83
read_field: QIO_read_record_data returns status 0
read_gauge_field: Closed file for reading
QUDA 1.0.0 (git v0.9.0-2436-g53e85c521-dirty-sm_60)
CUDA Driver version = 10010
CUDA Runtime version = 10010
Found device 0: Tesla P100-PCIE-16GB
Using device 0: Tesla P100-PCIE-16GB
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Not using device memory pool allocator
WARNING: Using pinned memory pool allocator
Loaded 770 sets of cached parameters from /users/bartek/local/quda_resources/PizDaint-build_test-quda_develop-dynamic_clover-with_tests-with_qio-53e85c521f11d3a94166b951e14bf8640540ec24-sm_60_gdr0_p2p0/tunecache.tsv
Computed plaquette is 5.542798e-01 (spatial = 5.542734e-01, temporal = 5.542862e-01)
Creating new clover field
Creating vector of null space fields of length 24
MG level 1 (GPU): WARNING: Exceeded maximum iterations 1500
MG level 1 (GPU): CG: Convergence at 1500 iterations, L2 relative residual: iterated = 1.564395e-06, true = 1.714657e-06 (requested = 5.000000e-07)
[...]
@cpviolator I need to provide some more info on the comment above (https://github.com/lattice/quda/issues/941#issuecomment-575206858). I have just realized that I have in fact run on PizDaint on 80 nodes with an 80c160 lattice employing coarse-grid deflation and got reasonable times for the MG setup update (only about a factor of 2 to 2.5 longer than with regular MG) using commit 3fa55816a. This overhead was more than made up for by the ~3x speedup of the inversions, even with a suboptimal number of setup updates.
I will need to test some more with a current develop head commit to see if there is perhaps a regression, originating either in QUDA or in its interface, which triggers a costlier setup update than I had in the past, as I remember more recent tests giving very poor timings. For this, it would be immensely helpful to get multigrid_evolve_test running without NaNs. I will try to open an issue to diagnose that specifically if I find some time...