quda
QUDA is a library for performing calculations in lattice QCD on GPUs.
When compiling with the flag `-DQUDA_GPU_ARCH=sm_75` (Turing architecture), warnings like the following appear: `ptxas warning : Value of threads per SM for entry ... is out of range. .minnctapersm...
Add an `instantiate` item for `copy_gauge_field` and `copy_gauge_field_offset` for the gauge orders, etc. One tricky thing is that with the lists in `instantiate.h` it becomes hard to know which file...
Instead of using a traditional implementation of classical or modified Gram-Schmidt (or a hybrid thereof), (block-)orthonormalization can be formulated as a thin QR, which is implemented in practice via a...
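The sentence above is cut off before it names the method, so the following is only an illustration of one well-known way to obtain a thin QR for block orthonormalization, namely CholeskyQR. It is an assumption that this matches what the issue has in mind. The sketch uses real `double` entries and a hypothetical helper `cholesky_qr` rather than anything from QUDA, and it assumes the input block has full column rank.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Column-major storage: element (i, j) of an m x n matrix is a[i + j * m].
using Matrix = std::vector<double>;

// Hypothetical helper (not a QUDA routine): orthonormalize the n columns of
// V (m x n, full column rank assumed) in place via CholeskyQR and return the
// n x n upper-triangular factor R with V_in = Q * R.
Matrix cholesky_qr(Matrix &V, std::size_t m, std::size_t n) {
  // Step 1: Gram matrix G = V^T V. On a parallel machine this is the only
  // step that needs a global reduction.
  Matrix G(n * n, 0.0);
  for (std::size_t j = 0; j < n; ++j)
    for (std::size_t k = 0; k < n; ++k) {
      double sum = 0.0;
      for (std::size_t i = 0; i < m; ++i) sum += V[i + j * m] * V[i + k * m];
      G[j + k * n] = sum;
    }

  // Step 2: Cholesky factorization G = R^T R with R upper triangular.
  Matrix R(n * n, 0.0);
  for (std::size_t j = 0; j < n; ++j)
    for (std::size_t i = 0; i <= j; ++i) {
      double s = G[i + j * n];
      for (std::size_t k = 0; k < i; ++k) s -= R[k + i * n] * R[k + j * n];
      R[i + j * n] = (i == j) ? std::sqrt(s) : s / R[i + i * n];
    }

  // Step 3: Q = V * R^{-1}, done column by column so that earlier columns
  // already hold their orthonormalized values.
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t i = 0; i < j; ++i)
      for (std::size_t r = 0; r < m; ++r) V[r + j * m] -= V[r + i * m] * R[i + j * n];
    for (std::size_t r = 0; r < m; ++r) V[r + j * m] /= R[j + j * n];
  }
  return R;
}
```

The appeal of this formulation is that all inner products are gathered into a single Gram matrix, so one reduction replaces the many sequential reductions of Gram-Schmidt. The known trade-off is that orthogonality degrades as the block becomes ill-conditioned, which is why a second pass is often applied in practice.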
* Modify dirac_[improved_]staggered.cpp to use the full operator for calling `MdagM` as opposed to separate even/odd parts. In theory this does the right thing: ``` Dslash(*tmp1, in, QUDA_INVALID_PARITY); DslashXpay(out, *tmp1,...
Reduction abstraction is presently broken for non-summation reductions. While the abstracted launch can be passed different reducers for the kernel, the MPI reduction presently assumes that summation is being performed....
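To make the failure mode concrete, here is a minimal host-only sketch in which each "rank" is just a vector of values. The reducer types and function names are hypothetical and are not QUDA's API; the point is only that the inter-rank step has to apply the same operation as the local reduction, and that hard-coding a sum there gives a wrong answer for something like a maximum.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical reducer types (not QUDA's API): each bundles the identity
// element with the binary combine operation.
struct SumReducer {
  static double init() { return 0.0; }
  static double apply(double a, double b) { return a + b; }
};

struct MaxReducer {
  static double init() { return -std::numeric_limits<double>::infinity(); }
  static double apply(double a, double b) { return std::max(a, b); }
};

using Ranks = std::vector<std::vector<double>>;

// Stand-in for the reduction kernel run by a single rank.
template <typename Reducer> double local_reduce(const std::vector<double> &data) {
  double r = Reducer::init();
  for (double x : data) r = Reducer::apply(r, x);
  return r;
}

// Inter-rank step that combines the partial results with the SAME reducer.
template <typename Reducer> double global_reduce(const Ranks &ranks) {
  double r = Reducer::init();
  for (const auto &rank : ranks) r = Reducer::apply(r, local_reduce<Reducer>(rank));
  return r;
}

// Inter-rank step that always sums the partial results, regardless of the
// reducer used locally. This mirrors the behaviour described above.
template <typename Reducer> double global_reduce_assuming_sum(const Ranks &ranks) {
  double r = 0.0;
  for (const auto &rank : ranks) r += local_reduce<Reducer>(rank);
  return r;
}
```

With two ranks holding `{1, 5, 2}` and `{4, 3}`, the reducer-aware version returns 5 for `MaxReducer`, while the sum-assuming version adds the per-rank maxima and returns 9. For `SumReducer` the two versions agree, which is why the problem only shows up for non-summation reductions.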
The rocm-devel branch (2f3b43a), built with ROCm 3.9.0, fails when running hisq-stencil_test: ERROR: Error in unitarization component of the hisq fattening: 484 failures
Now that we have quarter precision deflation fixed on POWER9, it is possible to compute a deflation space in single precision and then deflate in half or quarter precision....
The routine `computeCoarseClover` (https://github.com/lattice/quda/blob/develop/include/kernels/coarse_op_kernel.cuh#L1014) does not exploit a huge amount of parallelism as implemented, which turns into a bit of a nightmare when autotuning and could be a blocker in...
Double, recon 12 sees a boost. Half, recon 8 sees a regression. I don't have an apples-to-apples comparison for single (different recons), but they're included for posterity. ### With dynamic...