quda issues

QUDA incompatibility with RTX 20xx series graphics cards

1

Compiling with the flag -DQUDA_GPU_ARCH=sm_75 for the Turing architecture produces warnings such as: `ptxas warning : Value of threads per SM for entry ... is out of range. .minnctapersm...

ghost

Add an `instantiate` item for copy gauge field and copy gauge field offset, etc

Add an `instantiate` item for `copy_gauge_field` and `copy_gauge_field_offset` for the gauge orders, etc. One tricky thing is that, with the lists in `instantiate.h`, it becomes hard to know which file...

hummingtree

clean-up

Implement STRIPED support for packing with NVSHMEM

mathiaswagner

optimization

Explore using thin-QR (Cholesky decompositions) for Gram-Schmidt

Instead of using a traditional implementation of classical or modified Gram-Schmidt (or a hybrid thereof), (block-)orthonormalization can be formulated as a thin QR, which is implemented in practice via a...
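The idea named in the title can be sketched as follows. This is a minimal, real-valued CholeskyQR on the host, not QUDA code: the names `Vec`, `Block`, `Dense`, `dot`, and `cholesky_qr` are hypothetical, and a production version would work on complex fields and batch the inner products into one reduction. Note that plain CholeskyQR squares the condition number of the block when forming the Gram matrix, so in practice it is often applied twice or combined with another scheme.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Vec = std::vector<double>;
using Block = std::vector<Vec>; // Block[j] is the j-th vector of the block
using Dense = std::vector<Vec>; // Dense[i][j] is row i, column j

// Inner product of two equally sized vectors.
double dot(const Vec &a, const Vec &b)
{
  double s = 0.0;
  for (std::size_t n = 0; n < a.size(); n++) s += a[n] * b[n];
  return s;
}

// Orthonormalize the vectors in V in place via CholeskyQR (hypothetical helper):
//   1. G = V^T V    Gram matrix: all inner products computed up front
//   2. G = R^T R    Cholesky factorization, R upper triangular
//   3. Q = V R^{-1} triangular solve, no further inner products needed
// Returns R such that (original V) = Q R.
Dense cholesky_qr(Block &V)
{
  const std::size_t k = V.size();

  // Step 1: Gram matrix.
  Dense G(k, Vec(k, 0.0));
  for (std::size_t i = 0; i < k; i++)
    for (std::size_t j = 0; j < k; j++) G[i][j] = dot(V[i], V[j]);

  // Step 2: Cholesky factorization of the small k x k matrix.
  Dense R(k, Vec(k, 0.0));
  for (std::size_t j = 0; j < k; j++) {
    for (std::size_t i = 0; i <= j; i++) {
      double s = G[i][j];
      for (std::size_t l = 0; l < i; l++) s -= R[l][i] * R[l][j];
      if (i == j) {
        if (s <= 0.0) throw std::runtime_error("Gram matrix is not positive definite");
        R[i][i] = std::sqrt(s);
      } else {
        R[i][j] = s / R[i][i];
      }
    }
  }

  // Step 3: solve Q R = V column by column; earlier V[i] already hold Q_i.
  for (std::size_t j = 0; j < k; j++) {
    for (std::size_t i = 0; i < j; i++)
      for (std::size_t n = 0; n < V[j].size(); n++) V[j][n] -= R[i][j] * V[i][n];
    for (std::size_t n = 0; n < V[j].size(); n++) V[j][n] /= R[j][j];
  }
  return R;
}
```

Calling `cholesky_qr(V)` on a block of linearly independent vectors leaves an orthonormal set in `V`. The potential appeal on distributed GPUs is that all inner products are formed in step 1, so they can share a single global reduction rather than one per Gram-Schmidt step.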

weinbe2

feature

optimization

Staggered fused operator feature request

1

* Modify dirac_[improved_]staggered.cpp to use the full operator for calling `MdagM` as opposed to separate even/odd parts. In theory this does the right thing: ``` Dslash(*tmp1, in, QUDA_INVALID_PARITY); DslashXpay(out, *tmp1,...

weinbe2

feature

optimization

Reduction abstraction needs to be applied to choice of MPI reducer in `TunableReduction`

Reduction abstraction is presently broken for non-summation reductions. While the abstracted launch can be passed different reducers for the kernel, the MPI reduction presently assumes that summation is being performed....
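The failure mode described above can be illustrated with a small host-only sketch. It does not use QUDA's `TunableReduction` or MPI: the reducer structs and the functions `local_reduce`, `global_reduce`, and `sum_only_global_reduce` are hypothetical stand-ins, and each inner vector plays the role of one rank's data. In a real MPI code the equivalent fix would presumably be to select the operation passed to `MPI_Allreduce` (for example `MPI_MAX` or `MPI_MIN` instead of `MPI_SUM`) from the reducer type.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical reducers: each supplies its identity element and combine step.
template <typename T> struct plus_reducer {
  static T init() { return T(0); }
  static T apply(T a, T b) { return a + b; }
};

template <typename T> struct max_reducer {
  static T init() { return std::numeric_limits<T>::lowest(); }
  static T apply(T a, T b) { return std::max(a, b); }
};

// Stand-in for the on-device reduction performed by one rank.
template <typename Reducer, typename T> T local_reduce(const std::vector<T> &data)
{
  T r = Reducer::init();
  for (const T &x : data) r = Reducer::apply(r, x);
  return r;
}

// Consistent version: the inter-rank step uses the same reducer as the local step.
template <typename Reducer, typename T> T global_reduce(const std::vector<std::vector<T>> &per_rank)
{
  T r = Reducer::init();
  for (const auto &rank_data : per_rank) r = Reducer::apply(r, local_reduce<Reducer>(rank_data));
  return r;
}

// Inconsistent version: the local step honors the reducer, but the inter-rank
// step always sums, which is only correct when Reducer is a summation.
template <typename Reducer, typename T> T sum_only_global_reduce(const std::vector<std::vector<T>> &per_rank)
{
  T r = T(0);
  for (const auto &rank_data : per_rank) r += local_reduce<Reducer>(rank_data);
  return r;
}
```

With two simulated ranks holding `{1, 5}` and `{3, 2}`, `global_reduce<max_reducer<double>>` returns the true maximum, while `sum_only_global_reduce<max_reducer<double>>` adds the two per-rank maxima and returns a value that appears nowhere in the data.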

maddyscientist

bug

rocm-devel branch: hisq-stencil_test failure

The rocm-devel branch (2f3b43a), built with ROCm 3.9.0, fails when running hisq-stencil_test: ERROR: Error in unitarization component of the hisq fattening: 484 failures

yaomingamd

Target_HIP

Fixed Precision (half, quarter) IO for the eigensolver

Now that we have quarter precision deflation fixed on POWER9, it is possible to compute a deflation space in single precision and then deflate in half or quarter precision....

cpviolator

Add fine-grained parallelism + matrix tiling to computeCoarseClover

1

The routine `computeCoarseClover` (https://github.com/lattice/quda/blob/develop/include/kernels/coarse_op_kernel.cuh#L1014) does not exploit a huge amount of parallelism as implemented, which turns into a bit of a nightmare when autotuning and could be a blocker in...

weinbe2

clean-up

optimization

sm_80: preconditioned twisted clover w/dynamic clover is slower in half precision, recon 8 than w/out dynamic clover

Double, recon 12 sees a boost. Half, recon 8 sees a regression. I don't have an apples-to-apples comparison for single (different recons), but they're included for posterity. ### With dynamic...

weinbe2

optimization
