qmcpack Reduced basis not found in allowed number of iterations

Describe the bug: QMCPACK terminates abruptly 1-2 seconds after it starts, producing the following error in the error output: Reduced basis not found in allowed number of iterations. Check unit cell or contact a developer. The calculations were submitted using Nexus, and the xsf structure file produced by Nexus looks fine.

To Reproduce: Using the QMCPACK GitHub version (last git commit date: Mon Apr 29 18:46:51 2024 -0400), run with the attached input files. The wavefunction is not included; if you need it, please let me know a suitable location to copy it to.

Expected behavior: QMCPACK should recognize this as a valid structure.

System:

Cades, ORNL
module purge; source $MODULESHOME/init/bash; module load PE-intel/3.0; module swap intel intel/2021.1; module load intel/2021.1; module swap openmpi openmpi/4.1.0; module load gcc/10.2.0; module load python ;module load fftw/3.3.5; module load boost/1.70.0; module load libxml2/2.9.9; module list; LD_LIBRARY_PATH=/software/tools/compilers/intel_2021/mkl/2021.1.1/lib/intel64:$LD_LIBRARY_PATH
other systems where this is reproducible: None

Additional context: files.tar.gz

May 07 '24 17:05 kayahans

Some comments, background:

I notice your cell is particularly "tall"

   a        b        c       alpha    beta     gamma
 4.26837  4.26837 27.00996  99.0925  99.0925  60.0000

and the error is from src/Particle/Lattice/LatticeAnalyzer.h

template<typename T>
inline void find_reduced_basis(TinyVector<TinyVector<T, 3>, 3>& rb)
{
  int maxIter = 10000;

  for (int count = 0; count < maxIter; count++)
  {
    TinyVector<TinyVector<T, 3>, 3> saved(rb);
    bool changed = false;
    for (int i = 0; i < 3; ++i)
    {
      rb[i]   = 0.0;
      changed = found_shorter_base(rb);
      rb[i]   = saved[i];
      if (changed)
        break;
    }
    if (!changed && !found_shorter_base(rb))
      return;
  }

  throw std::runtime_error("Reduced basis not found in allowed number of iterations. "
                           "Check unit cell or contact a developer.");
}

The algorithm being used is a bit strange and will need looking at. The failure occurs during initialization, well before any Monte Carlo. Presumably found_shorter_base is either malfunctioning or inefficient for this cell. Its implementation has several numerical tolerance thresholds in it.
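
To make the tolerance point concrete, here is a minimal, self-contained sketch. It is not QMCPACK's find_reduced_basis or found_shorter_base. The names cell_to_lattice, shorten_pairwise and reduce_pairwise, the rel_tol parameter, and the convention of a along x with b in the xy plane are all assumptions made for illustration. It builds lattice vectors from the cell parameters quoted above (c/a is roughly 6.3) and replaces a vector by v_i ± v_j only when the squared length drops by more than a relative margin. Every accepted step is then a strict decrease, so candidates of nearly equal length cannot make the loop cycle. A pairwise scheme like this is not guaranteed to give a fully reduced basis.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <initializer_list>

using Vec3    = std::array<double, 3>;
using Lattice = std::array<Vec3, 3>;

inline double dot(const Vec3& u, const Vec3& v) { return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]; }

// Signed cell volume (triple product); unchanged when one row is added to another.
inline double det(const Lattice& L)
{
  return L[0][0] * (L[1][1] * L[2][2] - L[1][2] * L[2][1]) -
         L[0][1] * (L[1][0] * L[2][2] - L[1][2] * L[2][0]) +
         L[0][2] * (L[1][0] * L[2][1] - L[1][1] * L[2][0]);
}

// Assumed convention: a along x, b in the xy plane; angles in degrees.
inline Lattice cell_to_lattice(double a, double b, double c, double alpha, double beta, double gamma)
{
  const double deg = std::acos(-1.0) / 180.0;
  const double ca = std::cos(alpha * deg), cb = std::cos(beta * deg);
  const double cg = std::cos(gamma * deg), sg = std::sin(gamma * deg);
  assert(sg > 0.0); // gamma must lie strictly between 0 and 180 degrees
  const double cy  = (ca - cb * cg) / sg;
  const double cz2 = 1.0 - cb * cb - cy * cy;
  assert(cz2 > 0.0); // otherwise the angles do not describe a 3D cell
  Lattice L{};
  L[0] = {a, 0.0, 0.0};
  L[1] = {b * cg, b * sg, 0.0};
  L[2] = {c * cb, c * cy, c * std::sqrt(cz2)};
  return L;
}

// Hypothetical helper: apply the first step rb[i] -> rb[i] +/- rb[j] that
// shortens rb[i] by more than the relative margin rel_tol. Returns true if
// a step was applied, false if no pair offers such an improvement.
inline bool shorten_pairwise(Lattice& rb, double rel_tol)
{
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
    {
      if (i == j)
        continue;
      for (double sign : {1.0, -1.0})
      {
        Vec3 cand;
        for (int k = 0; k < 3; ++k)
          cand[k] = rb[i][k] + sign * rb[j][k];
        // Near-ties (equal length up to rounding) are rejected here.
        if (dot(cand, cand) < (1.0 - rel_tol) * dot(rb[i], rb[i]))
        {
          rb[i] = cand;
          return true;
        }
      }
    }
  return false;
}

// Iterate until no pairwise step helps. Returns false if max_iter is reached.
inline bool reduce_pairwise(Lattice& rb, int max_iter = 10000, double rel_tol = 1e-10)
{
  for (int count = 0; count < max_iter; ++count)
    if (!shorten_pairwise(rb, rel_tol))
      return true;
  return false;
}
```

Calling cell_to_lattice(4.26837, 4.26837, 27.00996, 99.0925, 99.0925, 60.0) and then reduce_pairwise on the result is one way to check this cell without the wavefunction file. Comparing det before and after confirms the cell volume is preserved.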

May 09 '24 17:05 prckent

This goes wrong after first use of the wavefunction. Can you please put the pwscf.pwscf.h5 in (say) the global shared on OLCF?

May 11 '24 21:05 prckent

@prckent Thank you Paul for following up. All the files have been copied to /lustre/orion/mat151/world-shared/ksu/github_4975 on Frontier.

May 16 '24 20:05 kayahans

Tried a GCC 13.2 CPU build on nitrogen2 (RHEL9.3) and was not able to reproduce the problem. Will try CADES directly. Possibly there is a compiler or numerical tolerance issue.

Also, 138GiB wavefunction file!

Edit: Also tried Ubuntu 22.04, gcc 11.4 and clang 14.

May 17 '24 19:05 prckent

Please send me your build script or upload here. The one for CADES is clearly well out of date and there is not a new enough cmake available system-wide to build the development version.

May 17 '24 20:05 prckent