Reduced basis not found in allowed number of iterations
Describe the bug
QMCPACK terminates abruptly 1-2 seconds after starting and prints the following error to the error output:
Reduced basis not found in allowed number of iterations. Check unit cell or contact a developer.
Calculations were submitted using Nexus; the xsf structure file it produced looks fine.
To Reproduce
Using the QMCPACK GitHub version (last git commit date: Mon Apr 29 18:46:51 2024 -0400), run with the attached input files. The wavefunction is not attached; please let me know a suitable location to copy it to if you need it.
Expected behavior
QMCPACK should recognize this as a valid structure.
System:
- CADES, ORNL
- module purge; source $MODULESHOME/init/bash; module load PE-intel/3.0; module swap intel intel/2021.1; module load intel/2021.1; module swap openmpi openmpi/4.1.0; module load gcc/10.2.0; module load python ;module load fftw/3.3.5; module load boost/1.70.0; module load libxml2/2.9.9; module list; LD_LIBRARY_PATH=/software/tools/compilers/intel_2021/mkl/2021.1.1/lib/intel64:$LD_LIBRARY_PATH
- other systems where this is reproducible: None
Additional context
files.tar.gz
Some comments, background:
I notice your cell is particularly "tall"
a        b        c         alpha    beta     gamma
4.26837  4.26837  27.00996  99.0925  99.0925  60.0000
and the error is from src/Particle/Lattice/LatticeAnalyzer.h
template<typename T>
inline void find_reduced_basis(TinyVector<TinyVector<T, 3>, 3>& rb)
{
  int maxIter = 10000;
  for (int count = 0; count < maxIter; count++)
  {
    TinyVector<TinyVector<T, 3>, 3> saved(rb);
    bool changed = false;
    for (int i = 0; i < 3; ++i)
    {
      rb[i] = 0.0;
      changed = found_shorter_base(rb);
      rb[i] = saved[i];
      if (changed)
        break;
    }
    if (!changed && !found_shorter_base(rb))
      return;
  }
  throw std::runtime_error("Reduced basis not found in allowed number of iterations. "
                           "Check unit cell or contact a developer.");
}
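To make the "tall" cell concrete, the six cell parameters quoted above can be turned into Cartesian lattice vectors. This is a minimal sketch and not QMCPACK code: `Vec3`, `dot`, `norm` and `cell_to_vectors` are hypothetical names. It assumes the angles are in degrees and uses the common convention of the first vector along x and the second in the xy-plane (alpha between b and c, beta between a and c, gamma between a and b). Whether these are exactly the vectors that reach `find_reduced_basis` as `rb` is also an assumption.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

inline double dot(const Vec3& u, const Vec3& v) { return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]; }
inline double norm(const Vec3& v) { return std::sqrt(dot(v, v)); }

// Hypothetical helper: Cartesian lattice vectors from cell lengths and angles.
// Assumed convention: angles in degrees, a1 along x, a2 in the xy-plane.
inline std::array<Vec3, 3> cell_to_vectors(double a, double b, double c,
                                           double alpha_deg, double beta_deg, double gamma_deg)
{
  const double d2r = std::acos(-1.0) / 180.0;
  const double ca  = std::cos(alpha_deg * d2r);
  const double cb  = std::cos(beta_deg * d2r);
  const double cg  = std::cos(gamma_deg * d2r);
  const double sg  = std::sin(gamma_deg * d2r);

  // Components of the unit vector along a3, fixed by the three angles.
  const double cy       = (ca - cb * cg) / sg;
  const double radicand = 1.0 - cb * cb - cy * cy;
  assert(radicand > 0.0); // otherwise the angles do not describe a valid 3D cell
  const double cz = std::sqrt(radicand);

  return {{{a, 0.0, 0.0}, {b * cg, b * sg, 0.0}, {c * cb, c * cy, c * cz}}};
}
```

Calling `cell_to_vectors(4.26837, 4.26837, 27.00996, 99.0925, 99.0925, 60.0)` gives a third vector roughly 6.3 times longer than the other two, which is simply the ratio c/a of the quoted lengths.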
The algorithm being used is a bit strange and will need looking at. The failure occurs during initialization, well before any Monte Carlo. Presumably found_shorter_base is either malfunctioning or inefficient. The implementation has several numerical tolerance thresholds in it.
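For comparison while debugging, here is a minimal sketch of a pairwise size reduction with an explicit relative tolerance and a sweep cap. It is not the algorithm in LatticeAnalyzer.h and does not reproduce `found_shorter_base`; `Basis`, `pairwise_reduce`, `max_sweeps` and `rel_tol` are hypothetical names and values. It only reduces one vector against another, so it is weaker than a full 3D reduction. What it illustrates is one way a tolerance can matter: a candidate is accepted only if it is shorter by more than `rel_tol`, so equal-length candidates (which exist when a = b and gamma = 60 degrees) cannot swap back and forth until the cap is hit. Whether anything like that happens in QMCPACK here is not established.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3  = std::array<double, 3>;
using Basis = std::array<Vec3, 3>;

inline double dot(const Vec3& u, const Vec3& v) { return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]; }

// Hypothetical pairwise reduction: replace rb[i] by rb[i] - n * rb[j] (n an integer)
// whenever that makes rb[i] shorter by more than a relative tolerance.
// Returns the number of sweeps used, or -1 if max_sweeps was reached.
inline int pairwise_reduce(Basis& rb, int max_sweeps = 100, double rel_tol = 1e-12)
{
  assert(max_sweeps > 0 && rel_tol >= 0.0);
  for (int sweep = 1; sweep <= max_sweeps; ++sweep)
  {
    bool changed = false;
    for (int i = 0; i < 3; ++i)
      for (int j = 0; j < 3; ++j)
      {
        if (i == j)
          continue;
        // Nearest-integer multiple of rb[j] contained in rb[i].
        const double n = std::round(dot(rb[i], rb[j]) / dot(rb[j], rb[j]));
        if (n == 0.0)
          continue;
        Vec3 trial = rb[i];
        for (int d = 0; d < 3; ++d)
          trial[d] -= n * rb[j][d];
        // Strict improvement beyond the tolerance; ties are rejected.
        if (dot(trial, trial) < dot(rb[i], rb[i]) * (1.0 - rel_tol))
        {
          rb[i]   = trial;
          changed = true;
        }
      }
    if (!changed)
      return sweep;
  }
  return -1;
}
```

Each accepted step is an integer combination of the existing vectors, so the result still spans the same lattice. Returning a status instead of throwing makes it easy to log how many sweeps a given cell needs.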
This goes wrong after first use of the wavefunction. Can you please put the pwscf.pwscf.h5 in (say) the global shared on OLCF?
@prckent Thank you Paul for following up. All the files have been copied to /lustre/orion/mat151/world-shared/ksu/github_4975 on Frontier.
Tried a GCC 13.2 CPU build on nitrogen2 (RHEL9.3) and was not able to reproduce the problem. Will try CADES directly. Possibly there is a compiler or numerical tolerance issue.
Also, 138GiB wavefunction file!
Edit: Also tried Ubuntu 22.04, gcc 11.4 and clang 14.
Please send me your build script or upload here. The one for CADES is clearly well out of date and there is not a new enough cmake available system-wide to build the development version.