Fix reinitialization issues when using total forces in NAMD
(2025-12-09: rewritten description - @giacomofiorin)
Fixes two bugs triggered by a (relatively infrequent) combination of features: total forces being used in a simulation where the system's topology is changed mid-run:
- Fix a boundary error when accessing atomic total forces from the previous step
- Fix wrong computation of the total forces after a configuration change (the symptom was a `nan` in the output, caused by a division by zero because not all Colvars properties had been set previously)
Fixes #778
~~FYI this is also in the integrate_cudagm branch~~ (outdated)
~~(Replaced with the commit I had before, so that there is no duplicate when we rebase either branch)~~ (outdated)
~~Ah I did not see that branch without a PR. If you'd rather merge the other branch soon, please close this PR and open a PR with integrate_cudagm.~~ (outdated)
While commit https://github.com/Colvars/colvars/pull/887/commits/8cc8ef9c65fa37820e8bf3130e059fc3af948896 seems to fix the problem, in local testing I still get a crash from 003_reinitatoms, since `atoms_map` is not correctly re-initialized after `cv reset`.
Commit https://github.com/Colvars/colvars/pull/887/commits/8cc8ef9c65fa37820e8bf3130e059fc3af948896 should have solved the crash in https://github.com/Colvars/colvars/pull/891. As I have said, while the crash doesn't appear anymore on GitHub CI, I still get another crash in my local test with 003_reinitatoms, which might occur for multiple reasons:
- When initializing atom groups, Colvars has to communicate with the proxy directly (the SOA code `init_atom_from_proxy` inherits the design from the AOS code). In the case of `colvarproxy_namd`, this requires the `atoms_map` to be initialized with -1, but after `cv reset` the map is cleared. There is no way to re-initialize `atoms_map` after `colvarproxy_namd::reset()` since `colvarproxy::add_config` and `colvarproxy::engine_ready()` are not overridable. The only hacky way is to call `init_atoms_map()` in `colvarproxy_namd::reset()`.
- I don't know why the atom IDs with total forces enabled in NAMD GM are not filtered out for each client. I could apply the following patch to solve the issue:
```diff
diff --git a/src/GlobalMasterServer.C b/src/GlobalMasterServer.C
index 8988998b..a525f73a 100644
--- a/src/GlobalMasterServer.C
+++ b/src/GlobalMasterServer.C
@@ -468,6 +468,9 @@ int GlobalMasterServer::callClients() {
if (positions[i].atomID == rqAtoms[j]){
clientAtomPositions.add(receivedAtomPositions[positions[i].index]);
clientAtomIDs.add(rqAtoms[j]);
+ if (master->requestedTotalForces()) {
+ clientReceivedForceIDs.add(rqAtoms[j]);
+ }
--i; // rqAtoms may contain repeats
if ( ++j == num_atoms_requested ) break;
}
@@ -498,7 +501,7 @@ int GlobalMasterServer::callClients() {
gtf_i,gtf_i+(numForceSenders?master->old_num_groups_requested:0),
goi_i, goi_e, gov_i, gov_e,
forced_atoms_i,forced_atoms_e,forces_i,
- receivedForceIDs.begin(),receivedForceIDs.end(),receivedTotalForces.begin());
+ mf_i, mf_e, receivedTotalForces.begin());
NAMD_EVENT_STOP(1, NamdProfileEvent::GM_PROC_DATA);
a_i = receivedAtomIDs.begin();
p_i = receivedAtomPositions.begin();
```
But then (i) the list of total-force atom IDs is sorted the same way as the list of normal atom IDs requesting positions, which changes the numerical results of the total forces and fails the tests requiring total forces, and (ii) I don't know if Colvars needs to "introspect" the forces from TCL forces. In other words, I don't know if this is what Colvars wants;
- Solving the two issues above is still not enough. There is another crash in https://github.com/Colvars/colvars/blob/f83f1fe999c4ae1a5f65d3f94f90882944d89490/src/colvartypes.h#L348
Here, the boundary should be checked first:
```cpp
while ((i < this->size()) && (stream >> (*this)[i])) {
```
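(For illustration only, a standalone sketch of that bounds-check pattern; `fixed_vector` below is a hypothetical stand-in, not the actual class in `colvartypes.h`.)

```cpp
#include <iostream>
#include <sstream>
#include <vector>

// Hypothetical stand-in for a fixed-size vector type read from a stream
struct fixed_vector {
  std::vector<double> data;
  explicit fixed_vector(size_t n) : data(n, 0.0) {}
  size_t size() const { return data.size(); }
  double &operator[](size_t i) { return data[i]; }
};

std::istream &operator>>(std::istream &stream, fixed_vector &v) {
  size_t i = 0;
  // Checking i < v.size() before extracting prevents writing past the end
  // when the stream holds more numbers than the vector can store
  while ((i < v.size()) && (stream >> v[i])) {
    i++;
  }
  return stream;
}

int main() {
  fixed_vector v(3);
  std::istringstream in("1.0 2.0 3.0 4.0"); // one extra value: must not overflow
  in >> v;
  std::cout << v[0] << " " << v[1] << " " << v[2] << "\n"; // 1 2 3
}
```

Checking the index first means any extra values left in the stream are simply not read, instead of being written past the end of the vector.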
@giacomofiorin Could you look into the issues above?
I have worked out a NAMD patch for the second issue that sorts the total-forces array correctly and therefore does not produce incorrect numerical results:
```diff
diff --git a/src/GlobalMasterServer.C b/src/GlobalMasterServer.C
index 8988998b..d04b6a5e 100644
--- a/src/GlobalMasterServer.C
+++ b/src/GlobalMasterServer.C
@@ -442,6 +442,30 @@ int GlobalMasterServer::callClients() {
}
sort(positions.begin(), positions.end(), atomID_less());
+ // Same as above, we need a vector to sort the atom IDs requiring total forces
+ vector <position_index> total_forces;
+ // Check if any of the clients requires total forces
+ bool total_forces_requested = false;
+ while (m_i != m_e) {
+ GlobalMaster *master = *m_i;
+ if (master->requestedTotalForces()) {
+ total_forces_requested = true;
+ break;
+ }
+ m_i++;
+ }
+ if (total_forces_requested) {
+ for (int j = 0; f_i != f_e; ++f_i, ++j) {
+ position_index temp;
+ temp.atomID = *f_i;
+ temp.index = j;
+ total_forces.push_back(temp);
+ }
+ sort(total_forces.begin(), total_forces.end(), atomID_less());
+ }
+
+ // Reset the client pointer;
+ m_i = clientList.begin();
/* call each of the masters with the coordinates */
while(m_i != m_e) {
int num_atoms_requested, num_groups_requested, num_gridobjs_requested;
@@ -474,6 +498,36 @@ int GlobalMasterServer::callClients() {
}
if ( j != num_atoms_requested ) NAMD_bug(
"GlobalMasterServer::callClients() did not find all requested atoms");
+ if (master->requestedTotalForces()) {
+ int k = 0;
+ // FIXME: There's no reliable way to check if this is really the first timestep
+ // At the first time step, the total forces may not be ready
+ const bool total_forces_ready = receivedTotalForces.size() == rqAtoms.size();
+ for (int i = 0; i < total_forces.size(); i++) {
+ if (total_forces[i].atomID == rqAtoms[k]){
+ clientReceivedForceIDs.add(rqAtoms[k]);
+ if (total_forces_ready) {
+ clientReceivedTotalForces.add(receivedTotalForces[total_forces[i].index]);
+ }
+ --i; // rqAtoms may contain repeats
+ if ( ++k == num_atoms_requested ) break;
+ }
+ }
+ // FIXME: For Colvars, it is possible to do
+ // cv configfile xxx.colvars
+ // run 20
+ // structure yyy.pdb
+ // cv reset
+ // cv configfile yyy.colvars
+ // run 20
+ // The problem arises when "yyy.colvars" and "xxx.colvars" select different sets of
+ // atoms for total forces. There is no good way to check if this is the first step and
+ // total forces should be skipped here (but Colvars would check it anyway and discard
+ // the invalid result).
+ // if ( total_forces_ready && k != num_atoms_requested ) {
+ // NAMD_bug("GlobalMasterServer::callClients() did not find all requested total forces");
+ // }
+ }
}
AtomIDList::iterator ma_i = clientAtomIDs.begin();
@@ -498,7 +552,7 @@ int GlobalMasterServer::callClients() {
gtf_i,gtf_i+(numForceSenders?master->old_num_groups_requested:0),
goi_i, goi_e, gov_i, gov_e,
forced_atoms_i,forced_atoms_e,forces_i,
- receivedForceIDs.begin(),receivedForceIDs.end(),receivedTotalForces.begin());
+ mf_i, mf_e, mtf_i);
NAMD_EVENT_STOP(1, NamdProfileEvent::GM_PROC_DATA);
a_i = receivedAtomIDs.begin();
p_i = receivedAtomPositions.begin();
```
But I don't know if Colvars really needs it.
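(Purely as an illustration of the sorting idea in the patch above: a standalone sketch with hypothetical types. The real code uses NAMD's `ResizeArray` containers and also handles repeated requested atom IDs, which this sketch skips.)

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for NAMD's position_index: an atom ID plus the index
// of its entry in the received-data buffer
struct IndexedID { int atomID; int index; };

int main() {
  // Total forces as received (arbitrary order) and the atom IDs they belong to
  std::vector<double> receivedTotalForces = {0.3, 0.1, 0.2};
  std::vector<int> receivedForceIDs = {7, 2, 5};

  // Build and sort (atomID, index) pairs, as the proposed patch does once per step
  std::vector<IndexedID> total_forces;
  for (int j = 0; j < (int)receivedForceIDs.size(); j++)
    total_forces.push_back({receivedForceIDs[j], j});
  std::sort(total_forces.begin(), total_forces.end(),
            [](const IndexedID &a, const IndexedID &b) { return a.atomID < b.atomID; });

  // One client's requested atom IDs, also sorted
  std::vector<int> rqAtoms = {2, 7};

  // Walk both sorted lists to collect only this client's forces, in request order
  std::vector<double> clientTotalForces;
  size_t k = 0;
  for (size_t i = 0; i < total_forces.size() && k < rqAtoms.size(); i++) {
    if (total_forces[i].atomID == rqAtoms[k]) {
      clientTotalForces.push_back(receivedTotalForces[total_forces[i].index]);
      k++;
    }
  }
  for (double f : clientTotalForces) std::printf("%g ", f); // prints: 0.1 0.3
  std::printf("\n");
}
```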
* When initializing atom groups, Colvars has to communicate with the proxy directly (the SOA code `init_atom_from_proxy` inherits the design from the AOS code). In the case of `colvarproxy_namd`, this requires the `atoms_map` to be initialized with -1, but after `cv reset` the map is cleared. There is no way to re-initialize `atoms_map` after `colvarproxy_namd::reset()` since `colvarproxy::add_config` and `colvarproxy::engine_ready()` are not overridable. The only hacky way is to call `init_atoms_map()` in `colvarproxy_namd::reset()`.
I would like to have a base-class function for the atoms map, because it's not a NAMD-specific issue. LAMMPS and GROMACS are not using it yet, but simply because they lack this optimization.
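(To make that idea concrete, a minimal sketch of what such a base-class helper could look like; all names below are hypothetical, nothing like this exists in `colvarproxy` yet.)

```cpp
#include <cstdio>
#include <vector>

// Hypothetical sketch of a base-class atoms map: a global-atom-ID -> local-index
// table, with -1 marking atoms that Colvars has not requested
class proxy_atoms_map_sketch {
public:
  // (Re-)size the map for the current number of atoms in the engine,
  // e.g. after a topology change, and mark every atom as "not requested"
  void init_atoms_map(size_t num_engine_atoms) {
    atoms_map.assign(num_engine_atoms, -1);
  }

  // Record that engine atom `atom_id` is the `local_index`-th Colvars atom
  void map_atom(int atom_id, int local_index) { atoms_map[atom_id] = local_index; }

  // -1 means the atom was requested only by other clients (e.g. TCL forces),
  // so its total force should be skipped by Colvars
  int local_index_of(int atom_id) const { return atoms_map[atom_id]; }

private:
  std::vector<int> atoms_map;
};

int main() {
  proxy_atoms_map_sketch proxy;
  proxy.init_atoms_map(10);  // would be called again after the topology changes
  proxy.map_atom(7, 0);      // Colvars requested engine atom 7 as its first atom
  std::printf("%d %d\n", proxy.local_index_of(7), proxy.local_index_of(3)); // 0 -1
}
```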
* I don't know why the atom IDs with total forces enabled in NAMD GM are not filtered out for each client. I could apply the following patch to solve the issue:
Ask the other folks at UIUC (including Jim), and you will find no fans of GlobalMaster's multiplexed design. But the alternative would have been multiple rounds of reduction/broadcast, and I think at the time it was not only more coding work but also potentially harmful to multinode performance.
(ii) I don't know if Colvars needs to "introspect" the forces from TCL forces. In other words, I don't know if this is what Colvars want;
I don't think we know what each person using Colvars really wants, but I agree that that's a very unlikely use case.
@giacomofiorin Could you look into the issues above?
Well, you did that just hours after
Editing the PR's title and description for merging. Will close #891
I would like to have a base-class function for the atoms map, because it's not a NAMD-specific issue. LAMMPS and GROMACS are not using it yet, but simply because they lack this optimization.
I didn't use `atoms_map` in the CUDAGM implementation either. I think it might be something to work around the issue when TCLForces and Colvars both use total forces, and Colvars needs the `atoms_map` to filter out the forces that are only required by other GM clients. But maybe I am wrong, I still don't fully understand the design and some of the history of the GlobalMaster code.
Ask the other folks at UIUC (including Jim), and you will find no fans of GlobalMaster's multiplexed design. But the alternative would have been multiple rounds of reduction/broadcast, and I think at the time it was not only more coding work but also potentially harmful to multinode performance.
I have tried to solve the issue and revise the NAMD patch that I posted here in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486. Could you take a look at that NAMD MR to see if Colvars really needs it to work correctly?
I didn't use `atoms_map` in the CUDAGM implementation either. I think it might be something to work around the issue when TCLForces and Colvars both use total forces, and Colvars needs the `atoms_map` to filter out the forces that are only required by other GM clients. But maybe I am wrong, I still don't fully understand the design and some of the history of the GlobalMaster code.
It's not just because of the multiple GlobalMaster objects, but also because the messages received by GlobalMasterServer are not in order, either by task ID or by atom. Typical solutions to this would be sorting the atom buffer to match the order that Colvars expects, or using a lookup table. I am less familiar with the CUDAGM code in the NAMD repo than with the interface code that you have added here in the Colvars repo, but it seems that you are effectively sorting the atomic coordinates as you gather them?
I should also have been clearer on the other backends:
- The LAMMPS interface is not using `atoms_map`, but it has its own lookup table that is based on internal LAMMPS code. (What it doesn't do compared to NAMD/GlobalMaster is skipping the MPI ranks that have no Colvars atoms.)
- GROMACS currently shares with Colvars the entire system (no need to sort anything), but it could use more optimization there: even with a GPU-resident interface for Colvars, which is currently just not there, GROMACS cannot run a MARTINI system in GPU-resident mode (even a large enough system that it would make sense). I don't know when that would change... People are still waiting for non-bonded tabulated potentials to be added back.
I have tried to solve the issue and revise the NAMD patch that I posted here in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486. Could you take a look at that NAMD MR to see if Colvars really needs it to work correctly?
Doing just that. I think it mostly depends on what the other NAMD folks will say. But if we can assume that the only atoms and forces that GlobalMasterColvars sees are the ones it requested, that would be an improvement.
I have tried to solve the issue and revise the NAMD patch that I posted here in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486. Could you take a look at that NAMD MR to see if Colvars really needs it to work correctly?
Given this open MR, we maybe wait a few days to merge this? Or merge it now (after including patches here to reproduce the NAMD changes that you're already submitting in that MR)?
I have tried to solve the issue and revise the NAMD patch that I posted here in https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486. Could you take a look at that NAMD MR to see if Colvars really needs it to work correctly?
Given this open MR, we maybe wait a few days to merge this? Or merge it now (after including patches here to reproduce the NAMD changes that you're already submitting in that MR)?
This PR currently doesn't completely solve the crash issue. Even after this one and https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486 get merged, I still have no idea how to re-initialize the `atoms_map` in Colvars.
This PR currently doesn't completely solve the crash issue. Even after this one and https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486 get merged, I still have no idea how to re-initialize the `atoms_map` in Colvars.
You mean that there currently is no callback to Colvars to state that a new structure command has been loaded? If so, we should design a test that catches that condition better than the current 003_reinitatoms
(I can currently run that test successfully on my local machine even without the changes in your NAMD MR...)
You mean that there currently is no callback to Colvars to state that a new `structure` command has been loaded? If so, we should design a test that catches that condition better than the current 003_reinitatoms
I'm not sure. Maybe you can make any one of `add_config`, `engine_ready` and `parse_module_config` virtual and re-implement it to call `init_atoms_map` before calling the corresponding base-class function.
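(A rough sketch of that suggestion, assuming for the sake of argument that `engine_ready()` were the hook made virtual; the signatures below are simplified and hypothetical, not the actual `colvarproxy` / `colvarproxy_namd` declarations.)

```cpp
// Hypothetical, simplified base class: engine_ready() made virtual so that
// backends can refresh their lookup tables after a reset
class colvarproxy_sketch {
public:
  virtual ~colvarproxy_sketch() = default;
  virtual int engine_ready() {
    // ... base-class bookkeeping (simplified away here) ...
    return 0;
  }
};

// Hypothetical NAMD proxy: re-initialize the atoms map before deferring to
// the base-class implementation, so that atom groups created after "cv reset"
// find the map filled with -1 again
class colvarproxy_namd_sketch : public colvarproxy_sketch {
public:
  int engine_ready() override {
    init_atoms_map();
    return colvarproxy_sketch::engine_ready();
  }

private:
  void init_atoms_map() {
    // ... resize the map to the current atom count and fill it with -1 ...
  }
};
```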
(I can currently run that test successfully on my local machine even without the changes in your NAMD MR...)
It does run with the default compilation flags with optimizations, but doesn't work if you use `CXXOPTS = -g` in your NAMD Make.config.
I'm not sure. Maybe you can make any one of `add_config`, `engine_ready` and `parse_module_config` virtual and re-implement it to call `init_atoms_map` before calling the corresponding base-class function.
Yeah, that's definitely required, I'll do that.
The flip side of it is catching the use case when the system's topology changes while Colvars is still defined. That may need a change in NAMD.
The fix looks good! I wish I had seen the bug... Do you think a variant of the reinitatoms test could trigger the issue reliably?
Otherwise I'm fine with merging, but we won't have a regression test for this.
The fix looks good! I wish I had seen the bug... Do you think a variant of the reinitatoms test could trigger the issue reliably? Otherwise I'm fine with merging, but we won't have a regression test for this.
I have no earthly idea! I couldn't see the bug locally either, I just took @HanatoK's word for it...
The fix looks good! I wish I had seen the bug... Do you think a variant of the reinitatoms test could trigger the issue reliably? Otherwise I'm fine with merging, but we won't have a regression test for this.
I have no earthly idea! I couldn't see the bug locally either, I just took @HanatoK's word for it...
Let me try to push a commit to force the boundary check and call cvm::error.
Hi @jhenin and @giacomofiorin! Could you try https://github.com/Colvars/colvars/pull/887/commits/bc27da1c83e3b2bac6e1e8ddb652d7d138ba3293 locally? The ARM64 build currently skips the 003_reinitatoms test, possibly due to the same crash.
Indeed! This does reveal the crash, and therefore the fix. Now I wish I had an ARM64 test box to investigate - do you?
Indeed! This does reveal the crash, and therefore the fix. Now I wish I had an ARM64 test box to investigate - do you?
I have one, but the issue now is that on ARM64, the total force is `nan` on the first step after the re-initialization (see https://github.com/Colvars/colvars/actions/runs/20033481534/job/57448548555#step:18:28), while it is `-nan` on x86-64, so the spiff fails. I have no idea how to fix the issue, so I have to disable the ARM64 test again.
the issue now is that on ARM64, the total force is `nan` on the first step after the re-initialization (see https://github.com/Colvars/colvars/actions/runs/20033481534/job/57448548555#step:18:28), while it is `-nan` on x86-64, so the spiff fails. I have no idea how to fix the issue, so I have to disable the ARM64 test again.
We had commented on this issue already: https://github.com/Colvars/colvars/pull/764#issuecomment-2715384242 (which I quickly marked in #778)
It is, to be fair, a rather niche issue when people are changing the system's topology and are also gathering total forces from the atoms. The solution is a bit complicated, I'm about halfway there and need to check out for the day.
Question for both: since neither NAMD nor LAMMPS provides total forces at the same time step, how do you feel about eliminating that use case to simplify the total-forces logic?
(As an alternative, we can just forget about the `nan` and merge this PR as is, with the needed updates to the description)
(As an alternative, we can just forget about the `nan` and merge this PR as is, with the needed updates to the description)
Maybe you can revert https://github.com/Colvars/colvars/pull/887/commits/0a0e3c2c0625ac9e0b8842029626a9ebddbec2de before merging. That commit is only used to force the boundary check.
Maybe you can revert 0a0e3c2 before merging. That commit is only used to force the boundary check.
Let's keep it there, we can probably take the performance hit these days. It's worth noting that most use cases of straight ABF would involve centers of mass, for which the total force is available in aggregate form, i.e. bypassing the boundary check entirely.
@jhenin I think this is done now. Take a look?
I have made a NAMD MR to pass the step information to the CUDAGM clients (https://gitlab.com/tcbgUIUC/namd/-/merge_requests/492). The corresponding Colvars PR in https://github.com/Colvars/colvars/pull/899 could use the updated interface to invalidate the total forces for startup steps.
Looks good to me overall. Can you take a look at the GPU code path in ddd8277 and provide a way to reset the protected `ft` field there?
I saw that commit, and thinking about it, the only way I see is separating out what I just added here: https://github.com/Colvars/colvars/blob/ddd8277151b4a65122c67135a0a357289297a6b6/src/colvar.cpp#L1647-L1655 into a dedicated function, and call it there.
More generally, I see this as a side effect of the code duplication that we're starting to have (but we should try to reduce it)
I saw that commit, and thinking about it, the only way I see is separating out what I just added here: ... into a dedicated function, and call it there.
To be clear, I'll do that right now