mdanalysis
mdanalysis copied to clipboard
occupancy and b-factors not updated when importing multi-frame PBD
MDAnalysis uses the occupancies/b-factors of the first frame for all subsequent frames when importing a multi-frame PDB.
Code to reproduce the behavior
import MDAnalysis
u = MDAnalysis.Universe('traj.pdb')
sel = u.select_atoms("resid 178 and segid A and name HD2")
for _ in u.trajectory:
print(sel.positions)
for _ in u.trajectory:
print(list(sel.occupancies))
for _ in u.trajectory:
print(list(sel.tempfactors))
Also,
grep 'HD2 ASPTA 178' traj.pdb
ATOM 2808 HD2 ASPTA 178 81.480 35.990 49.120 0.00 0.00 H
ATOM 2808 HD2 ASPTA 178 83.700 34.570 50.290 1.00 1.00 H
ATOM 2808 HD2 ASPTA 178 80.540 40.020 49.060 0.06 0.06 H
ATOM 2808 HD2 ASPTA 178 81.100 37.920 49.720 0.20 0.20 H
ATOM 2808 HD2 ASPTA 178 84.130 38.260 50.810 0.09 0.09 H
ATOM 2808 HD2 ASPTA 178 81.790 39.980 49.520 0.04 0.04 H
And link to traj.pdb.
Expected behavior
[[81.48 35.99 49.12]]
[[83.7 34.57 50.29]]
[[80.54 40.02 49.06]]
[[81.1 37.92 49.72]]
[[84.13 38.26 50.81]]
[[81.79 39.98 49.52]]
[0.00]
[1.00]
[0.06]
[0.20]
[0.09]
[0.04]
[0.00]
[1.00]
[0.06]
[0.20]
[0.09]
[0.04]
Actual behavior
[[81.48 35.99 49.12]]
[[83.7 34.57 50.29]]
[[80.54 40.02 49.06]]
[[81.1 37.92 49.72]]
[[84.13 38.26 50.81]]
[[81.79 39.98 49.52]]
[0.0]
[0.0]
[0.0]
[0.0]
[0.0]
[0.0]
[0.0]
[0.0]
[0.0]
[0.0]
[0.0]
[0.0]
Current version of MDAnalysis
- Which version are you using? 2.3.0
- Which version of Python ? 3.10.6
- Which operating system? MacOS
This is an interesting one. We treat things like the occupancy and bfactors main attributes as a "topology" things, so they don't don't get updated as you read a new frame.
However, specifically for occupancy, we do actually store that time dependent data under ts.data.
So for example if you did:
import MDAnalysis as mda
u = mda.Universe('traj.pdb')
for ts in u.trajectory:
print(ts.data['occupancy'])
You should be able to access the time dependent occupancy data.
Adding tempfactors to this isn't a very difficult task, and would only require adding similar code bits to what we do with occupancy in this method: https://github.com/MDAnalysis/mdanalysis/blob/e7ee5a4a077c2179d6037ba634240ce7ebcb76ef/package/MDAnalysis/coordinates/PDB.py#L377
This offers a quick fix, but doesn't remove the fact that we have two competing attributes and asking folks to look in ts.data over the topology attribute isn't very intuitive / error prone. Hooking back the ts.data values back into the topology attribute might require a bit more work / extra thought. @orbeckst had one option in #3504 which could maybe work (https://github.com/MDAnalysis/mdanalysis/issues/3504#issuecomment-1046297997) but probably changes the expected behaviour of some attributes.
Hi, thanks for getting back to me so quickly. I wasn't aware one is able to access the time-depedent data through ts.data. This is very helpful in my particular case.
I can second the want for this. Lots of people store frame-specific data in the b_factor column for visualisation purposes, and it would be really useful to have easy access to that value. I've just tested out the use of the occupancy column and works well (https://github.com/BradyAJohnston/MolecularNodes/issues/128#issuecomment-1368705191) but having it available for tempfactors would be great.
@BradyAJohnston for arbitrary frame specific data we do have the auxiliary system, that probably would be the best place to fetch & store such data if looking to interfacing formally with the MDAnalysis API.
@IAlibay Thanks but the use case I have I use MDAnalysis under the hood so there is no user interfacing with it. The data goes straight from topology / trajectory to 3D model inside of Blender so the arbitrary data needs to already exist inside of the trajectory (in this case it would be multi-model .pdb files usually).
Dynamically updating a topology is a behaviour change I doubt we could allow any time soon (it's a pretty big overhaul of how we expect things to work), is there any way users could pass in a separate file with timeseries data that could be read alongside? I.e. like a CSV, XVG, or EDR file or maybe some kind of serialised xarray input?
So one way to (ab)use the current tooling would be via a Transformation... these are in some ways a callback that is triggered on each frame read, and they could have vision of an AtomGroup (even of all atoms). They are intended to modify the Timestep (positions) based on AtomGroups (i.e. unwrapping coords) but you could flip this relationship
class BFactorUpdater(TransformationBase):
def __init__(self, ag):
self._ag = ag
def _transform(self, ts):
self._ag.bfactors = ts.data['occupancy'] # assumes this is populated by the Reader, true for PDB
return ts
u = mda.Universe()
bfactor_updater = BFactorUpdater(u.atoms)
u.trajectory.add_transformations([bfactor_updater])
that's some scrap code off the top of my head, but it should be possible along those lines
For my use case at least, the accessing via:
for ts in univ.traj:
ts.data['tempfactors']
With the same setup as the occupancies would work totally fine. I don't need it to be topology related, just a frame-specific data access.
There is potential for reading .csv or other formats alongside the data and adding it via index, but as it's already common practice for a lot of people to use the (now deprecated) .pdb format for storing arbitrary data in the B-factor column, I'm hoping to support that as best I can. For now at least getting them to store the data inside of the occupancy instead will achieve the same result, but having access to ts.data['tempfactor'] would be ideal.
but as it's already common practice for a lot of people to use the (now deprecated) .pdb format for storing arbitrary data in the B-factor column,
Herein lies the issue. It's a non-standard use case for a deprecated format. I myself am guilty of abusing the PDB format this way so I'm happy to open a PR with the relevant fix as a temporary stopgap, but I would very much like to open the discussion about how this should be dealt with going forward.
Having these arbitrary changes to readers are both user unfriendly and end up having a reasonably large maintenance cost in the long run. I'd like to be able to offer something that suits the downstream needs that would be the canonical way we encourage folks to use things.
I would certainly be very appreciative of any direction you wish to take, both in short-term solutions and longer-term design plans.