Memory Issue and Segmentation Fault When Increasing Cutoff in ProLIF Selection with Large Systems
I'm experiencing a problem when using ProLIF to analyze interactions between different domains. My system includes one domain with approximately 550 residues. To keep the selection size manageable, I used a selection like: "PROA": "segid PROA and byres (around 11 (segid PROB))". This limits the selection and works well, but unfortunately it doesn't capture all interface interactions: I verified this by selecting different residues in PROA and found additional interactions that weren't included in the initial selection.
When I increase the cutoff from 11 to 13 or 15 Å to capture these missing interactions, the number of atoms increases substantially, leading to a segmentation fault: line 30: 261962 Segmentation fault (core dumped). Could you advise on how to handle this memory issue, or suggest alternative approaches to refine the selection without losing critical interactions?
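For reference, my setup boils down to something like this (a minimal sketch; the file names are placeholders for my actual files):
import MDAnalysis as mda
import prolif as plf

# placeholder topology/trajectory names
u = mda.Universe("system.psf", "traj.dcd")
# restrict PROA to residues within 11 Å of PROB to keep the selection small
proa = u.select_atoms("segid PROA and byres (around 11 (segid PROB))")
prob = u.select_atoms("segid PROB")
fp = plf.Fingerprint()
fp.run(u.trajectory, proa, prob)
df = fp.to_dataframe()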
Hi @mohamedmarzouk22
I guess one approach could be to first go through the trajectory and record the list of PROA residues that come close enough to interact with PROB (although I'm not sure how efficient the distance selection is, to be honest; it might take a while), and then make a selection based on that list of residues and use it for ProLIF. Something like:
from tqdm.auto import tqdm

# u is the MDAnalysis Universe loaded with your topology and trajectory
# step between frames, adjust accordingly
step = 10

# find residues within reach for interactions
residues = set()
for timestep in tqdm(u.trajectory[::step]):
    sel = u.select_atoms("segid PROA and byres (around 6 (segid PROB))")
    # update() adds each Residue individually rather than the whole group
    residues.update(sel.residues)
print(residues)

# make selection to use in ProLIF
final_selection = u.select_atoms(
    " or ".join(
        # resid + segid is enough to identify a residue in segid-based systems
        f"(resid {r.resid} and segid {r.segid})"
        for r in residues
    )
)
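and then run the fingerprint analysis on that fixed selection (a minimal sketch, reusing u and final_selection from above):
import prolif as plf

prob = u.select_atoms("segid PROB")
fp = plf.Fingerprint()
fp.run(u.trajectory, final_selection, prob)
df = fp.to_dataframe()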
Tell me if that works
Thank you for the suggestion.
I tried it, but even with a 5 Å cutoff I ended up with 100 residues in PROA, along with some from PROC. It doesn't work as expected: reducing the cutoff selects fewer residues, but then I end up missing important interactions.
Is there any information about a limit on the number of atoms?
A 5 Å cutoff won't be enough, you need at least 6 or 7 Å, but in any case that doesn't solve the problem since you still get a segmentation fault.
I'd suggest using a selection big enough to find all interactions, like segid PROA and byres (around 20 (segid PROB)), then splitting that selection into chunks that your machine can manage, running the fingerprint analysis for each chunk separately, and finally concatenating the results. Something like:
import prolif as plf

# u is the same MDAnalysis Universe as before
proa = u.select_atoms("segid PROA and byres (around 20 (segid PROB))")
prob = u.select_atoms("segid PROB")
residue_chunk_size = 500  # set to how many residues your machine manages to run with
fp = plf.Fingerprint()

def chunk_residues(residue_group, chunk_size):
    # yield successive chunks of residues from a ResidueGroup
    i = 0
    while i < residue_group.n_residues:
        yield residue_group[i:i + chunk_size]
        i += chunk_size

# run on chunks
ifps = []
for residue_chunk in chunk_residues(proa.residues, residue_chunk_size):
    fp.run(u.trajectory, residue_chunk.atoms, prob)
    ifps.append(fp.ifp)

# concatenate IFPs (skip the last one as it is already stored in the Fingerprint object)
for ifp in ifps[:-1]:
    for frame, data in ifp.items():
        fp.ifp[frame].update(data)

fp.to_dataframe()
Hopefully this works!
Thank you for your suggestion, and I am sorry for the delayed response.
Unfortunately, I also tried this approach, but I still encountered the same error (Segmentation fault (core dumped)) in the last chunk.
However, I noticed something interesting:
• This error only appears with large trajectories.
• When I used a selection radius of 20 Å, resulting in approximately 180 residues, the script worked when applied to a trajectory with only 100 frames.
• However, when applying the same script to a different trajectory of the same system (but with 25,000 frames), the error occurred.
• For large trajectories, reducing the selection radius (resulting in about 50 residues) allows the script to run successfully.
• Despite this, the error persists even when using chunks to run the analysis.
Would you have any further suggestions on how to handle this issue with large trajectories?
Here is the error encountered with large trajectories and a large number of residues.
Here is the working case with a large number of residues for a trajectory with a small number of frames.
Hi again, and no worries for the delayed response!
What were you using for residue_chunk_size in the snippet I shared above? I set it to 500 initially, but that's ridiculously high for an already large protein-protein system! Maybe try 10 to start with (and 1 if it really doesn't want to work).
I tried a chunk size of 20 residues, then 10, but the error persists regardless of whether the trajectory is large or small.