BioSimSpace icon indicating copy to clipboard operation
BioSimSpace copied to clipboard

Slow performance of combine.py

Open SofiaBariami opened this issue 5 years ago • 15 comments

Hello, I am using BioSimSpace to to combine a protein and a ligand to the same system. The protein is made out of 4 fragments, that I used combine.py to assemble. I did that while working remotely from abroad, and I noticed that I had to leave the script running overnight in order to finish. The problem was when BSS.IO.readMolecules() was trying to read the protein, that consists of ~6100 atoms, a number that doesn't justify the time needed to run. At that point I didn't pay much attention to that problem. Today however, I stumbled upon the same problem, when I had to combine the protein with the ligand. It would take hours to combine them, so I tried to use Parmed:

import parmed as pmd
f1 = pmd.load_file("G1_4a.prm7","G1_4a.rst7")
f2= pmd.load_file("protein_final.prm7","protein_final.rst7")
f3 = f1 + f2

f3.save("protein_4a.rst7")
f3.save("protein_4a.prmtop")

With parmed, the process takes a couple of minutes to be completed, but the new files cannot be loaded in BioSimSpace..

BSS.IO.readMolecules(["protein_4a.rst7","protein_4a.prmtop"])
OSError: Failed to read molecules from ['protein_4a.rst7', 'protein_4a.prmtop']. It looks like you failed to include a topology file.

..even if the combined system seems to be "parameterised":

In [67]: f3                                                                                                                                                                              
Out[67]: <AmberParm 6154 atoms; 5 residues; 6242 bonds; parametrized>

By comparing the new parameters file of the combined system protein_4a.prmtop, with the parameters file of just the protein, I noticed that there were many parameters missing form the %FLAG NONBONDED_PARM_INDEX section of the file of the system, as well as from the %FLAG LENNARD_JONES_ACOEF and %FLAG LENNARD_JONES_BCOEF. I also had to erase the zeros under the SOLTY flag, that were causing problems with parmed.

Something that I should mention is that the parm7 file of the protein, is 4635319 lines, that might explain the slow performance of BSS, but it was the result of the combined protein fragments, so I don't think we can do something about it. Also, some of the atom names are just numbers, that might also cause a problem with bss (?). If you want to take a look at the files, you can find them here. Please let me know if you have any suggestions about this problem. Thanks a lot!

SofiaBariami avatar Jan 08 '20 16:01 SofiaBariami

I'll take a look and let you know if I can figure out what might be happening.

ppxasjsm avatar Jan 08 '20 16:01 ppxasjsm

Hi Sofia,

Could you activate verbose mode in BioSimSpace to see what the Sire error is?

See here for details: https://github.com/michellab/BioSimSpace/issues/137

I might have some time later too take a look at the performance issue.

Cheers.

On Wed, 8 Jan 2020, 16:35 Toni Mey, [email protected] wrote:

I'll take a look and let you know if I can figure out what might be happening.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/michellab/BioSimSpace/issues/139?email_source=notifications&email_token=AAE6K3MBE4HA54LK7TO6WA3Q4X6GHA5CNFSM4KEKX5O2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEINFN2I#issuecomment-572151529, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE6K3NQFSM2SMWSKYX32W3Q4X6GHANCNFSM4KEKX5OQ .

lohedges avatar Jan 08 '20 17:01 lohedges

Ho @SofiaBariami. Your link doesn't contain the two files G1_4a.prm7 and G1_4a.rst7 so I can't fully test the performance issue that you describe. However, I've simply tried loading the two protein_final.* files with BioSimSpace:

import BioSimSpace as BSS

BSS.setVerbose(True)

s = BSS.IO.readMolecules(["protein_final.rst7", "protein_final.prm7"])

This has currently been running for 15 minutes without success. I think the issue is the prm7 file, which is 358 MB in size. It looks like the Sire AmberPrm parser is struggling with this. I'm not 100% sure where the bottleneck is, but I wouldn't be surprised if it was a problem with manipulating a very large non-bonded matrix. I'll have a better look when I have a chance.

lohedges avatar Jan 08 '20 19:01 lohedges

@SofiaBariami This doesn’t make sense to me, the protein has what 6000 atoms? The prm file should be way smaller.

Sent from my iPhone

On 8 Jan 2020, at 19:39, Lester Hedges [email protected] wrote:



Ho @SofiaBariamihttps://github.com/SofiaBariami. Your link doesn't contain the two files G1_4a.prm7 and G1_4a.rst7 so I can't fully test the performance issue that you describe. However, I've simply tried loading the two protein_final.* files with BioSimSpace:

import BioSimSpace as BSS

BSS.setVerbose(True)

s = BSS.IO.readMolecules(["protein_final.rst7", "protein_final.prm7"])

This has currently been running for 15 minutes without success. I think the issue is the prm7 file, which is 358 MB in size. It looks like the Sire AmberPrm parser is struggling with this. I'm not 100% sure where the bottleneck is, but I wouldn't be surprised if it was a problem with manipulating a very large non-bonded matrix. I'll have a better look when I have a chance.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/michellab/BioSimSpace/issues/139?email_source=notifications&email_token=ACZN3ZC6KZPFAA7L3FTRVPDQ4YTWRA5CNFSM4KEKX5O2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEINXG5Q#issuecomment-572224374, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACZN3ZCKOVAEZUNEZWSU3NDQ4YTWRANCNFSM4KEKX5OQ.

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

jmichel80 avatar Jan 08 '20 20:01 jmichel80

My guess is that there is something weird with the non bonded pairs list. The amber parm format writes all N^2 pairs to the file if they are non-zero, which would go some way to explaining 360MB for 6000 atoms. Processing of non bonded pairs in Sire/BioSimSpace is an order N^2 operation that is also very slow.

Have you looked at the protein in VMD or pymol (or whatever the new viewer for 21st Century modellers is ;-))? Does the protein look ok? Have you tried working with only one of the protein chains? Are there non-standard or otherwise strange residues?

chryswoods avatar Jan 08 '20 22:01 chryswoods

Based on the discussion @SofiaBariami and I had this morning, it looks like reading QUBE forcefield into amber format may not be possible at least not the way it is done at the moment. It seems through the XML each atom is essentially its own type with name charge etc, but amber topology files only support 999 atom types as far as I understand. Therefore if you end up with a system of 4500ish atoms you can't write it out correctly or then be able to read it using vmd/pymol, or in fact read it to run a simulation.

ppxasjsm avatar Jan 09 '20 10:01 ppxasjsm

@ppxasjsm where did you read that you can only use 999 atom types. I don't see here why this would be the case https://ambermd.org/FileFormats.php#topo.cntrl ?

@SofiaBariami Based on your above comment

In [67]: f3
Out[67]: <AmberParm 6154 atoms; 5 residues; 6242 bonds; parametrized>

It looks like each molecule you loaded is a single residue, so you end up with a protein made of 5 residues with a huge number of atoms. I wonder if that could explain some of the issues you are encountering. Can you think of a way to generate prm7 files with a more reasonable number of residues.

Pinging @djcole56 for possible comments.

jmichel80 avatar Jan 09 '20 11:01 jmichel80

Also all of the protein fragments are named QUP, so that might be an issue too

SofiaBariami avatar Jan 09 '20 11:01 SofiaBariami

I may be misunderstanding the parmtop format. Not massively familiar with it. But essentially atom names have the format: FORMAT(20a4) (ITITL(i), i=1,20) (see documentation). I am not sure if you can change that, but it means that once we move to say H1000, it is displayed as 1000 and no more spaces between the names. Would the parser still understand what is going on?

ppxasjsm avatar Jan 09 '20 11:01 ppxasjsm

I used another pdb file, with each aminoacid defined seperately. Now the generated prm files looks normal (at least the atom names):

%FLAG ATOM_NAME
%FORMAT(20a4)
H1  CH3 H2  H3  C   O   N   H   CA  HA  CB  HB2 HB3 CG  OD1 OD2 C   O   N   H
CA  HA  CB  HB2 HB3 CG  CD1 HD1 CE1 HE1 CZ  HZ  CE2 HE2 CD2 HD2 C   O   N   H
CA  HA  CB  HB2 HB3 CG  CD1 HD1 NE1 HE1 CE2 CZ2 HZ2 CH2 HH2 CZ3 HZ3 CE3 HE3 CD2
C   O   N   H   CA  HA  CB  HB2 HB3 CG  HG2 HG3 CD  OE1 OE2 C   O   N   H   CA
HA  CB  HB  CG1 HG11HG12HG13CG2 HG21HG22HG23C   O   N   H   CA  HA  CB  HB2 HB3
CG  HG2 HG3 CD  OE1 NE2 HE21HE22C   O   N   H   CA  HA  CB  HB2 HB3 CG  HG  CD1
HD11HD12HD13CD2 HD21HD22HD23C   O   N   H   CA  HA2 HA3 C   O   N   H   CA  HA
CB  HB  CG2 HG21HG22HG23CG1 HG12HG13CD1 HD11HD12HD13C   O   N   CD  HD2 HD3 CG
HG2 HG3 CB  HB2 HB3 CA  HA  C   O   N   H   CA  HA  CB  HB2 HB3 CG  ND1 CE1 HE1
NE2 HE2 CD2 HD2 C   O   N   CD  HD2 HD3 CG  HG2 HG3 CB  HB2 HB3 CA  HA  C   O
N   H   CA  HA  CB  HB1 HB2 HB3 C   O   N   H   CA  HA2 HA3 C   O   N   H   CA
HA  CB  HB2 HB3 CG  HG  CD1 HD11HD12HD13CD2 HD21HD22HD23C   O   N   H   CA  HA
CB  HB2 HB3 CG  HG2 HG3 CD  HD2 HD3 CE  HE2 HE3 NZ  HZ1 HZ2 HZ3 C   O   N   H
CA  HA  CB  HB2 HB3 CG  HG2 HG3 CD  HD2 HD3 CE  HE2 HE3 NZ  HZ1 HZ2 HZ3 C   O
N   H   CA  HA  CB  HB2 HB3 CG  HG2 HG3 CD  HD2 HD3 CE  HE2 HE3 NZ  HZ1 HZ2 HZ3

I'll do that with every fragment and keep you posted. Thanks for all the comments

SofiaBariami avatar Jan 09 '20 11:01 SofiaBariami

The length of the prm file was because of a bug at my script that was reading the parameters from the xml file. The problem was found at the 5-membered rings of the protein residues. The atom 1 and atom 3 of the ring can be considered either angled or dihedraled, depending on the trend (clockwise or anti-clockwise) we count them. My script did not take that into account, so many atoms were considered dihedraled and were assigned 1-4 interactions, when they were just angled (considering the shortest path convention). Now the fragments can be visualised with vmd as well.

SofiaBariami avatar Jan 09 '20 15:01 SofiaBariami

Hi, just a few comments. As you've noticed when converting QUBE into xml file format, we've treated a protein just as one big organic molecule, where every atom has the same residue number and name (because we need each atom to have unique nonbonded parameters). The atom names do then go to H1000 etc, but I guess we could limit to 4 characters with a bit of thought if it is causing problems. However, if you've got around it that sounds fine as long as we can still define each atom with its own charge etc. I will have to remind myself of the rest of the prm file format - feel free to send the full prm file across. I have also flagged this thread to Vadiraj who has implemented qube in xml format and may have comments.

djcole56 avatar Jan 09 '20 17:01 djcole56

You don’t need spaces around the names as 20A4 means read 20 4-character strings. This means you could have up to 9999 atom types. You can change the format too, e.g. 20A5 if you needed, although this may break some parsers (but not BioSimSpace - we do actually read and interpret these format statements).

However - if I remember correctly, the way non-bonded terms work is that the prm file includes the full atom_i x atom_j combined LJ parameters. This means that if you have 6000 different atom parameters then you will have 36 million Aij and Bij values for the LJ equation and explicitly included in the prm file. You will certainly have LJ pair matrices that have 36 million values in memory. Not to mention that every single bond, angle and dihedral will have its own parameter (as they are based on the combination of atom types), and so this will inflate the size of the prm file and the amount of memory consumed by the model in a simulation program...

chryswoods avatar Jan 10 '20 06:01 chryswoods

@chryswoods Great! This is good to know. I guess I never really had to know much about amber formats, or parsing amber files. Makes sense.

ppxasjsm avatar Jan 10 '20 09:01 ppxasjsm

Yes - I regret that a chunk of my brain and far too much of my time was lost to trying to make sense of molecular file formats... :-(

chryswoods avatar Jan 10 '20 17:01 chryswoods