Issue with NEB and NEBTS on google colab
Hello RagnarB83!
I would like to report an issue I am currently facing while working with ASH on Google Colab. ASH seems to be generating the images for NEB just fine (when I download the idpp_interpolation_path, the images are identical to the ones I obtained after running ASE on the same reactants and products), but after the initial SCF energy calculations something seems to go wrong. The error reads, for both PySCF using CPU and GPU4PySCF: "UnboundLocalError: cannot access local variable 'checkpoint' where it is not associated with a value".
I ran into the error for both NEB and NEBTS, and I have attached the full notebook here, which also displays the respective outputs: https://colab.research.google.com/drive/1Mnu9vouI-xloaNgZl5TCUA6hSEPavrTX?usp=sharing
I do not have access to computing clusters and use a medium-spec laptop, so I have not run the same calculations locally. I therefore do not know if this is an ASH issue, a Colab issue, or an issue with the way I have organized the code.
Could you kindly help me with solving this? I intend to use these colab notebooks for QM and QM/MM simulations of enzyme active sites later on.
Thanks, Neil
Hi Neil, thanks for reporting. It was a small bug in the PySCF interface that had gone unnoticed for a couple of weeks. I just checked in a bugfix to the NEW branch. Please try it out in the NEW branch.
Fix will be merged with the master branch in 1-2 weeks.
Greetings sir,
Thank you very much for your response! Before your response, however, I had changed the printlevel from printlevel=0 to printlevel=2, and that allowed the calculation to run. I am now trying to run NEBTS again but with xtbtheory, and have run into a different issue, the screenshot of which I have included below:
I have also copy-pasted the input code cells and output to the same Google Colab notebook: https://colab.research.google.com/drive/1Mnu9vouI-xloaNgZl5TCUA6hSEPavrTX?usp=sharing
Could you kindly again help me with this?
Thank you, Neil
Yes, there was a bug related to printing, so changing the printlevel would have worked. This new error is related to the QM program, xTB, failing to converge the SCF for that particular geometry. When this happens, it is usually because the geometry is difficult (atoms too close, bonds overstretched, etc.), which usually means something has gone wrong. Since it is an NEB job, the geometry of this particular image from this particular NEB iteration has been predicted to be bad. I would take a look at the geometry and confirm this. I would also take a look at the current pathway for the last iteration, which is written to disk in every iteration as: knarr_current.xyz
Inspecting this trajectory may reveal the problem, and in fact inspecting the geometries from the original interpolation or the first NEB iterations is recommended to make sure things are behaving well.
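To make this kind of inspection quicker, here is a small standalone Python sketch (not part of ASH) that reads a multi-frame XYZ trajectory such as knarr_current.xyz and reports the shortest interatomic distance in each image; unusually short distances flag exactly the kind of bad geometry described above. The filename is taken from the discussion; the parsing assumes standard XYZ format.

```python
# Illustrative sketch (not ASH code): scan a multi-frame XYZ trajectory and
# report the shortest interatomic distance per image, to spot geometries
# where atoms have come too close together.
import math

def read_xyz_frames(filename):
    """Read a multi-frame XYZ file; yield (elements, coordinates) per frame."""
    with open(filename) as f:
        lines = f.readlines()
    i = 0
    while i < len(lines):
        natoms = int(lines[i].split()[0])
        frame = lines[i + 2:i + 2 + natoms]   # skip count + comment line
        elems = [l.split()[0] for l in frame]
        coords = [tuple(map(float, l.split()[1:4])) for l in frame]
        yield elems, coords
        i += 2 + natoms

def min_distance(coords):
    """Shortest pairwise distance (Angstrom) in one geometry."""
    return min(math.dist(a, b)
               for i, a in enumerate(coords)
               for b in coords[i + 1:])

# Usage (assuming the trajectory is in the working directory):
# for n, (elems, coords) in enumerate(read_xyz_frames("knarr_current.xyz")):
#     print(f"Image {n}: min distance = {min_distance(coords):.3f} Angstrom")
```

Anything far below a normal bond length (roughly under 0.8 Angstrom) in an image is a likely culprit for the SCF convergence failure.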
The reasons why this happens could be:
- Reactants and products were not minimized at the same level of theory, resulting in a badly matched energy surface.
- Atom indices for reactant and product do not match.
- The theory level is not capable of describing the reaction well.
- Too few images to describe a complicated reaction well enough; increasing the number of images may help.
Greetings sir,
Thank you for your response, and apologies for my (very) late reply. I fixed the issue by minimizing the reactants and products, and subsequently performing all calculations, using xTB, as you suggested. This seems to have fixed the problem, and I got an estimate of the TS using the NEBTS function, though there are still errors in the structure that I have to try to fix. Thanks for the suggestions!
I am currently also attempting to run QM/MM calculations, and there are two issues I have faced:
- If I use CHARMM-GUI to build my simulation box for QM/MM, everything goes according to plan except for the last step, when I have to define the QM atoms: ASH fails to read Cu and instead reads it as C. For reference, the line in the PDB file is as follows:
<ATOM 1832 CU2P CU2P 201 -9.803 -2.129 5.729 1.00 8.15 HETA>
I tried messing around with the format, but ASH read the same atom as H, He, or C.
I do not know if this is a bug in how ASH reads the PDB file, or if a separate input needs to be provided for QM/MM. I have attached the PDB file generated by CHARMM-GUI for reference. I eventually had to build the molecule using the OpenMM Modeller instead due to this issue.
psf_protonated_psazurin_step3pbc.pdb
- For running QM/MM with PySCF theory using GPU4PySCF, ASH says that I require PBC_lattice_vectors for the system. I apologize if this is a basic question, but how does one compute or obtain these vectors? Additionally, could documentation on this part be added to the PySCF theory page?
I had one last clarification. In QM/MM, you separate between the active region and the QM region. My assumption, based on the explanation on the QM/MM theory webpage, is that the active region is treated with MM forcefields. Is this assumption correct? Additionally, since the active region needn't cover the entire protein, you often have a QM (QM region) - MM (active region) - frozen region split during simulations. Is this correct? I apologise again if these are basic questions; I am quite new to all of this.
Thank you again for your help. Have a great day, Neil
Regarding issue 1: The PDB-file reading capabilities in ASH are now in the hands of the OpenMM library, which usually does a pretty good job of parsing information correctly. The issue is that if element information for each atom is not available in the file, then OpenMM has to guess the element symbol, and here it guesses carbon instead of copper (a pretty good guess, actually).
This type of problem can be fixed in two ways. a. Add element information to the PDB file where required (it is odd that CHARMM-GUI did not do this). You simply add the element symbol in the appropriate columns (77-78):
ATOM 1830 OT1 ASN 124 9.212 -9.170 -8.642 1.00 0.00 PROA
ATOM 1831 OT2 ASN 124 10.628 -8.745 -6.979 1.00 0.00 PROA
ATOM 1832 CU2P CU2P 201 -9.803 -2.129 5.729 1.00 8.15 HETACU
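If many lines need fixing, this can be scripted. Here is a minimal sketch (not ASH code, just plain Python) that writes an element symbol into columns 77-78 of a PDB record, as described in option a; the helper name is made up for illustration.

```python
# Illustrative sketch: place an element symbol in columns 77-78 (1-indexed)
# of a PDB ATOM/HETATM record, right-justified per the PDB format.
def set_pdb_element(line, element):
    """Return `line` with `element` in columns 77-78, padding if needed.
    Note: any content beyond column 76 in the input is overwritten/dropped."""
    line = line.rstrip("\n")
    padded = line.ljust(76)[:76]   # pad (or cut) to exactly 76 characters
    return padded + element.rjust(2)

# Usage: rewrite the copper HETATM line from the attached PDB file.
# fixed = set_pdb_element(original_line, "Cu")
```

This only touches the element field; the atom name and residue name columns are left as-is.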
b. Alternatively, you can manually change the information in the Fragment object after reading the file:
pdbfile="psf_protonated_psazurin_step3pbc_mod.pdb"
frag = Fragment(pdbfile=pdbfile)
frag.elems[1831]='Cu'  # atom indices are 0-based, so this is PDB atom 1832
Regarding issue 2: QM/MM with gpu4pyscf in ASH is a bit untested at the moment, so no guarantee that it works; I will try to get around to it. Since QM/MM pointcharge handling in gpu4pyscf does Ewald summation, one needs to provide periodic information to gpu4pyscf. In the ASH interface you do this via the PBC_lattice_vectors keyword, which should be a 3x3 array with the lattice vectors in Angstrom. If you have a cubic periodic box of your whole solvated biomolecule, then you already have this information. If you have a nonperiodic system, then you have to estimate a suitable box size larger than the system. Here you can see a basic QM/MM gpu4pyscf example: https://github.com/pyscf/gpu4pyscf/blob/master/examples/24-qmmm_pbc.py They define a simple 12 Angstrom cubic box via: np.eye(3)*12
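To make the two cases concrete, here is a short NumPy sketch. The PBC_lattice_vectors keyword name comes from the discussion above; the box-estimation helper and the padding value are assumptions for illustration, not ASH functionality.

```python
# Illustrative sketch: constructing the 3x3 lattice-vector array (Angstrom).
import numpy as np

# Case 1: you already know the cubic box edge, e.g. a 60 Angstrom box
# from a periodic solvated setup.
lattice = np.eye(3) * 60.0

# Case 2: nonperiodic system. One simple estimate: take the largest
# coordinate extent and add padding on each side.
def estimate_cubic_box(coords, padding=10.0):
    """coords: (N, 3) array in Angstrom. Returns a 3x3 cubic lattice that
    encloses the system with `padding` Angstrom of extra space per side."""
    coords = np.asarray(coords, dtype=float)
    extent = (coords.max(axis=0) - coords.min(axis=0)).max()
    return np.eye(3) * (extent + 2 * padding)
```

The resulting array is what would be passed as PBC_lattice_vectors; the gpu4pyscf example linked above uses the same np.eye(3)*12 pattern for a 12 Angstrom cube.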
Final question:
So an active region is actually unrelated to QM/MM theory, but it is relevant when doing typical QM/MM calculations. It simply refers to which atoms are allowed to move during a geometry optimization (or possibly a dynamics simulation). Everything not in the active region is frozen (not allowed to move, but still included in the calculation). If you pass an active region as a list of atom indices to the Optimizer, only those atoms will be optimized and all the other atoms will be frozen (technical note: behind the scenes, the Optimizer only sees the active atoms). Which atoms you define as active is entirely up to you, depending on what you want to do; they can be in the QM region or the MM region. You can also define an active region for a non-QM/MM calculation, e.g. a QM calculation of a simple molecule; this would do a local geometry optimization of a part of the molecule. An active region thus has nothing to do with QM/MM per se.
However, the notion of an active region comes up in QM/MM geometry optimizations (or possibly MD simulations) because the system is large and you have to choose which degrees of freedom to optimize and which to keep frozen. It is actually quite difficult to do a geometry optimization of a full protein (even when doing QM/MM) due to the many degrees of freedom present (a tough minimization problem), so it is seldom done. Instead we define an active region covering the part of the system we care about the most. Since we are doing QM/MM, we already care most about the atoms we have decided belong in the QM region, so obviously the QM-region atoms should be part of the active region. However, we do not have to restrict the active region to the QM region; it can also include the part of the MM region closest to the QM region. This relaxes the otherwise harsh constraint of only minimizing QM atoms. It is common for the active region to be up to 1000 atoms in size (much larger than the QM region, which might be 10-100 atoms). Here is a recent paper discussing the impact of the size of the active region on a reaction profile: https://pubs.acs.org/doi/full/10.1021/acs.jctc.7b00768
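A common recipe for building such an active region is purely geometric: take the QM atoms plus every atom within some cutoff distance of any QM atom. Here is a plain NumPy sketch of that selection (not an ASH function; the helper name and the 8 Angstrom cutoff are illustrative assumptions).

```python
# Illustrative sketch: distance-based active-region selection.
import numpy as np

def active_region_indices(coords, qm_indices, cutoff=8.0):
    """coords: (N, 3) array in Angstrom; qm_indices: indices of QM atoms.
    Returns sorted indices of all atoms within `cutoff` of any QM atom
    (the QM atoms themselves are always included, at distance zero)."""
    coords = np.asarray(coords, dtype=float)
    qm_coords = coords[list(qm_indices)]
    # Pairwise distances from every atom to every QM atom: shape (N, n_qm)
    dists = np.linalg.norm(coords[:, None, :] - qm_coords[None, :, :], axis=-1)
    mask = dists.min(axis=1) <= cutoff
    return sorted(np.flatnonzero(mask).tolist())
```

The resulting index list is the sort of thing one would then hand to the optimizer as the active region.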
If doing MD simulations, we do not have to solve a tough minimization problem; we simply propagate the atomic positions and velocities. We can thus choose not to define an active region at all and allow everything to be active, which is probably the most realistic choice. We can also choose to freeze some regions of the system for a specific reason (i.e., the active region is everything else). When doing MM MD of a metalloprotein, if we do not have full forcefield parameters for the metal cofactor, it is common to freeze the metal active site for that reason.
Hope this helps.