sidechainnet icon indicating copy to clipboard operation
sidechainnet copied to clipboard

Generating SidechainNet angles for arbitrary PDB files

Open AndreeaMusat opened this issue 2 years ago • 6 comments

Hi,

I am interested in using MP-Nerf for a custom dataset and in order to do this I would need to have the functionality of transforming arbitrary PDB entries in the 'SidechainNet format'. Thus, I am looking to learn more about the data processing in SidechainNet, in particular about the angles.

From your documentation, I know that the angles for a protein with L residues will have the shape L x 12, where:

  • backbone torsion angles are L x 3
  • backbone bond angles are L x 3
  • sidechain torsion angles L x 6

My questions would be:

  • is the order of the backbone torsion angles in this format phi, psi, omega?
  • is the order of the bond angles C-N-CA, N-CA-C, and CA-C-N ?
  • do you have some details about which specific angles you count as chi1, chi2 etc? I am asking because in other sources I only see chi angles between chi1 and chi5, whereas you have 6. For example, I would be interested if you are considering the angles from here or more (eg: here they also count altchi1 and altchi2 which are not present in the previous list)
  • finally, do you maybe have some tools to generate the SidechainNet data for arbitrary PDB files or would we have to write our own?

Many thanks! Andreea

AndreeaMusat avatar Apr 20 '22 10:04 AndreeaMusat

Hi Andreea,

Thank you for your interest in SidechainNet! I'm glad you found it could be of use. If you continue your work on MP-Nerf, I would be interested in learning how you used SidechainNet with it.

I realize that the documentation on the master branch is a bit lacking with regard to your questions. I have just pushed an update to the master branch that should describe in detail how the data is organized. For your convenience, the file is here, and I have copied and pasted the relevant excerpt below. I believe that it should answer your first 3 questions.

SIDECHAIN DATA FORMAT
    Angle vectors (NUM_ANGLES) are:
        [phi, psi, omega, n-ca-c, ca-c-n, c-n-ca, x0, x1, x2, x3, x4, x5]
        [   bb torsion  |    bb 3-atom angles   |   sidechain torsion   ]
        Notes:
            - x0 is defined as the torsional angle used to place CB using
            (Ci-1, N, CA, CB) or (Ni+1, Ci+1, CA, CB) depending on whether or not the
            previous or following residue is available for measurement.
            - x5 is used to place NH1 or NH2 of Arginine.
            - if a given angle is not present, it is recorded with a GLOBAL_PAD_CHAR.
    Coordinate vectors (NUM_COORDS_PER_RES x 3) for resname RES are:
        [N, CA, C, O, *SC_BUILD_INFO[RES]['atom_names'], <PAD> * (N_PAD)]
        [ backbone  |          sidechain atoms         |     padding*   ]
        where each vector is padded with GLOBAL_PAD_CHAR to maximum length.
        for example, the atoms for an ASN residue are:
            [N, CA, C, O, CB, CG, OD1, ND2, PAD, PAD, PAD, PAD, PAD, PAD]

To answer your final question, I am happy to push an update with the required code. But before I do, can you verify what you need the function to do?

Here is my guess at what the function should do. Please let me know if I interpreted your request incorrectly. You could also directly use this function in your own code without me updating SidechainNet immediately if you wish.

The code is longer than necessary because I wanted to make sure it could be easily understood.


import prody as pr
import sidechainnet as scn
from sidechainnet.utils.download import get_resolution_from_pdbid

def process_pdb(filename, pdbid="", include_resolution=False):
    """Return a dictionary containing SidechainNet-relevant data for a given PDB file.

    Args:
        filename (str): Path to existing PDB file.
        pdbid (str): 4-letter string representing the PDB Identifier.
        include_resolution (bool, default=False): If True, query the PDB for the protein
            structure resolution based off of the given pdb_id.

    Returns:
        scndata (dict): A dictionary holding the parsed data attributes of the protein
        structure. Below is a description of the keys:

            The key 'seq' is a 1-letter amino acid sequence.
            The key 'coords' is a (L x NUM_COORDS_PER_RES) x 3 numpy array.
            The key 'angs' is a L x NUM_ANGLES numpy array.
            The key 'is_nonstd' is a L x 1 numpy array with binary values. 1 represents
                that the amino acid at that position was a non-standard amino acid that
                has been modified by SidechainNet into its standard form.
            The key 'unmodified_seq' refers to the original amino acid sequence
                of the protein structure. Some non-standard amino acids are converted into
                their standard form by SidechainNet before measurement. In this case, the
                unmodified_seq variable will contain the original (3-letter code) seq.
            The key 'resolution' is the resolution of the structure as listed on the PDB.
    """
    # First, use Prody to parse the PDB file
    chain = pr.parsePDB(filename)
    # Next, use SidechainNet to make the relevant measurements given the Prody chain obj
    (dihedrals_np, coords_np, observed_sequence, unmodified_sequence,
     is_nonstd) = scn.utils.measure.get_seq_coords_and_angles(chain, replace_nonstd=True)
    scndata = {
        "coords": coords_np,
        "angs": dihedrals_np,
        "seq": observed_sequence,
        "unmodified_seq": unmodified_sequence,
        "is_nonstd": is_nonstd
    }
    # If requested, look up the resolution of the given PDB ID
    if include_resolution:
        assert pdbid, "You must provide a PDB ID to look up the resolution."
        scndata['resolution'] = get_resolution_from_pdbid(pdbid)
    return scndata

I am happy to help if you have any more thoughts or concerns!

Cheers, Jonathan

jonathanking avatar Apr 21 '22 20:04 jonathanking

Here is a photo showing my definition of X0. In my quest to make a complete recording of every structure, I had decided to measure this angle (rather than utilize phi) because there was some deviation if I only used phi to describe the position of both the beta Carbon and the carbonyl carbon.

image

jonathanking avatar Apr 21 '22 20:04 jonathanking

Hi Jonathan,

Many thanks for your prompt reply and for your explanations!

Regarding the function that gets a PDB file and generates sidechainnet data -- I think it misses some of the fields (that I wasn't aware of before) that the official SidechainNet data has.

MP-Nerf is using a SidechainNet Dataloader which is built on a SidechainNet ProteinDataset. The scn_data_split dict in ProteinDataset here has more fields than what you provided above in process_pdb() above, e.g.: 'evo', 'ids', 'res', 'mod', 'msk'.

Do you maybe also have the code for extracting/generating these or, if not, could you kindly help us figure these out?

Many thanks again! Andreea

PS: I will follow up in a few months if we manage to get some results using SidechainNet and MP-Nerf :)

AndreeaMusat avatar Apr 22 '22 13:04 AndreeaMusat

Hey Andreea,

You're welcome!

Your question has actually raised a few questions on my end. I can partially answer your question, but I'm sorry to say that I've just realized that the function above is not completely correct.

In particular, the function above does not understand things like PDB files with multiple peptide chains, or SEQRES records (which are necessary to determine missing residue locations and mask info in lieu of a more advanced alternative). As a result, any gaps in a PDB file it processes will be collapsed/ignored. It will work fine for structures without missing residues, but please allow me a moment to correct it for the general case. I'm sorry for the confusion!

Would you mind sharing more information about the files you are using? Can you point me towards them or share an example? I'm curious if they contain SEQRES records and/or multiple chains.

Best, Jonathan

Partial Answer

Some data fields are not generated from SidechainNet's own utilities. This is due to the fact that SidechainNet extends earlier work called ProteinNet. Two of the fields you mentioned are only included in SidechainNet by way of borrowing the data directly from ProteinNet:

  • 'evo': Position Specific Scoring Matrices (PSSMs) + information content
  • 'sec': DSSP Secondary structure information

I would like to have my own methods for generating these fields, but I have not created any at this time.

The key 'ids' is simply the ProteinNet/SidechainNet protein identifier (which may not make sense to generate if your protein is not a part of a predetermined dataset split within ProteinNet/SidechainNet). The format for these is described in the Colab walkthrough here.

The 'res' key is the resolution which I have built into the process_pdb() function. 'mod' is the same as 'is_modified'.

jonathanking avatar Apr 22 '22 15:04 jonathanking

Hi Jonathan,

Thanks again for your help!

To answer your questions:

  • there might be multiple chains -- I'll look if it's possible to only consider proteins with a single chain, not sure how much this will affect the size of the dataset

  • the dataset is ApoBind which is not public yet afaik, but I can share a couple of examples (original: original1, original2, processed: processed1, processed2 ) -- the 'processed' one is obtained by performing a sequence alignment between two (slightly different) conformations of a protein and by keeping the common atoms between the two. As both conformations most likely had missing atoms, the processed one also has missing atoms, but we always make sure that the residues that remain have the C, N and CA backbone atoms.

  • there are no SEQRES records

Actually, at a closer look, for MP-Nerf, I think the important missing field would be 'msk' (it's used here ), I haven't seen the others being used. We'd have to write our own modified version of the dataset + dataloader, but the MP-Nerf stuff would remain unchanged.

AndreeaMusat avatar Apr 22 '22 16:04 AndreeaMusat

Thanks for the follow-up and for sharing some examples! Here are some comments.

SidechainNet and ProteinNet treat all protein chains independently. 1 chain per data entry. Our proposed function would need to return multiple entries if it encounters multiple chains. That should not be too hard to implement.

With regards to the files, here's the issue that perhaps we can figure out together. The 'msk' key is a missing residue mask. There are 1s when entire residues are present, and 0s when entire residues are missing. How can you tell from an arbitrary PDB file what atoms (and/or residues) are missing? In my work, I used Prody to parse the PDB file and observe all the data that was present. Then, by comparing this data with the sequence of that protein provided by ProteinNet (it may also be provided by SEQRES records), we can surmise which residues are missing. I believe you may also be able to tell which residues are missing (or at least where the gaps are) by looking at the distances between alpha-carbons, though I do not know exactly how to do this or if another program has already implemented it.

You tell me that the files have missing atoms. How can one know this by looking at the files? If you know which atoms are missing a priori, then we can use this information to compute the binary missing residue mask that is required.

Please let me know your thoughts!

jonathanking avatar Apr 22 '22 16:04 jonathanking