SO-Ontologies
request for new terms related to bonds
Hi SO,
We're trying to improve our mappings of INSDC to SO terms, and I'd like to request a few additions and revisions related to protein bonds. These aren't exactly INSDC because INSDC only covers nucleotides, but they do exist in NCBI for some annotated proteins (e.g. from SwissProt or RefSeq), so I'd like to be able to map them.
disulfide_bond (SO:0001088) is under the "Obsolete Terms" part of the tree, but there doesn't seem to be a replacement. Am I missing one? We'd like to be able to generate GFF3 for annotated proteins, which may have disulfide bond features, so this would be a useful term.
Same for cross_link (SO:0001087)
NCBI also has:
- thioester bond
- thiolester bond -- this is very rarely used
- other
Depending on whether disulfide_bond and cross_link are ok to use, how about adding terms for "thioester_bond", and "bond" as a parent of disulfide_bond, cross_link, and thioester_bond?
Thanks,
-Terence Murphy
The thing about bonds is that they are not continuous sequence; they are connections between different parts of a sequence or between different sequences, so they seemed not to quite fit in the SO. How are they used in INSDC? How do you handle residues that are separate on the sequence but belong to one connected feature?
Hi Karen,
Good point -- I was just thinking about SO and not what the full details of the representation should be.
NCBI has two sources of protein bond information: SwissProt/UniProtKB and PDB. UniProt's documentation on their different forms of disulfide bonds is here: http://www.uniprot.org/help/disulfid
That's specific for disulfide bonds, but I think it is generalizable. It breaks down into three bins:
1. Intrachain
2. Interchain, in a homodimer or two chains of the same precursor
3. Interchain, in a heterodimer
1 & 2 can both be expressed in terms of locations on the same seq-id.
It looks like Intrachain is their default. So a NCBI flatfile representation like this: http://www.ncbi.nlm.nih.gov/protein/Q43495.1
Bond bond(41,77)
/bond_type="disulfide"
/experiment="experimental evidence, no additional details
recorded"
/note="{ECO:0000250}."
could be represented as two rows in GFF3, with the same ID:
Q43495.1 SwissProt disulfide_bond 41 41 . + . ID=1
Q43495.1 SwissProt disulfide_bond 77 77 . + . ID=1
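A minimal sketch of that mapping, assuming tab-separated GFF3 columns and a caller-supplied ID. The function name `bond_to_gff3_rows` is hypothetical and is not part of any NCBI tool; it only illustrates emitting one single-residue row per bonded position with a shared ID.

```python
def bond_to_gff3_rows(seqid, source, so_term, positions, feature_id):
    """Return one GFF3 row per bonded residue; all rows share the same ID."""
    rows = []
    for pos in positions:
        columns = [
            seqid,             # column 1: sequence ID
            source,            # column 2: source
            so_term,           # column 3: feature type (SO term)
            str(pos),          # column 4: start (single residue)
            str(pos),          # column 5: end (same residue)
            ".",               # column 6: score
            "+",               # column 7: strand
            ".",               # column 8: phase
            f"ID={feature_id}",  # column 9: attributes
        ]
        rows.append("\t".join(columns))
    return rows


rows = bond_to_gff3_rows("Q43495.1", "SwissProt", "disulfide_bond", [41, 77], 1)
print("\n".join(rows))
```

The same call with a single position, e.g. `[362]`, would produce the one-row form used for a bond between the same site in a homodimer.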
Interchain between the same site in a homodimer looks like this in a flatfile view: http://www.ncbi.nlm.nih.gov/protein/P25703.1
Bond bond(362)
/gene="bmp2-a"
/bond_type="disulfide"
/experiment="experimental evidence, no additional details
recorded"
/note="Interchain. {ECO:0000250}."
or between different chains of the same sequence: http://www.ncbi.nlm.nih.gov/protein/P11140.2
Bond bond(247,269)
/bond_type="disulfide"
/experiment="experimental evidence, no additional details
recorded"
/note="Interchain (between A and B chains)."
Those are both a bit vague IMHO. Pragmatically, we'd likely just output GFF3 equivalent to the Intrachain form:
P25703.1 SwissProt disulfide_bond 362 362 . + . ID=2;Note=Interchain. {ECO:0000250}.
P11140.2 SwissProt disulfide_bond 247 247 . + . ID=3;Note=Interchain (between A and B chains).
P11140.2 SwissProt disulfide_bond 269 269 . + . ID=3;Note=Interchain (between A and B chains).
But perhaps something like the Target= attribute would be a better representation (with the caveat that figuring it out involves interpreting the Interchain term in the /note):
P25703.1 SwissProt disulfide_bond 362 362 . + . ID=4;Target=P25703.1 362 362 +;Note=Interchain. {ECO:0000250}.
P11140.2 SwissProt disulfide_bond 247 247 . + . ID=5;Target=P11140.2 269 269 +;Note=Interchain (between A and B chains).
P11140.2 SwissProt disulfide_bond 269 269 . + . ID=5;Target=P11140.2 247 247 +;Note=Interchain (between A and B chains).
The third bin of Interchain between two different sequences can be represented in our ASN.1 structure (I think it would show up as an inter-sequence join of some sort), but we don't seem to have any actual examples and the source data in SwissProt only represents the other sequence as an imprecise text string. Here's how it looks in our representation (which isn't that different than the UniProt source): http://www.ncbi.nlm.nih.gov/protein/P22030.2 http://www.ncbi.nlm.nih.gov/protein/P22029.2
Bond bond(75)
/bond_type="disulfide"
/experiment="experimental evidence, no additional details
recorded"
/note="Interchain (with C-80 in alpha chain)."
Bond bond(80)
/bond_type="disulfide"
/experiment="experimental evidence, no additional details
recorded"
/note="Interchain (with C-75 in beta chain)."
So that's actually problematic for my proposal to use Target, because it's not entirely obvious how to distinguish the Interchain cases that are on the same sequence vs. two different sequences, and what seq-id to use for the other sequence.
So now having worked through it, we'd probably just use the format I proposed in my first two examples (IDs 1, 2, and 3). It's not as robust as one would like, but that's the nature of the data and at least it can be interpreted to say "this residue is involved in a disulfide (or other type) bond."
Sound good? It does mean we'd still like to use the old SO terms for disulfide_bond and cross_link, and add "thioester_bond" and "bond".
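For what it's worth, a consumer of the shared-ID format could recover the sites of each bond by grouping rows on the ID attribute. This is only a sketch under the assumption that rows are tab-separated and that column 9 uses `;`-separated `tag=value` pairs; `group_sites_by_id` is a hypothetical helper name.

```python
from collections import defaultdict


def group_sites_by_id(gff3_lines):
    """Map (seqid, ID) to the list of residue positions annotated with that ID."""
    sites = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and directives/comments
        cols = line.rstrip("\n").split("\t")
        # Column 9: split into tag=value pairs; split only on the first "=".
        attrs = dict(field.split("=", 1) for field in cols[8].split(";") if field)
        sites[(cols[0], attrs["ID"])].append(int(cols[3]))
    return dict(sites)


example = [
    "Q43495.1\tSwissProt\tdisulfide_bond\t41\t41\t.\t+\t.\tID=1",
    "Q43495.1\tSwissProt\tdisulfide_bond\t77\t77\t.\t+\t.\tID=1",
    "P25703.1\tSwissProt\tdisulfide_bond\t362\t362\t.\t+\t.\tID=2;Note=Interchain. {ECO:0000250}.",
]
grouped = group_sites_by_id(example)
print(grouped)
```

A group with two positions reads as "these two residues are bonded"; a group with one position only says the residue takes part in a bond, which matches the limits of the source data.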
-Terence
I think there are two things here. There are two residues and there is a connection between them.
In your examples, I do not see the connection between two regions; I see the annotation of a place that is one end of a connection. Am I reading that right? Could we have something in col 9 that would tell you the position of the other end? (This would be where you put the intrachain and homodimer stuff.)
Do we need to also have a term for the actual linkage or not?
I think the terms for the sequences are:
- disulfide_bond_region (or residue)
- crosslink_region
- thioester_bond_region
- bond_region
--K
A bond is marking the two sites at the ends of the bond, and describing the connection. I think it's always a single residue on each end, so we're always talking about discrete sites and not regions. That is, "bond(41,77)" is connecting the single residue at position 41 to the single residue at position 77. It's not "bond(41..77)", which would convey a region.
I don't think that's conveyed very well by the "Graphical view" shown by UniProt, which makes it look like a region, but I think that's an artifact of using the same table formatting for both the "Molecule processing" (which shows regions of the sequence) and "Amino acid modifications" (which is for single residues): http://www.uniprot.org/uniprot/Q43495#ptm_processing
It makes a bit more sense in this NCBI graphical view: http://www.ncbi.nlm.nih.gov/protein/5902675?report=graph
But I see the NCBI display would benefit from some different graphical representation of Interchain:
Interchain, same site example: https://goo.gl/HgsoaY
Interchain, different sites example: https://goo.gl/fsKuk8
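To make the site-versus-region distinction concrete, here is a rough sketch that accepts only the comma-separated point form of a flatfile bond location, such as "bond(41,77)" or "bond(362)", and rejects a range such as "bond(41..77)". The parser name `parse_bond_sites` and its regular expression are assumptions for illustration; they do not cover nested locations like join(bond(307),bond(310)).

```python
import re

# Matches "bond(N)" or "bond(N,M,...)" where each N is a single residue position.
_BOND_RE = re.compile(r"^bond\((\d+(?:,\d+)*)\)$")


def parse_bond_sites(location):
    """Return the discrete residue positions named in a bond(...) location."""
    match = _BOND_RE.match(location.strip())
    if match is None:
        raise ValueError(f"not a list of single-residue bond sites: {location!r}")
    return [int(pos) for pos in match.group(1).split(",")]


print(parse_bond_sites("bond(41,77)"))
print(parse_bond_sites("bond(362)"))
```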
As for the markup, what do you think of something like the following? The easily-consumed info in columns 3/4/5 tell you that this residue is involved in a bond. The column 9 attributes provide info on what it's connected to, in all cases. I'm still using the same ID for the two rows on each side of the bond, for both Intrachain and Interchain, but I could see not doing that for Interchain.
Q43495.1 SwissProt disulfide_bond 41 41 . + . ID=1;bond_target=Q43495.1 77 77 +;bond_type=Intrachain
Q43495.1 SwissProt disulfide_bond 77 77 . + . ID=1;bond_target=Q43495.1 41 41 +;bond_type=Intrachain
P25703.1 SwissProt disulfide_bond 362 362 . + . ID=2;bond_target=P25703.1 362 362 +;bond_type=Interchain
P11140.2 SwissProt disulfide_bond 247 247 . + . ID=3;bond_target=P11140.2 269 269 +;bond_type=Interchain;Note=Interchain (between A and B chains).
P11140.2 SwissProt disulfide_bond 269 269 . + . ID=3;bond_target=P11140.2 247 247 +;bond_type=Interchain;Note=Interchain (between A and B chains).
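Generating rows like these could look roughly like the sketch below. It assumes the `bond_target` and `bond_type` attributes as proposed in this thread (they are not defined by the GFF3 specification, which leaves lowercase tags free for application use) and assumes both ends are on the same seq-id. The function name `bond_rows_with_targets` is hypothetical.

```python
def bond_rows_with_targets(seqid, source, so_term, positions,
                           feature_id, bond_type, note=None):
    """Emit one row per bonded residue, each pointing at the other end."""
    if len(positions) not in (1, 2):
        raise ValueError("expected one site (same-site bond) or two sites")
    rows = []
    for i, pos in enumerate(positions):
        # With two sites the partner is the other one; with a single site
        # (same residue in both chains of a homodimer) it points at itself.
        partner = positions[1 - i] if len(positions) == 2 else pos
        attrs = [
            f"ID={feature_id}",
            f"bond_target={seqid} {partner} {partner} +",
            f"bond_type={bond_type}",
        ]
        if note:
            attrs.append(f"Note={note}")
        cols = [seqid, source, so_term, str(pos), str(pos), ".", "+", ".",
                ";".join(attrs)]
        rows.append("\t".join(cols))
    return rows


for row in bond_rows_with_targets("P11140.2", "SwissProt", "disulfide_bond",
                                  [247, 269], 3, "Interchain",
                                  note="Interchain (between A and B chains)."):
    print(row)
```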
And this reminded me of one more bond type that isn't in SO: heterogen (abbreviated "Het"). These originate from the PDB database, and are described here: http://www.bmsc.washington.edu/CrystaLinks/man/pdb/part_37.html
And NCBI's definition is here: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/SEQFEAT.HTML
"In the PDB structural database, non-biopolymer atoms associated with a Bioseq are referred to as "heterogens". When a heterogen appears as a feature, it is assumed to be bonded to the sequence positions in Seq-feat.location. If there is no specific bonding information, the heterogen will appear as a descriptor of the Bioseq. The Seq-loc for the Seq-feat.location will probably be a point or points, not a bond. A Seq-loc of type bond is between sequence residues."
Basically, it's representing bonds to things that aren't proteins, like Zn or chlorophyll or heme, with one or more points on the protein that are bonded to the small molecule. Here's an example: http://www.ncbi.nlm.nih.gov/protein/5FQD_E
Het join(bond(307),bond(310),bond(375),bond(378))
/heterogen="( ZN,1002 )"
We had some discussion a while back about changing this to:
Het order(307,310,375,378)
/heterogen="( ZN,1002 )"
I haven't been able to figure out what the "1002" refers to -- it may be a unique ID for a particular bonding point on the ion. I did find this page with info about the small molecules (and others) with the abbreviations used in the data we get from PDB: http://www.rcsb.org/pdb/ligand/ligandsummary.do?hetId=ZN
A very simplistic representation (matching my knowledge of the subject) could be like this:
5FQD_E PDB heterogen 307 307 . + . ID=7;heterogen=( ZN,1002 )
5FQD_E PDB heterogen 310 310 . + . ID=7;heterogen=( ZN,1002 )
5FQD_E PDB heterogen 375 375 . + . ID=7;heterogen=( ZN,1002 )
5FQD_E PDB heterogen 378 378 . + . ID=7;heterogen=( ZN,1002 )
And all that would need is an SO_term "heterogen", as another child of "bond", and it would at least convey which sites on the protein are involved in the bonding event and a user could deduce what molecule it's binding to (and probably not have a clue what the "1002" means). And it wouldn't take any special coding to generate.
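A rough sketch of generating the heterogen rows, with one deliberate difference from the rows above: the GFF3 specification reserves the comma in column 9 for separating multiple values, so this version percent-encodes it as %2C. The `heterogen` feature type and attribute are only the proposal from this thread, and `heterogen_rows` is a hypothetical name.

```python
from urllib.parse import quote


def heterogen_rows(seqid, positions, feature_id, heterogen):
    """Emit one row per residue bonded to the heterogen, sharing one ID."""
    # Keep spaces and parentheses readable; the comma becomes %2C.
    value = quote(heterogen, safe=" ()")
    attrs = f"ID={feature_id};heterogen={value}"
    return [
        "\t".join([seqid, "PDB", "heterogen", str(pos), str(pos),
                   ".", "+", ".", attrs])
        for pos in positions
    ]


het_rows = heterogen_rows("5FQD_E", [307, 310, 375, 378], 7, "( ZN,1002 )")
print("\n".join(het_rows))
```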
-Terence
Hi Terence,
Thanks for clarifying. I am still struggling with this, honestly. The thing you want to convey is not contiguous. It is at two places. We could have terms for the residues that are the points of contact. These would be children of polypeptide region.
The actual bond does not fit the definition of a sequence feature: "Any extent of continuous biological sequence." That is because it is the link between two separate sequence features, and not necessarily on the same chain.
GFF3 does not have a way to annotate a bond as a feature - as each line corresponds to a feature, not a collection of features. I think this is true for all flavors of GFF/GTF.
We could annotate the bond_site (or bond_start_site and bond_end_site) and then note in column 9 what the target is.
We could also add heterogen as a kind of bond_site.
Does this make sense? Sorry it is not elegant.
--Karen