metaerg
metaerg copied to clipboard
issues with KEGG annotations and KO mapping
Hi,
I have used metaerg to annotate metagenome bins, and have found some issues with the KEGG annotations.
I tested out a Methanogen MAG. It has the McrA gene and its correctly annotated in the cds.faa file. However, this does not show up in the KEGG map.
metaerg.pl|00402 Methyl-coenzyme M reductase II subunit alpha OS=s__Methanothermobacter_sp000828575 len=553 cid=Contig_2,9,4,3_consensus_sequence
We then tried to use the cds.gene2ko.mapping.txt The McrA gene comes up here, but this file appears to be over annotated and adds in functions that are not present in the genome - e.g. DsrA, K11180. The gene with this KO added is metaerg.pl|01003 - which is correctly annotated as >metaerg.pl|01003 Coenzyme F420-dependent sulfite reductase OS=s__Methanothermobacter_thermautotrophicus len=689 cid=Contig_1_consensus_sequence in the cds.faa file (KO K21816 with blastkoala)
Have anyone else noted discrepancies like this? I have attached the data folder from the analysis.
Camilla
Hi, For a recent paper and my bachelor thesis we used MetaErg and other annotation tools to annotate anaerobic digestion metagenomes from biogas plants and I have to say that a similar problem often occurs for other pathways as well. Disulfid reductase and other key anaerobic digestion genes appear to be not well represented in annotation databases. Sadly it seems like a human check will always be necessary.
MetaErg performs its KO annotation by a BLAST search against a UniProt database. While this is a valid approach, only reference genes also present in KEGG have KO terms annotated.
I like KofamKOALA. It is also not perfect, but they tried to reduce false positives and can be run offline. Sometimes even the reference genomes miss a annotation, making certain pathways not complete. https://academic.oup.com/bioinformatics/article/36/7/2251/5631907
eggNOGmapper can also provide you with EC numbers and KO terms, but many are automatically generated and appear to be quite questionable sometimes (especially KO terms), but the orthologs can be compared among different genomes and help to identify certain genes https://github.com/eggnogdb/eggnog-mapper
InterProScan pathway annotations also contain EC numbers and those often helped me to complete certain pathways. https://github.com/ebi-pf-team/interproscan
Hopefully this helps in one way or another, Lukas Jansen
MetaErg also adds the FOAM and TIGRFAM KO hits, wasn't sure, but just found it in the source code.
Hi, Thanks for your comments Lukas. What I find the most puzzling (and problematic) with metaerg is the difference in what it draws in the maps provided and what is present in the KO list in the cds.gene2ko.mapping.txt file since I assume they should be identified by the same BLAST search against a UniProt database ? We are also using Metasanity for annotation which seems to arrive at better results for functional annotation. (I really like Metaergs taxonomy of CDSs though, so it would be great if the KEGG part was as useful). Camilla
Hi, Thanks for sharing Metasanity, will have a look. Regarding the maps, maybe MetaErg only uses a subset of annotated KOs? Will search for that in the source.
Found the issue. The report uses the KOs after they have been selected by MinPath, while the cds.gene2ko.mapping.txt
and cds.gene2ko.tab.txt
contain all entries:
Using your data:
Sizes
report 837
gene2ko.mapping 1476
gene2ko.tab 1476
gene2ko_minpath 837
Similarity (%):
report gene2ko.mapping 56.71
report gene2ko.tab 56.71
report gene2ko_minpath 100.0
gene2ko.mapping gene2ko.tab 100.0
gene2ko.mapping gene2ko_minpath 56.71
gene2ko.tab gene2ko_minpath 56.71
#!/usr/bin/env python3
import json
from io import StringIO
import re
import itertools
# Table used for the fam_found entries:
reporttable = 'data/data/kegg.cds.profile.datatable.json'
kolist = 'data/data/cds.gene2ko.mapping.txt'
tab = 'data/data/cds.gene2ko.tab.txt'
minpath = 'data/data/cds.gene2ko.minpath.details'
def kosfromtable(filepath):
with open(filepath, 'rt') as jsonin:
# Memory inefficient, but ok for now
jsonstr = jsonin.read()
jsonstr = re.findall(r'data = (\[.+\])',jsonstr, re.DOTALL)[0]
jsonstr = re.sub(r',\s*}','}',jsonstr)
jsonstr = json.load(StringIO(jsonstr))
kos = {ko for path in jsonstr for ko in path['kos'].split('+')}
return kos
def kosfromgene2ko(filepath,i=1,skip=False):
with open(filepath, 'rt') as jsonin:
if skip:
next(jsonin)
kos = {line.split('\t')[i].strip() for line in jsonin }
return kos
kohit = re.compile(r'^ (K\d{5}) .+')
def kosfromminpath(filepath):
with open(filepath, 'rt') as jsonin:
kos = {line.group(1) for line in map(kohit.match, jsonin) if line is not None}
return kos
def jaccardsim(A: set, B: set):
return len(A.intersection(B))/len(A.union(B))*100
def main():
kos = {
'report': kosfromtable(reporttable),
'gene2ko.mapping': kosfromgene2ko(kolist),
'gene2ko.tab': kosfromgene2ko(tab,8,True),
'gene2ko_minpath': kosfromminpath(minpath)
}
print('Sizes:')
for koset in kos.keys():
print(f'{koset}\t{len(kos[koset])}')
print('\nSimilarity:')
for A, B in itertools.combinations(kos.keys(),2):
print('\t'.join((A,B,str(round(jaccardsim(kos[A],kos[B]),2)))))
if __name__ == '__main__':
main()
Thanks - that makes sense! Thanks you so much for sorting this out for me.