metaerg
metaerg copied to clipboard
master.tsv.txt has no entries for kegg and metacyc pathway
This issue is possibly the same as in #4 but since it has not received any attention, I will open this issue.
After fixing my workflow as described in #10, both with and without docker the workflow runs without errors.
However, the master.tsv.txt
file has no entry in the columns kegg_pathway or metacyc_pathway, but those are annotated in master.gff.txt
(example for one entry below)
gnl|Prokka|BS_1 Prodigal_v2.6.3 CDS 3283 4338 . - 0 ID=metaerg.pl|00004;allec_ids=1.18.1.2;allgo_ids=GO:0016491,GO:0055114,GO:0004324,GO:0050660,GO:0050661;allko_ids=K00382,K00384,K00266;genomedb_OC=d__Bacteria%3Bp__Bacteroidota%3Bc__Bacteroidia%3Bo__Flavobacteriales%3Bf__Flavobacteriaceae%3Bg__MED-G13%3Bs__MED-G13 sp002697255;genomedb_acc=GCA_002697255.1;kegg_pathway_id=00240,00260,00251,00910,00010,00252,00620,00280,00020;kegg_pathway_name=Pyrimidine metabolism,Glycine%2C serine and threonine metabolism,Glutamate metabolism,Nitrogen metabolism,Glycolysis / Gluconeogenesis,Alanine and aspartate metabolism,Pyruvate metabolism,Valine%2C leucine and isoleucine degradation,Citrate cycle (TCA cycle);metacyc_pathway_id=PWY-101,PHOTOALL-PWY;metacyc_pathway_name=photosynthesis light reactions%3B,oxygenic photosynthesis%3B;metacyc_pathway_type=Electron-Transfer%3B Photosynthesis%3B,Photosynthesis%3B Super-Pathways%3B;pfam_acc=PF03486,PF13450,PF00070,PF07992,PF13738;pfam_desc=HI0933-like protein,NAD(P)-binding Rossmann-like domain,Pyridine nucleotide-disulphide oxidoreductase,Pyridine nucleotide-disulphide oxidoreductase,Pyridine nucleotide-disulphide oxidoreductase;pfam_id=HI0933_like,NAD_binding_8,Pyr_redox,Pyr_redox_2,Pyr_redox_3;sprot_desc=Ferredoxin--NADP reductase;sprot_id=sp|A5FJT9|FENR_FLAJ1
and in master.tbl.txt
4338 3283 CDS
ID metaerg.pl|00004
allec_ids 1.18.1.2;
allgo_ids GO:0016491; GO:0055114; GO:0004324; GO:0050660; GO:0050661;
allko_ids K00382; K00384; K00266;
genomedb_OC d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Flavobacteriales;f__Flavobacteriaceae;g__MED-G13;s__MED-G13 sp002697255;
genomedb_acc GCA_002697255.1;
genomedb_target db:genomedb|GCA_002697255.1|MAP99322.1 1 350 evalue:3.5e-155 qcov:99.70 identity:75.70;
kegg_pathway_id 00240; 00260; 00251; 00910; 00010; 00252; 00620; 00280; 00020;
kegg_pathway_name Pyrimidine metabolism; Glycine, serine and threonine metabolism; Glutamate metabolism; Nitrogen metabolism; Glycolysis / Gluconeogenesis; Alanine and aspartate metabolism; Pyruvate metabolism; Valine, leucine and isoleucine degradation; Citrate cycle (TCA cycle);
metacyc_pathway_id PWY-101; PHOTOALL-PWY;
metacyc_pathway_name photosynthesis light reactions;; oxygenic photosynthesis;;
metacyc_pathway_type Electron-Transfer; Photosynthesis;; Photosynthesis; Super-Pathways;;
pfam_acc PF03486; PF13450; PF00070; PF07992; PF13738;
pfam_desc HI0933-like protein; NAD(P)-binding Rossmann-like domain; Pyridine nucleotide-disulphide oxidoreductase; Pyridine nucleotide-disulphide oxidoreductase; Pyridine nucleotide-disulphide oxidoreductase;
pfam_id HI0933_like; NAD_binding_8; Pyr_redox; Pyr_redox_2; Pyr_redox_3;
pfam_target db:Pfam-A.hmm|PF03486.14 evalue:2.4e-06 score:31.2 best_domain_score:19.2 name:HI0933_like; db:Pfam-A.hmm|PF13450.6 evalue:0.0014 score:23.4 best_domain_score:21.5 name:NAD_binding_8; db:Pfam-A.hmm|PF00070.27 evalue:2.8e-11 score:48.4 best_domain_score:45.1 name:Pyr_redox; db:Pfam-A.hmm|PF07992.14 evalue:7.1e-29 score:105.7 best_domain_score:105.3 name:Pyr_redox_2; db:Pfam-A.hmm|PF13738.6 evalue:6.1e-23 score:86.2 best_domain_score:74.4 name:Pyr_redox_3;
sprot_desc Ferredoxin--NADP reductase;
sprot_id sp|A5FJT9|FENR_FLAJ1;
sprot_target db:uniprot_sprot|sp|A5FJT9|FENR_FLAJ1 1 350 evalue:1.5e-144 qcov:99.70 identity:70.00;
Any idea why this is the case and how this can be fixed?
Thank you very much.
Hello, I have posted this issue seven days ago, no activity so far. Could you tell me if this is going to be fixed any time soon? Any response would be highly appreciated. Thank you.
Hi, I don't know what will happen with this project in the future, but I will provide you with a fix. If I can't get it working within metaerg, I will give you an R script to extract this columns from the gff/tbl file. Best wishes, Lukas Jansen
Ok, I found the problem. The gene2pathways
hash(map) is always empty, the predictKegg/Metacyc functions have a local variable of the same name. Only the master.tsv file uses this hash. The pathways are added to the feature hash (f
). This hash is used for the other files.
Replace this in bin/output_reports.pl
:
https://github.com/xiaoli-dong/metaerg/blob/7d6f785dab2b776db0209f35eb6d974a67df1290/bin/output_reports.pl#L149-L156
With
https://github.com/xiaoli-dong/metaerg/blob/0f0d46995b62d12417c63ef4937d1b10ef12ec1b/bin/output_reports.pl#L143-L144
oh cool, thank you very much for your help, testing it now. :-)
almost solved: this is the current output (only showing the columns 2, 8 and 9):
feature_id kegg_pathways metacyc_pathways
BS|00001
BS|00002 00260
BS|00003 00650;00640;00280;00071;00632;03320;00930;01040;00410;00380;00310;00592
BS|00004 00240;00910;00252;00280;00020;00260;00251;00620;00010 PHOTOALL-PWY;PWY-101
How can the IDs be translated?
Good morning, The MetaCyc pathways don't really have names, at least easily available. Two ideas:
- The hack:
Instead of
kegg_pathway_id
usekegg_pathway_name
inoutput_reports.pl
- R with minpath data
#!/usr/bin/env Rscript
readMaster = function(infile) {
require(data.table)
master = fread(master_tsv_path, fill = T, sep = "\t")
# Weird col V17
if ("V17" %in% colnames(master) && master[, all(is.na(V17))])
master[, V17 := NULL]
master
}
translateIds = function(tolookup, lookuptable, sep = ";") {
require(data.table)
unnested = data.table(elements = strsplit(tolookup, sep),
row = seq_along(tolookup))
unnested = unnested[, .(elements = unlist(elements)), by = row]
res = merge(
unnested,
lookuptable,
by.x = "elements",
by.y = colnames(lookuptable)[[1]],
all.x = T
)
res = res[, .(collapsed = paste0(get(colnames(lookuptable)[[2]]), collapse = sep)), by =
row]
setorder(res, "row")
res$collapsed
}
readMinPath = function(infile) {
require(data.table)
minpath = fread(infile,
header = F,
fill = T,
sep = "")
# https://omics.informatics.indiana.edu/MinPath/output-readme.txt
minpath_regex = "path (\\d+) (.+) naive (0|1) minpath (0|1) fam0 (\\d+) fam-found (\\d+) name (.+)"
minpath_colnames = c(
"pathway_number",
"reconstruction",
"naive_recon",
"minpath_recon",
"total_fam_count",
"hit_fam_count",
"pathway_name"
)
minpath[, V1 := sub(minpath_regex, "\\1;\\2;\\3;\\4;\\5;\\6;\\7", V1)]
minpath[, (minpath_colnames) := tstrsplit(V1, ";", fixed = T)]
minpath[, V1 := NULL]
minpath
}
master_tsv_path = "master.tsv.txt"
result_path = "translated.tsv"
minpath_path = "cds.gene2ko.minpath.txt"
minpath = readMinPath(minpath_path)
lookuptable = minpath[, .(pathway_number, pathway_name)]
master = readMaster(master_tsv_path)
master[, kegg_pathways := translateIds(kegg_pathways, lookuptable)]
fwrite(master, result_path)
thank you for your help. I used the first hack and can confirm that it worked for me. I changed line 143 as follows:
my @kegg_pathways = $f->get_tag_values("kegg_pathway_name") if $f->has_tag("kegg_pathway_name");