only CAZy annotation information in distilled files

Open gruningerrj opened this issue 4 years ago • 4 comments

I appear to be having the same issue as #53: richly annotated MAGs, but only CAZy annotation information in the final distillate. I note that this was solved by adding KOs to the KEGG descriptions. However, I am not sure which files contain this information or what the easiest way is to add it to the appropriate files. I would be grateful if anyone could help me figure out how to do this.

Thanks

gruningerrj avatar May 14 '21 14:05 gruningerrj

When you set up DRAM, did you give it the location of your KEGG proteins file, or are you using KOfam?

shafferm avatar May 19 '21 16:05 shafferm

Yes. I gave it the location of the prokaryotes.pep file in the KEGG database. A colleague who used KOfam before we had access to KEGG didn't have any trouble. Could I have used the wrong KEGG protein file?

gruningerrj avatar May 19 '21 16:05 gruningerrj

This may be because of a change in the format of the KEGG pep file headers. We haven't renewed our KEGG subscription for a bit over a year, so that could be causing the issue. Could you run grep '>' prokaryotes.pep | head and share the output? You could potentially fix this issue by providing the --gene_ko_link_loc flag during setup. This takes a two-column file listing all gene IDs and their KO IDs. I can't remember where in the KEGG flat file database it's stored, but I think it's called something like genes_ko.list.gz. I could help you rerun the processing of KEGG with that file added so that you don't need to rerun all of it, if that would help.
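The same header check can be sketched in Python. This is a minimal sketch, assuming the symptom is that the pep headers carry no KO identifiers (a "K" followed by five digits); the function name and the sample records are made up for illustration and are not part of DRAM.

```python
import re
from typing import Iterable, Tuple

# KO identifiers look like "K" followed by five digits, e.g. K01438.
KO_PATTERN = re.compile(r"\bK\d{5}\b")


def count_headers_with_ko(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (headers_with_ko, total_headers) for FASTA-formatted lines."""
    with_ko = 0
    total = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        total += 1
        if KO_PATTERN.search(line):
            with_ko += 1
    return with_ko, total


# Illustrative records only; for a real file, pass open("prokaryotes.pep").
sample = [
    ">eco:b0001 thrL; thr operon leader peptide\n",
    "MKRIST\n",
    ">eco:b3957 example description; K01438\n",
    "MKNKLP\n",
]
print(count_headers_with_ko(sample))  # (1, 2)
```

If the first number is zero for the whole file, the headers alone give no way to recover KOs, and a separate gene-to-KO link file is needed.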

shafferm avatar May 19 '21 17:05 shafferm

Here is the output from prokaryotes.pep

eco:b0001 thrL; thr operon leader peptide
eco:b0002 thrA; fused aspartate kinase/homoserine dehydrogenase 1
eco:b0003 thrB; homoserine kinase
eco:b0004 thrC; threonine synthase
eco:b0005 yaaX; DUF2502 domain-containing protein YaaX
eco:b0006 yaaA; peroxide stress resistance protein YaaA
eco:b0007 yaaJ; putative transporter YaaJ
eco:b0008 talB; transaldolase B
eco:b0009 mog; molybdopterin adenylyltransferase
eco:b0010 satP; acetate/succinate:H(+) symporter

The format of genes_ko.list is below:

grep 'eco:' genes_ko.list | head

eco:b3957 ko:K01438
eco:b3958 ko:K00145
eco:b3959 ko:K00930
eco:b3962 ko:K00322
eco:b3968 ko:K01977
eco:b3970 ko:K01980
eco:b3971 ko:K01985
eco:b3980 ko:K02358
eco:b3981 ko:K03073
eco:b3982 ko:K02601
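Given the two formats above, linking gene IDs to KOs can be sketched in Python as follows. This is only a sketch under assumptions: the thread does not state which header format DRAM expects, so the "; K..., K..." suffix written by `add_kos_to_header` is a hypothetical layout, both function names are invented here, and the `xyz:g1` gene with two KOs is made-up data to show the multi-KO case.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def load_gene_ko_links(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each gene ID to its KO IDs from two-column 'gene ko:KXXXXX' lines."""
    links: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        gene_id, ko = fields
        # Drop the "ko:" prefix if present, keeping only the K number.
        links[gene_id].append(ko[3:] if ko.startswith("ko:") else ko)
    return dict(links)


def add_kos_to_header(header: str, links: Dict[str, List[str]]) -> str:
    """Append KO IDs to a FASTA header (the suffix format is an assumption)."""
    parts = header.lstrip(">").split(maxsplit=1)
    if not parts:
        return header  # empty header, nothing to look up
    kos = links.get(parts[0])
    if not kos:
        return header  # gene has no KO assignment
    return f"{header.rstrip()}; {', '.join(kos)}"


# Two lines from the output above, plus a made-up gene with two KOs.
link_lines = [
    "eco:b3957\tko:K01438",
    "eco:b3958\tko:K00145",
    "xyz:g1\tko:K00001",
    "xyz:g1\tko:K00002",
]
links = load_gene_ko_links(link_lines)
print(add_kos_to_header(">eco:b3957 example description", links))
print(add_kos_to_header(">xyz:g1 example description", links))
print(add_kos_to_header(">eco:b0001 thrL; thr operon leader peptide", links))
```

Keeping a list per gene means a gene with several KO assignments keeps all of them. Passing the link file through --gene_ko_link_loc during setup, as suggested earlier in the thread, avoids rewriting headers at all.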

gruningerrj avatar May 19 '21 17:05 gruningerrj

Using KEGG is still considered an advanced use case, but we have tools that can be used to build compatible pep files. I hope this process will become more streamlined in the future. Until then, I am closing this issue, as it is not relevant to the current code base.

rmFlynn avatar Dec 12 '22 20:12 rmFlynn

It might be good to show somewhere in the documentation what you think the kegg.pep headers should look like, particularly for genes that have more than one KO :)

mw55309 avatar Oct 24 '23 07:10 mw55309