DRAM
How to prepare a single .pep KEGG database for DRAM database setup step
Hi,
I have a KEGG subscription and want to incorporate it into my DRAM setup. I tried setting up DRAM using a concatenation of the different .pep KEGG files to produce a single KEGG database, but DRAM didn't like it. It worked fine when pointing to only one of them.
What's the best way to combine these databases to make a single .pep file for DRAM setup? Or can I point to multiple .pep files upon configuration?
Thanks!
Sorry for the late reply, and I am sorry that KEGG is so poorly documented. There are two ways to do this. You can use the gene_ko_link_loc argument, which should work with a concatenated pep file. I, however, prefer to use this script, because the result is easier to check. The KEGG database changes on occasion, so it does require a more hands-on approach.
You can download the stuff you need with something like this.
wget -r -A "*\.pep\.gz" ftp://$UNAME:$PASSWORD@ftp.kegg.net/kegg/genes/organisms
wget -r -A "*\.kff\.gz" ftp://$UNAME:$PASSWORD@ftp.kegg.net/kegg/genes/organisms
gzip -d ftp.kegg.net/kegg/genes/organisms/*/*pep.gz
gzip -d ftp.kegg.net/kegg/genes/organisms/*/*\.kff\.gz
cat ftp.kegg.net/kegg/genes/organisms/*/*pep >> kegg-all-orgs_20220129.pep
cat ftp.kegg.net/kegg/genes/organisms/*/*kff >> kegg-all-orgs_20220129.kff
If you have any problems at all, let me know, and I will walk you through it. I will be watching with interest.
Attachment: remove_kegg_dups_and_rehead.py.gz
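For readers following along, the idea of adding KOs to the FASTA headers from a gene-to-KO link file can be sketched roughly as below. This is not the attached script. It assumes a two-column, tab-separated link file with lines such as `eco:b0001<TAB>ko:K08278`, and the way the KOs are appended to each header (`; K08278`) is an illustrative choice, not a layout confirmed in this thread. The function names `load_gene_ko_links` and `rehead_fasta` are hypothetical.

```python
from collections import defaultdict


def load_gene_ko_links(lines):
    """Map gene ID -> sorted list of KO IDs.

    Assumes tab-separated lines like 'eco:b0001<TAB>ko:K08278'
    (an assumption about the link file layout).
    """
    links = defaultdict(set)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        gene_id, ko = parts[0], parts[1]
        if ko.startswith("ko:"):
            ko = ko[3:]  # drop the 'ko:' prefix
        links[gene_id].add(ko)
    return {gene: sorted(kos) for gene, kos in links.items()}


def rehead_fasta(fasta_lines, links):
    """Yield FASTA lines, appending linked KOs that a header does not already contain."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            gene_id = fields[0] if fields else ""
            # Only add KOs that are not already present in the header text.
            missing = [ko for ko in links.get(gene_id, []) if ko not in line]
            if missing:
                line = line + "; " + " ".join(missing)
        yield line
```

Both functions accept any iterable of lines, so open file handles (including ones from `gzip.open(path, "rt")`) can be passed in directly. Headers with no entry in the link file are left unchanged.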
Hi, thanks for this!
Out of interest, do I need the .kff files to configure DRAM to use the corresponding KEGG databases? I have the following .pep files that I plan to concatenate and configure DRAM to use, but I can't see any equivalent .kff files on the KEGG server?
eukaryotes.pep.gz family_eukaryotes.pep.gz genus_eukaryotes.pep.gz genus_prokaryotes.pep.gz prokaryotes.pep.gz species_prokaryotes.pep.gz T10000.pep.gz T40000.pep.gz
Thanks in advance!
Good question! The problem we have with KEGG is that they don't always put the KOs, which are what we mostly use, in the headers, which are what we pull from. The other problem is that, depending on your data set, the headers and the genes themselves can be repeated.
If you want every possible KO in the output, I suggest getting the full set of kff files and using my script or the KO link argument. Some computation will be wasted, because the combined kff file will be larger than your database. However, writing a custom script, or manually finding each kff file you need based on the KEGG ID and matching it, would take more of your time than it is worth. A kff file made with the wget, gzip and concatenate commands above should have all gene IDs and KOs.
If the missing KOs are not important, you only need to de-duplicate your concatenated FASTA. You could modify my script, or use some other tool to do that. If you don't add KOs, DRAM should still work fine as long as you de-duplicate. The annotation files will have fewer KO numbers, and that will change your results, including in the distill step, but the software will run with no problem. If you de-duplicate the concatenated FASTA and DRAM setup still does not work, then we have a problem outside this discussion entirely, and we will deal with that!
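As a rough illustration of the de-duplication step, a minimal sketch might look like the following. It is not the script attached earlier in the thread. It assumes that a "duplicate" means a repeated gene ID, taken as the first whitespace-delimited token of the header, and it keeps the first record seen for each ID. The names `read_fasta`, `dedup_fasta` and `dedup_file` are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedup_fasta(records):
    """Keep only the first record for each gene ID (first token of the header)."""
    seen = set()
    for header, seq in records:
        fields = header.split(None, 1)
        gene_id = fields[0] if fields else ""
        if gene_id in seen:
            continue  # repeated ID: drop this record
        seen.add(gene_id)
        yield header, seq


def dedup_file(src_path, dst_path):
    """Write a de-duplicated copy of the FASTA at src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for header, seq in dedup_fasta(read_fasta(src)):
            dst.write(f">{header}\n{seq}\n")
```

For example, `dedup_file("kegg-all-orgs_20220129.pep", "kegg-all-orgs_20220129.dedup.pep")` would process the concatenated file from the commands above (the output file name is made up). The set of seen IDs is held in memory, so a full KEGG download needs a machine with enough RAM for all of the gene IDs.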
OK thanks, I will concatenate these, attempt to de-duplicate, set up the databases and report back!
Good luck!