DRAM
Custom database
Hi,
On the main GitHub page it says DRAM can also use custom user databases. Is it explained somewhere how to do this?
I have fasta files containing protein sequences for a number of metabolic marker genes that I would like to look for in my MAGs. These fasta files are titled for example Fe_hydrogenase.fasta and look like so:
>WP_004030875.1 - Methanobacterium formicicum - [Fe]
MKLAILGAGCYRTHAASGITNFSRACEVAEQVGKPEIAMTHSTIAMGAELKELAGIDEIVVSDPVFDNDFTVIDDFEYEAVIEAHKKDPESIMPQIREKVNAVAKDLPKPPKGAIHFTHPEDLGFEVTTDDNEAVQDADWVMTWFPKGDMQMGIIKEFADNLKEGAILTHACTVPTTMFQKIFEDLSSDEMNIAPKVNVSSYHPGAVPEMKGQVYIAEGYASEDAICKLVDWGVAARGDAFKLPAELLGPVCDMCSALTAITYAGILSYRDSVMNILGAPAGFAQMMAKESLTQVTDLMNSVGIDHMEEKLDPGALLGTADSMNFGAAADVLPSVLEVLENRKGKGPTCNI
>WP_012955328.1 - Methanobrevibacter ruminantium - [Fe]
MKVAILGAGCYRTHAASGITNFSRACEVADATGKENISMTHSTIEMGAELLELAGVDEVVVADPVFDGEFTVVEDFDYAEVIAAHKAGNPEDVMPAIRAKVGELAETVPKPANGAIHFTHPEDLGMKCTTDDREAVADADWIMTWLPEGGMQPAIIEKFADVIKDGAIVTSACTIPTPGLNQIFEDLGKNVNVASYHPGAVPEMKGQVYIAEGFADQAAIDTLKDLGAKARGSAFTLPANMVGPVCDMCSAVTAITYAGLLSYRDTVTQILGAPAGFAQMMANEALTNVTKLMADEGIDKMDDALNPGALLGTADSMNFGPLSEIVPTILESLEKRSK
>WP_019263574.1 - Methanobrevibacter smithii - [Fe]
MKVAILGAGCYRTHAASGITNFTRACEVAEETGKEKFAMTHSTIEMGAELLHLAGVDEVVVSDPVFDNDFTVVDDFDFQEVIAAHKAGKAEDVMPDIRAKVNELAESLPTPPKAAIHFVDPEDLGMKTMNDDAAAVADADWVMTWLPEGGMQKPIIEKFAGELKEGAILTHACTIPTTEFKNIFDECGANVNVASYHPGAVPEMKGQAYIGEGYADEASIKTLLELGEKARGSAFTLPANLLGPVCDMCSAVTAITYAGILAYRDTVTQILGAPAGFAQMMANEALTQVTALMQDEGIDKMDEALNPGALLGTADSMNFGPLAEIVPTVLENLEKRS
>WP_016357634.1 - Methanobrevibacter sp. AbM4 - [Fe]
MKVAILGAGCYRTHSASGITNFTRACEVAEQTGKKEIALTHSTIEMGAELLHLAGVDEVVVADPVFKEGLTIVDDFDYDEVIAAHKAGKPEDVMPAIREKVNSLAETVAKPPKGAIHFVDPEDLGMKTTADDSEAVADADWVMTWLPEGGMQPDIIKNFAGDIKEGAIVTHACTIPTTQFKKIFDDLGAKVNVASYHPGAVPEMKGQAYIAKGYASDEAINTLLELGTKARGEAFTLPANLLGPVCDMCSAVTAITYAGILAYRDTVTQILGAPAGFAQNMADQALTQVTALMNDEGIDKMDEALDPAALLGTADSMNFGALAEIVPTVLDYLGKDKKE
>WP_013296316.1 - Methanothermobacter marburgensis - [Fe]
MKLAILGAGCYRTHAASGITNFSRACEVAEMVGKPEIAMTHSTITMGAELKELAGVDEVVVADPVFDNQFTVIDDFAYEDVIEAHKEDPEKIMPQIREKVNEVAKELPKPPEGAIHFTHPEDLGFEITTDDREAVADADFIMTWFPKGDMQPDIINKFIDDIKPGAIVTHACTIPTTKFYKIFEQKHGDLVTKPETLNVTSYHPGAVPEMKGQVYIAEGYASEDAIETLFELGQKARGNAYRLPAELLGPVCDMCSALTAITYAGILSYRDSVTQVLGAPASFAQMMAKESLEQITALMEKVGIDKMEENLDPGALLGTADSMNFGASAEILPTVFEILEKRKK
Would I be able to use these to construct a custom database for DRAM?
What you want is something like:
# --custom_db_name can be whatever name you like; --custom_fasta_loc is the path to the file
DRAM.py annotate \
    -i 'your_bins/path/*.fa' \
    -o 'your_output/path' \
    --threads 20 \
    --custom_db_name 'Fe_hydrogenase' \
    --custom_fasta_loc 'Fe_hydrogenase.fasta'
You can learn more with DRAM.py annotate -h.
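Since the question mentions several marker-gene FASTA files, here is a minimal sketch of how the command could be assembled for more than one of them. It assumes that `--custom_db_name` and `--custom_fasta_loc` may each be given several times, in matching order; please confirm that against `DRAM.py annotate -h` for your version before relying on it. The helper name `build_annotate_command` and the second file name `NiFe_hydrogenase.fasta` are hypothetical, made up for illustration.

```python
import shlex
from pathlib import Path


def build_annotate_command(bins_glob, out_dir, fasta_paths, threads=20):
    """Return a DRAM.py annotate argument list with one custom DB per FASTA file.

    ASSUMPTION: the two --custom_* flags can be repeated in matching order.
    Each database is named after its FASTA file (file name without extension).
    """
    cmd = ["DRAM.py", "annotate",
           "-i", bins_glob,
           "-o", out_dir,
           "--threads", str(threads)]
    for fasta in sorted(Path(p) for p in fasta_paths):
        cmd += ["--custom_db_name", fasta.stem,
                "--custom_fasta_loc", str(fasta)]
    return cmd


if __name__ == "__main__":
    # NiFe_hydrogenase.fasta is a hypothetical second file, for illustration only.
    command = build_annotate_command(
        "your_bins/path/*.fa",
        "your_output/path",
        ["Fe_hydrogenase.fasta", "NiFe_hydrogenase.fasta"],
    )
    # Print a shell-quoted command line to paste into a job script.
    print(shlex.join(command))
```

Printing the command rather than running it lets you inspect the flags before submitting the job.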
Ok great, thank you. Is there an option to only use the custom databases you specify, without rerunning the standard databases?
There is no specific option, but there are ways to disable some databases. If you disable too many, you lose the utility of using DRAM at all, but you should be able to remove the path to the databases you don't want to run.
Ok thank you. The reason I ask is that I'm running out of time on our HPC. The maximum runtime for a job is 72 hours, which is not enough for the standard plus custom databases I have (partly because I have 321 MAGs). Alternatively, since I think I only need the annotations.tsv file, if I split the genome collection into 3 sets of around 100 MAGs and run them separately, the annotations would be the same for each MAG, right?
Yes, that is definitely a good option. It may slightly affect the e-values calculated by some tools, but there should be no meaningful difference, and the results would still stand up to scrutiny. It will take more compute time overall, but the individual jobs will not time out. You can then concatenate the results and make the product if you need to.
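For the split-and-concatenate route, something along these lines might help. It is only a sketch under assumptions: it treats each annotations.tsv as a tab-separated table with a single header row, and the function names `split_into_batches` and `concat_annotations` are my own, not part of DRAM. If your DRAM version lists its own subcommand for merging annotations under `DRAM.py -h`, that is probably the safer choice.

```python
import csv


def split_into_batches(paths, n_batches):
    """Split MAG FASTA paths into n_batches groups of near-equal size."""
    paths = sorted(paths)
    # Striding deals the sorted paths out like cards, one per batch in turn.
    return [paths[i::n_batches] for i in range(n_batches)]


def concat_annotations(tsv_paths, out_path):
    """Concatenate per-batch annotations.tsv files into a single table.

    ASSUMPTION: every file is tab-separated with one header row.
    Columns are unioned in first-seen order, so a batch that lacks a
    column gets empty cells for it. Returns the number of data rows written.
    """
    columns = []
    rows = []
    for tsv in tsv_paths:
        with open(tsv, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            rows.extend(reader)
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t",
                                restval="", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Because `-i` takes a path pattern, one way to use the batches is to symlink each group of MAGs into its own directory, run `DRAM.py annotate` once per directory, and then pass the resulting annotations.tsv files to `concat_annotations`.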
Hi, disabling the standard databases to get output only from the custom databases worked for me, so thanks for that!
"If you disable too many, you would lose the utility of using dram at all, but you should be able to remove the path to the databases you don't want to run."
If I would like to check whether I can use DRAM to get a distillate with these custom databases, do you know which of the standard databases I would have to add at minimum? Only KEGG/KOfam?
Yes, KEGG/KOfam would be the minimum. There are other options, but KEGG/KOfam is the best choice for getting completeness with the minimum of computational effort. That is what I suspect, at least. I must admit my limits here: I came to this project later, so I have not experimented enough with this use case to know exactly what may go wrong. Thanks for exploring this for me! I suspect you may need to do something more down the line to get the results you want in the distillate. Also, you lose information with each database you cut, so the results may differ depending on which DBs you include.