Custom database

Open slambrechts opened this issue 3 years ago • 7 comments

Hi,

On the main GitHub page it says DRAM can also use custom user databases. Is it explained somewhere how to do this?

I have FASTA files containing protein sequences for a number of metabolic marker genes that I would like to look for in my MAGs. These files are named, for example, Fe_hydrogenase.fasta and look like this:

>WP_004030875.1 - Methanobacterium formicicum - [Fe]
MKLAILGAGCYRTHAASGITNFSRACEVAEQVGKPEIAMTHSTIAMGAELKELAGIDEIVVSDPVFDNDFTVIDDFEYEAVIEAHKKDPESIMPQIREKVNAVAKDLPKPPKGAIHFTHPEDLGFEVTTDDNEAVQDADWVMTWFPKGDMQMGIIKEFADNLKEGAILTHACTVPTTMFQKIFEDLSSDEMNIAPKVNVSSYHPGAVPEMKGQVYIAEGYASEDAICKLVDWGVAARGDAFKLPAELLGPVCDMCSALTAITYAGILSYRDSVMNILGAPAGFAQMMAKESLTQVTDLMNSVGIDHMEEKLDPGALLGTADSMNFGAAADVLPSVLEVLENRKGKGPTCNI
>WP_012955328.1 - Methanobrevibacter ruminantium - [Fe]
MKVAILGAGCYRTHAASGITNFSRACEVADATGKENISMTHSTIEMGAELLELAGVDEVVVADPVFDGEFTVVEDFDYAEVIAAHKAGNPEDVMPAIRAKVGELAETVPKPANGAIHFTHPEDLGMKCTTDDREAVADADWIMTWLPEGGMQPAIIEKFADVIKDGAIVTSACTIPTPGLNQIFEDLGKNVNVASYHPGAVPEMKGQVYIAEGFADQAAIDTLKDLGAKARGSAFTLPANMVGPVCDMCSAVTAITYAGLLSYRDTVTQILGAPAGFAQMMANEALTNVTKLMADEGIDKMDDALNPGALLGTADSMNFGPLSEIVPTILESLEKRSK
>WP_019263574.1 - Methanobrevibacter smithii - [Fe]
MKVAILGAGCYRTHAASGITNFTRACEVAEETGKEKFAMTHSTIEMGAELLHLAGVDEVVVSDPVFDNDFTVVDDFDFQEVIAAHKAGKAEDVMPDIRAKVNELAESLPTPPKAAIHFVDPEDLGMKTMNDDAAAVADADWVMTWLPEGGMQKPIIEKFAGELKEGAILTHACTIPTTEFKNIFDECGANVNVASYHPGAVPEMKGQAYIGEGYADEASIKTLLELGEKARGSAFTLPANLLGPVCDMCSAVTAITYAGILAYRDTVTQILGAPAGFAQMMANEALTQVTALMQDEGIDKMDEALNPGALLGTADSMNFGPLAEIVPTVLENLEKRS
>WP_016357634.1 - Methanobrevibacter sp. AbM4 - [Fe]
MKVAILGAGCYRTHSASGITNFTRACEVAEQTGKKEIALTHSTIEMGAELLHLAGVDEVVVADPVFKEGLTIVDDFDYDEVIAAHKAGKPEDVMPAIREKVNSLAETVAKPPKGAIHFVDPEDLGMKTTADDSEAVADADWVMTWLPEGGMQPDIIKNFAGDIKEGAIVTHACTIPTTQFKKIFDDLGAKVNVASYHPGAVPEMKGQAYIAKGYASDEAINTLLELGTKARGEAFTLPANLLGPVCDMCSAVTAITYAGILAYRDTVTQILGAPAGFAQNMADQALTQVTALMNDEGIDKMDEALDPAALLGTADSMNFGALAEIVPTVLDYLGKDKKE
>WP_013296316.1 - Methanothermobacter marburgensis - [Fe]
MKLAILGAGCYRTHAASGITNFSRACEVAEMVGKPEIAMTHSTITMGAELKELAGVDEVVVADPVFDNQFTVIDDFAYEDVIEAHKEDPEKIMPQIREKVNEVAKELPKPPEGAIHFTHPEDLGFEITTDDREAVADADFIMTWFPKGDMQPDIINKFIDDIKPGAIVTHACTIPTTKFYKIFEQKHGDLVTKPETLNVTSYHPGAVPEMKGQVYIAEGYASEDAIETLFELGQKARGNAYRLPAELLGPVCDMCSALTAITYAGILSYRDSVTQVLGAPASFAQMMAKESLEQITALMEKVGIDKMEENLDPGALLGTADSMNFGASAEILPTVFEILEKRKK

Would I be able to use these to construct a custom database for DRAM?

slambrechts avatar Dec 01 '21 14:12 slambrechts

What you want is something like

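  # --custom_db_name can be whatever name you like; --custom_fasta_loc is the path to the file.
  # (Comments cannot follow a trailing backslash, so they are kept up here.)
  DRAM.py annotate \
          -i 'your_bins/path/*.fa' \
          -o 'your_output/path' \
          --threads 20 \
          --custom_db_name 'Fe_hydrogenase' \
          --custom_fasta_loc 'Fe_hydrogenase.fasta'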

You can learn more with DRAM.py annotate -h.
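For several marker FASTA files, the argument list can be generated rather than typed by hand. The following is only a sketch: it assumes that `--custom_db_name` and `--custom_fasta_loc` may each be given more than once (check `DRAM.py annotate -h` for your version), and `build_custom_db_args` and the `markers/` directory are hypothetical names used for illustration.

```shell
# Print one --custom_db_name / --custom_fasta_loc pair per FASTA file,
# one argument per line. The database name is the file name without its
# extension (Fe_hydrogenase.fasta -> Fe_hydrogenase).
# Assumption: DRAM.py annotate accepts both flags repeatedly.
build_custom_db_args() {
  for fasta in "$@"; do
    name=$(basename "$fasta")
    name=${name%.*}
    printf '%s\n' --custom_db_name "$name" --custom_fasta_loc "$fasta"
  done
}

# Possible usage in bash (not run here):
#   mapfile -t custom_args < <(build_custom_db_args markers/*.fasta)
#   DRAM.py annotate -i 'your_bins/path/*.fa' -o 'your_output/path' \
#           --threads 20 "${custom_args[@]}"
```

Deriving the name from the file name keeps each database label traceable to its source file.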

rmFlynn avatar Dec 01 '21 17:12 rmFlynn

Ok great, thank you. Is there an option to only use the custom databases you specify, without rerunning the standard databases?

slambrechts avatar Dec 02 '21 10:12 slambrechts

There is no specific option, but there are ways to disable some databases. If you disable too many, you lose the utility of using DRAM at all, but you should be able to remove the paths to the databases you don't want to run.

rmFlynn avatar Dec 08 '21 00:12 rmFlynn

Ok, thank you. The reason I ask is that I'm running out of time on our HPC. The maximum runtime for a job is 72 hours, which is not enough with the standard + custom databases I have (also because I have 321 MAGs). Alternatively, since I think I only need the annotations.tsv file, if I split the genome collection into 3 sets of around 100 MAGs and run them separately, the annotations would be the same for each MAG, right?

slambrechts avatar Dec 08 '21 13:12 slambrechts

Yes, that is definitely a good option. It may affect the e-values calculated by some tools slightly, but there should be no meaningful difference. Nothing that would not stand up to scrutiny. It will take more time overall, but the jobs will not time out. You will be able to concatenate the results and make the product if you need to.
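The split-and-concatenate approach could look roughly like the sketch below. `split_into_batches` and `merge_annotation_tables` are hypothetical helper names, not part of DRAM, and the merge assumes every annotations.tsv has the same columns in the same order, which should hold when each batch is annotated with the same databases.

```shell
# Copy input files round-robin into N batch directories:
# OUT/batch_1 ... OUT/batch_N.
# usage: split_into_batches N OUT FILE...
split_into_batches() {
  n=$1; out=$2; shift 2
  i=0
  for f in "$@"; do
    batch_dir="$out/batch_$(( i % n + 1 ))"
    mkdir -p "$batch_dir"
    cp "$f" "$batch_dir/"
    i=$(( i + 1 ))
  done
}

# Concatenate annotation tables, keeping the header line of the first
# file only and skipping the first line of every later file.
# Assumption: all tables share the same columns in the same order.
merge_annotation_tables() {
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Possible usage (not run here), one DRAM job per batch:
#   split_into_batches 3 batches your_bins/path/*.fa
#   DRAM.py annotate -i 'batches/batch_1/*.fa' -o 'out_batch_1' --threads 20
#   merge_annotation_tables out_batch_*/annotations.tsv > annotations_all.tsv
```

DRAM may also provide its own way of merging annotation folders; `DRAM.py -h` lists what your version offers.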

rmFlynn avatar Dec 08 '21 16:12 rmFlynn

Hi, disabling the standard databases to only get output from the custom databases worked for me, so thanks for that!

> If you disable too many, you would lose the utility of using DRAM at all, but you should be able to remove the path to the databases you don't want to run.

If I would like to check whether I can use DRAM to get a distillate with these custom databases, do you know which of the standard databases I would have to add at minimum? Only KEGG/KOfam?

slambrechts avatar Dec 13 '21 11:12 slambrechts

Yes, KEGG/KOfam would be the minimum. There are other options, but KEGG/KOfam is the best choice to get completeness with the minimum of computational effort. That is what I suspect, at least; I must admit my limits here. I came to this project later, so I have not experimented enough with this use case to know exactly what may go wrong. Thanks for exploring this for me! I suspect that you may need to do something more down the line to get the results you want in the distillate. Also, you will lose information with each database you cut, so the results may differ depending on which DBs you include.

rmFlynn avatar Dec 13 '21 16:12 rmFlynn