VirSorter
Custom database - How to?
Hey there,
I want to make a custom database, or at least update the current RefSeq based database to a 2019 version. How am I to do this? I can't seem to find a manual? Inspection of db1 shows me that it is not a simple blast database.
Cheers, Megan.
Hello,
I've got the same question as well. In addition, I'd like to use a ".fna" file as well as the Virome database. On CyVerse I can do this using "Additional viral sequence to be used as reference (optional)", but what is the command line argument for this if I run on my own server?
Closest I could find was the "--cp Custom phage sequence " argument, but no real documentation on how to use it, or if I just point to a FASTA file (my .fna file).
Best,
Mike
Hey there, I can actually help with that one:
You would use the --cp flag with your FASTA file of additional sequences, in conjunction with --db 1. I then kept the resulting database using the --keep-db flag. The problem is that I am now unable to feed that database back in. At the moment I'm toying around with various options to try to make it work, but if one of the developers could help further, that would be great.
This was my command
virsorter -f /srv/projects/coral/Seaquence_Accelerate_master_directory/analysis/20190520_metaspades_assemblies/metaspades_ROB3349A03-148_S3/contigs.fasta --db 1 --keep-db --cp "/srv/home/s4549287/20190429_RefSeq_Vir/refseqvir.fna" --wdir /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148 --ncpu 10 --data-dir /srv/db/virsorter_data/
Hi MeggyC,
Thanks for the command. I will give it a try today. In that case, if I set VirSorter to "decontamination mode" on the command line with my --cp custom database, will it remove all sequences that are not in the Refseq/Virome and my custom database, and return a new FASTA file with my contigs sequences? I am a bit confused as to how exactly the decontamination mode works and what the output is.
Cheers.
Mike
Hi Mike & Megan, Megan is right about the use of "--cp" on the command line, it will be the same thing as adding an "Additional viral sequence to be used as reference (optional)" on CyVerse (thanks for answering this ! ). Now for the other questions:
- The "decontamination" mode is not meant for "removing viral sequences from a dataset". It is meant instead for "removing non-viral sequences from viral metagenomes" (which was one of the original purposes of VirSorter). The reason there is a special mode for this is as follows: by default, VirSorter establishes background probabilities (e.g. of a PFAM hit, of a viral gene hit, etc.) based on the whole dataset, assuming that most sequences will be from cellular genomes. It then searches for sequences that are "more viral than the background". If the background is all or mostly viral (e.g. a good viral metagenome), then nothing will be detected. To fix this, the decontamination mode doesn't calculate background probabilities from the dataset but instead uses pre-calculated ones (based on cellular genomes from RefSeq, evenly sampled across taxonomic groups). So the output is exactly the same as for a regular VirSorter run, but this decontamination mode is really meant for cases where you suspect a large fraction (~ >50%) of your sequences are viral.
- For the updated database, unfortunately the process is not very straightforward. In short, one would have to (i) create a new protein clusters database and generate the corresponding HMMs, (ii) identify hallmark genes in this new database, as well as genes exclusively found in non-Caudovirales, (iii) build a blast database of the sequences not included in the protein clusters, and (iv) probably recalculate background probabilities based on this new database against RefSeq for the decontamination mode. Now, if you mostly want a database that includes your new sequence(s) without having to add them as "Additional fasta file" every time, then you should be able to extract the files from a VirSorter run. What you'll likely have to do is kill the VirSorter process before it finishes, because its last step is to "clean" the output directory and remove the database files, which are quite big. If you manage to stop it at Step 3 or 4, then you should have a folder named "r_0" in the output directory, and in this folder another one named "db". This should be your database: the selected VirSorter one (RefSeq or Virome) with your additional sequences included.
Let me know if this helps !
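For a mostly viral dataset, the command line could look roughly like the sketch below. It only prints the command rather than running it. The flag name --virome for the decontamination mode is an assumption based on the VirSorter 1.x help text and should be confirmed against the help output of your installation; the file names and the --ncpu value are placeholders. --db 2 selects the Viromes database, and --cp adds custom sequences on top of it.

```shell
# Dry-run sketch: print a VirSorter command for a mostly viral dataset.
# ASSUMPTION: --virome is the decontamination-mode flag; check your install's help text.
# Arguments are placeholders: contigs FASTA, working dir, data dir, optional custom FASTA.
# Paths are not quoted in the printed command, so avoid spaces in them.
build_virsorter_cmd() {
  cmd="virsorter -f $1 --db 2 --wdir $2 --ncpu 4 --data-dir $3 --virome"
  # --cp is optional and always applies on top of the selected database
  if [ -n "${4:-}" ]; then
    cmd="$cmd --cp $4"
  fi
  printf '%s\n' "$cmd"
}

build_virsorter_cmd contigs.fasta out_dir /path/to/virsorter-data custom_phages.fna
```

Printing the command first makes it easy to inspect the flags before launching a long run.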
Hi Simon,
Thank you for the detailed answer. So from my understanding, since the contigs I have are mostly viral, I should be using the "decontamination mode" to reveal more viral sequences which would otherwise be ignored as background.
Following up on the use of "--cp", is there any way to only compare contigs against my custom database FASTA file defined in "--cp", or must it always be Refseq/virome AND the --cp database?
Also, when using the "--keep-db" flag, wouldn't this just keep the files in the "r_0" db folder which I could then copy and paste into the "Phage_gene_catalog_plus_viromes" and overwrite to have my custom phages included each time?
All the best,
Michael
Hi Michael,
You're correct about the decontamination mode. For the cp flag, there is currently no way to only compare contigs to your custom database, the "--cp" is always on top of either RefSeq or Virome.
And thanks for reminding me of the "--keep-db" flag, I had completely forgotten we had put this here. So yes, scratch my comment about interrupting VirSorter: you can use this flag, then look in the r_0 db folder, and you should find a database that includes your own custom phage.
Best, Simon
Hi Simon,
And in that case, I could overwrite the files in the "Phage_gene_catalog_plus_viromes" (since I use viromes not refseq) with the contents from r_0 and it will always include my custom phages?
Best,
Michael
Correct, that is the expected behavior (I would suggest doing a backup copy of the "Phage_gene_catalog_plus_viromes" just in case though).
Thank you! I will report back if any other issues come back on this topic, otherwise it seems easy enough to do.
Cheers.
EDIT: it worked perfectly! I copied files from "r_0" (excluding the folder named "initial_db") to the directory "Phage_gene_catalog_plus_viromes", overwriting all that was conflicting (after backing up the original viromes database files)... and now every time I run VirSorter with --db 2 (viromes), it includes my custom phages automatically, no need for the --cp or --keep-db flags.
Hey Simon,
My problem now is that I have not been able to successfully re-feed the database from the keep database flag back into virsorter. The program runs - but produces no output.
Here is my somewhat longwinded report on the issue:
Command:
Trial 1) to see whether virsorter accepts my new db - FLAGS: --db 1 --data-dir /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/
virsorter -f /srv/home/s4549287/tmp/trial1/contigs.fasta --db 1 --wdir /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini1 --ncpu 10 --data-dir /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/ &> /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini1/mini.log
Error: File existence/permissions problem in trying to open HMM file /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/PFAM_27/Pfam-A.hmm. HMM file /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/PFAM_27/Pfam-A.hmm no
Error: File existence/permissions problem in trying to open HMM file /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/PFAM_27/Pfam-B.hmm. HMM file /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/PFAM_27/Pfam-B.hmm no
BLAST Database error: No alias or index file found for protein database [/srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini1/r_0/db/Pool_new_unclustered] in search path [/srv/home/s4549287::]
Can't open '/srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini1/Contigs_prots_vs_Phage_Gene_Catalog.tab' for reading: 'No such file or directory' at /usr/local/bin/Scripts/Step_2_merge_contigs_annotation.pl line 103
Can't open '/srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini1/VIRSorter_affi-contigs.csv' for reading: 'No such file or directory' at /usr/local/bin/Scripts/Step_3_highlight_phage_signal.pl line 59
Can't open '/srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini1/VIRSorter_phage-signal.csv' for reading: 'No such file or directory' at /usr/local/bin/Scripts/Step_4_summarize_phage_signal.pl line 83
Error: File existence/permissions problem in trying to open HMM file /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/PFAM_27/Pfam-A.hmm. HMM file /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/PFAM_27/Pfam-A.hmm no
Error: File existence/permissions problem in trying to open HMM file /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/PFAM_27/Pfam-B.hmm. HMM file /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/PFAM_27/Pfam-B.hmm no
Can't open '/srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini1/Contigs_prots_vs_Phage_Gene_unclustered.tab' for reading: 'No such file or directory' at /usr/local/bin/Scripts/Step_2_merge_contigs_annotation.pl line 79
Can't open '/srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini1/VIRSorter_affi-contigs.csv' for reading: 'No such file or directory' at /usr/local/bin/Scripts/Step_3_highlight_phage_signal.pl line 59
Can't open '/srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini1/VIRSorter_phage-signal.csv' for reading: 'No such file or directory' at /usr/local/bin/Scripts/Step_4_summarize_phage_signal.pl line 83
Trial 2) FLAGS: no --db flag, --data-dir /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/
virsorter -f /srv/home/s4549287/tmp/trial1/contigs.fasta --wdir /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini2 --ncpu 10 --data-dir /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/
Error: File existence/permissions problem in trying to open HMM file /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/PFAM_27/Pfam-A.hmm. HMM file /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/PFAM_27/Pfam-A.hmm no
Error: File existence/permissions problem in trying to open HMM file /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/PFAM_27/Pfam-B.hmm. HMM file /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/PFAM_27/Pfam-B.hmm no
BLAST Database error: No alias or index file found for protein database [/srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini2/r_0/db/Pool_new_unclustered] in search path [/srv/home/s4549287::]
Can't open '/srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini2/Contigs_prots_vs_Phage_Gene_Catalog.tab' for reading: 'No such file or directory' at /usr/local/bin/Scripts/Step_2_merge_contigs_annotation.pl line 103
Can't open '/srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini2/VIRSorter_affi-contigs.csv' for reading: 'No such file or directory' at /usr/local/bin/Scripts/Step_3_highlight_phage_signal.pl line 59
Can't open '/srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_mini2/VIRSorter_phage-signal.csv' for reading: 'No such file or directory' at /usr/local/bin/Scripts/Step_4_summarize_phage_signal.pl line 83
These are the errors for trial 2, where only --data-dir pointing to r_0/db was given.
Of course, for all my assemblies I can create a new (but identical) database every time from the same .fna file. However, this seems a little computationally expensive to me. Thanks for your help!
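The errors in both trials come down to files that do not exist under the directory passed as --data-dir. A quick pre-flight check along the following lines can confirm that before starting a run. This is only a sketch: the entry names are the ones quoted in this thread (from the error messages and the folder names mentioned above), and a complete VirSorter data directory may well contain more than these.

```shell
# Report which expected entries are missing from a candidate --data-dir.
# ASSUMPTION: the list below is taken from names quoted in this thread only;
# it is not a complete inventory of the VirSorter data directory.
check_data_dir() {
  dir="${1%/}"
  missing=0
  for entry in PFAM_27/Pfam-A.hmm PFAM_27/Pfam-B.hmm \
               Phage_gene_catalog Phage_gene_catalog_plus_viromes; do
    if [ ! -e "$dir/$entry" ]; then
      echo "missing: $dir/$entry"
      missing=$((missing + 1))
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "ok: $dir"
  fi
  # Exit status is the number of missing entries (0 means all were found)
  return "$missing"
}
```

Calling it as check_data_dir followed by the path you intend to pass to --data-dir lists the missing entries, one per line.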
Hi Megan, Yes, unfortunately you won't be able to pass the "r_0" directory as "--data-dir", because it doesn't contain all the databases. The way this should (hopefully) work is as follows:
- In the folder where you unpacked virsorter-data-v2.tar.gz, make a backup copy of "Phage_gene_catalog":
cp -r Phage_gene_catalog/ Phage_gene_catalog.backup/
- Now copy all files from r_0/db/ into the "Phage_gene_catalog" folder, replacing the ones currently there. The four important files here are "Pool_clusters.hmm", "Phage_Clusters_current.tab", "Pool_new_unclustered.faa" and "Blast_unclustered.tab":
cd Phage_gene_catalog
cp /srv/projects/coral/Accelerate_project/analyses/20190619_VirSorter_assemblies/test_CR_148/r_0/db/*.* .
- Now you should be able to run VirSorter with database 1 (--db 1), and this should use your "updated" database.
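The backup-and-copy steps above can be wrapped in a small helper, sketched below. The function name and its two arguments (the catalog folder and the kept r_0/db folder) are hypothetical. It copies regular files only and skips subfolders, which is a design choice of this sketch rather than something VirSorter requires, and it checks afterwards that the four files named above are present.

```shell
# Back up a catalog folder, copy the files kept by --keep-db into it,
# then confirm that the four important files are present.
# Arguments (hypothetical placeholders): catalog folder, kept r_0/db folder.
update_catalog() {
  catalog="${1%/}"
  kept_db="${2%/}"
  if [ ! -d "$catalog" ] || [ ! -d "$kept_db" ]; then
    echo "error: both arguments must be existing directories" >&2
    return 1
  fi
  # Keep an untouched copy so the original database can be restored
  if [ ! -e "$catalog.backup" ]; then
    cp -r "$catalog" "$catalog.backup" || return 1
  fi
  # Copy regular files only; subfolders are skipped
  for f in "$kept_db"/*; do
    if [ -f "$f" ]; then
      cp "$f" "$catalog/" || return 1
    fi
  done
  for name in Pool_clusters.hmm Phage_Clusters_current.tab \
              Pool_new_unclustered.faa Blast_unclustered.tab; do
    if [ ! -f "$catalog/$name" ]; then
      echo "warning: $catalog/$name not found after copy" >&2
      return 1
    fi
  done
  echo "updated: $catalog"
}
```

It would be called with the path to "Phage_gene_catalog" (or "Phage_gene_catalog_plus_viromes" when using --db 2) and the path to the r_0/db folder of the run made with --keep-db. An existing backup is never overwritten, so repeating the update keeps the original database copy intact.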