
Large discrepancy between the high number of viruses predicted by geNomad and the low number of complete viruses

Open bhagavadgitadu22 opened this issue 5 months ago • 5 comments

Hello,

I am currently struggling with metagenomes obtained from environmental biofilms (river rocks). I have 50 metagenomes but obtain only 100 CheckV high-quality (HQ) viruses. I would expect many more from this number of metagenomes, probably about 1,000 HQ viruses, especially considering that geNomad identifies about 500 viruses per metagenome (about 30,000 total). My understanding is that almost all of my viruses are small genome fragments and very few were assembled completely, which leaves two possibilities:

  • Maybe my DNA was already fragmented before I prepared my libraries, in which case my only hope would be to add long-read sequencing
  • Or my DNA is fine and I need to improve my pipeline

In the latter case, I am wondering what you would advise:

  • Forget about CheckV and set my own thresholds (for example, keep predicted viruses with more than 5 hallmark genes in geNomad)
  • Look more closely at the numerous CheckV "Not-determined" viruses (no viral genes detected): geNomad identified them as viruses for a reason, so they might be highly novel viruses that CheckV cannot evaluate
  • Include CheckV medium-quality viruses in the analysis rather than only high-quality ones
  • Complement the geNomad viruses with other viral detection tools, for instance metaviralSPAdes, since it uses a completely different approach and might return completely different viruses
  • Anything else that I did not think of :)
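
To make the first and third options concrete, here is a minimal sketch of the kind of filter I have in mind. It is only an illustration: the contig names and values are made up, the column names (`seq_name`, `n_hallmarks`, `contig_id`, `checkv_quality`) are my assumption of what the geNomad virus summary and the CheckV quality summary contain, and `select_viruses` with its cutoff of 5 hallmarks is a hypothetical helper, not a recommended threshold.

```python
import csv
import io

# Made-up excerpts standing in for the real tables.
# Column names are assumed, not checked against the tools' current output.
GENOMAD_TSV = (
    "seq_name\tlength\tn_hallmarks\n"
    "contig_1\t12000\t8\n"
    "contig_2\t30000\t1\n"
    "contig_3\t4000\t0\n"
    "contig_4\t45000\t2\n"
)
CHECKV_TSV = (
    "contig_id\tcheckv_quality\n"
    "contig_1\tLow-quality\n"
    "contig_2\tMedium-quality\n"
    "contig_3\tNot-determined\n"
    "contig_4\tHigh-quality\n"
)


def read_tsv(text, key):
    """Parse a tab-separated table into a dict keyed by the given column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[key]: row for row in reader}


def select_viruses(genomad, checkv, hallmark_cutoff=5,
                   keep_tiers=("Complete", "High-quality", "Medium-quality")):
    """Keep a contig if CheckV puts it in an accepted tier OR geNomad
    reports more than `hallmark_cutoff` hallmark genes for it."""
    kept = []
    for name, row in genomad.items():
        # Contigs missing from the CheckV table are treated as Not-determined.
        tier = checkv.get(name, {}).get("checkv_quality", "Not-determined")
        hallmarks = int(row["n_hallmarks"])
        if tier in keep_tiers or hallmarks > hallmark_cutoff:
            kept.append((name, tier, hallmarks))
    return kept


genomad = read_tsv(GENOMAD_TSV, "seq_name")
checkv = read_tsv(CHECKV_TSV, "contig_id")
selected = select_viruses(genomad, checkv)
for name, tier, hallmarks in selected:
    print(name, tier, hallmarks)
```

With this toy input, contig_1 is rescued by its hallmark count even though CheckV calls it low-quality, contig_2 and contig_4 pass on their CheckV tier, and contig_3 is dropped.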

Any insight would be welcome!

Martin

bhagavadgitadu22, Feb 29 '24 15:02