Unexpected classification results with Ganon using abfv_rs database (E. coli and K. pneumoniae overrepresented)
Hi,
First of all, thank you for developing Ganon — it's a powerful and much-needed tool for high-performance metagenomic classification.
I'm using Ganon with a database built on December 5, 2023, using the following command:
ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --hibf --level species --db-prefix abfv_rs
Then, for classification, I used:
dbprefix=$(find -L . -name '*.hibf' | sed 's/\.hibf$//')
ganon \
classify \
--db-prefix ${dbprefix%%.hibf} \
--threads 20 \
--output-prefix *_ganon-abfv_rs-2023-12-05.ganon \
--paired-reads *.unmapped_1.fastq.gz *.unmapped_2.fastq.gz \
2>&1 | tee *_ganon-abfv_rs-2023-12-05.ganon.log
However, I observed unexpected results. Specifically, I analyzed both a mock community sample (with known species composition) and several human skin samples. These samples should not contain Escherichia coli or Klebsiella pneumoniae, but these two species were among the top 10 most classified species in the Ganon results.
I validated the same input data with Kraken2, MetaPhlAn 4, IDSeq, and Nephele pipelines, and none of them reported E. coli or K. pneumoniae.
Could this issue be due to the database, the classification algorithm, or something else? Any suggestions or clarifications would be appreciated.
Thanks in advance!
It sounds like a database-related issue. You would see similar results from the tools you mentioned if they used the same set of reference sequences in their databases.
You could try to be more strict with your cutoff and filter (https://pirovc.github.io/ganon/classification/#cutoff-and-filter-rel-cutoff-rel-filter). Pay attention to the number of matches/read that ganon reports after classification. A high number of reads with multiple matches could mean that you have spurious matches. In this case, you can lower the --rel-filter. You could further try to use LCA to resolve reads with multiple matches (https://pirovc.github.io/ganon/classification/#reads-with-multiple-matches) or modify the report type (https://pirovc.github.io/ganon/reports/#report-type-report-type). Let me know if you need some help understanding those.
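To look at matches per read in more detail than the summary ganon prints, a rough sketch like the one below can tally them from the file written by --output-all. It assumes that file is tab-separated with the read id in column 1, the target in column 2 and the k-mer count in column 3, with one line per read/target pair. The function name is made up for illustration.

```shell
# Histogram of matches per read from a ganon ".all" file.
# Assumed layout (tab-separated): read id, target, k-mer count.
matches_per_read_histogram() {
    # $1 = path to the .all file
    awk -F'\t' '
        { n[$1]++ }                       # count lines (matches) per read id
        END {
            for (r in n) hist[n[r]]++     # group reads by their number of matches
            for (k in hist) printf "%d\t%d\n", k, hist[k]
        }
    ' "$1" | sort -n
}
```

Each output line is "number of matches, number of reads". A long tail of reads with many matches would be consistent with the spurious-match scenario described above.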
ganon reports what it finds based on k-mer similarity thresholds. If E. coli appears in your results, something in your input data is similar to it. Keep in mind:
- genome assemblies in RefSeq may still have some level of contamination
- there are conserved genomic regions between bacterial species (e.g. 16S).
- NCBI Taxonomy is not fully based on sequence similarity, so there's a chance you are having issues related to that (e.g. some Shigella genomes are highly similar to E. coli genomes but belong to a different genus). Maybe give GTDB a try.
Thank you @pirovc for your suggestions — I’ll try them out as soon as possible and will update the results.
By the way, I noticed that the Ganon2 preprint has been available for quite a while but hasn’t been published in a journal yet. I'm just curious if there’s any update on its publication status — I’m really looking forward to seeing it officially released, as I’m very interested in its improvements and future developments.
Thanks again for your time and support!
The paper is accepted and it will soon be online.
Thank you @pirovc for your suggested solutions. I’ve tried each of them separately — for example, lowering --rel-filter to 0.05 and even 0, switching the algorithm used to solve multiple matches, and also modifying the report type. I also tested various combinations of these parameters.
Unfortunately, the results remained the same — E. coli still appeared with high abundance.
I also used the --output-all option to extract some reads classified as E. coli and ran BLAST on them. The results showed 100% identity to other species, not E. coli. This leads me to believe the issue might actually stem from the database.
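For anyone wanting to repeat this check, the extraction step can be sketched roughly as follows. It assumes the --output-all file is tab-separated (read id, target, k-mer count), that the target is the species taxid when the database is built with --level species (562 for E. coli), and that read ids match the first whitespace-delimited token of the FASTQ header in plain 4-line FASTQ. Both function names are hypothetical helpers, not part of ganon.

```shell
# Print the ids of reads with at least one match to a given target.
read_ids_for_target() {
    # $1 = .all file, $2 = target (e.g. taxid 562)
    awk -F'\t' -v t="$2" '$2 == t { print $1 }' "$1" | sort -u
}

# Read 4-line FASTQ on stdin and write FASTA for the ids listed in file $1.
fastq_subset_to_fasta() {
    awk 'NR == FNR { keep[$1]; next }                       # first input: id list
         FNR % 4 == 1 { id = substr($1, 2); hit = (id in keep) }  # header line, strip "@"
         FNR % 4 == 2 && hit { print ">" id; print }        # sequence line of a kept read
    ' "$1" -
}

# Example usage (file names are placeholders):
#   read_ids_for_target sample.all 562 > ecoli_ids.txt
#   gzip -dc sample_1.fastq.gz | fastq_subset_to_fasta ecoli_ids.txt > ecoli_reads.fasta
```

The resulting FASTA can then be submitted to BLAST. If the read headers carry mate suffixes such as /1, the ids may need to be adjusted before matching.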
Upon checking, I noticed that the default k-mer size used in the build was 19 bp, which seems quite short — especially for distinguishing between highly similar regions across species. Do you think this could be contributing to the misclassification?
Thanks again for your help and support!
Increasing the k-mer size will have a slight benefit for sensitivity, but at the cost of a larger database. It's worth a try, but I don't think it will solve your issue.
I think the best alternative here would be to use a limited number of reference sequences for each species (check these instructions). E. coli and K. pneumoniae are among the most represented species in RefSeq: of the 435374 RefSeq bacterial assemblies available today, 44348 are from E. coli (~10%) and 25319 from K. pneumoniae (~6%). One possibility is that conserved regions in your reads (e.g. parts of the 16S gene, or contamination) are matching those references. Using only a few top assemblies per species may reduce the overall diversity of your database but may solve your issue.
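The shares quoted above can be reproduced with a couple of lines of shell. This is only the arithmetic on the counts given in this thread, and the function name is made up for illustration.

```shell
# Percentage of RefSeq bacterial assemblies belonging to one species.
refseq_share() {
    # $1 = assemblies for the species, $2 = total bacterial assemblies
    LC_ALL=C awk -v n="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * n / total }'
}

refseq_share 44348 435374   # E. coli
refseq_share 25319 435374   # K. pneumoniae
```

With the counts above this gives about 10.2 and 5.8 percent, so these two species alone account for roughly one in six bacterial assemblies in the database.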
To fully understand this issue with ganon, I would build a database of only E. coli genomes at assembly level:
ganon build --source refseq --taxid 562 --level assembly
and then examine exactly where your reads are mapping.
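Once reads are classified against such an assembly-level database with --output-all, a small tally like the sketch below can show which assemblies attract the most reads. It assumes the output file is tab-separated (read id, target, k-mer count), that targets are assembly accessions at this level, and that each read/target pair appears on one line. The function name is hypothetical.

```shell
# List the targets that the most reads match, from a ganon ".all" file.
reads_per_target() {
    # $1 = .all file, $2 = number of targets to show
    awk -F'\t' '
        { c[$2]++ }                                   # one line per read/target pair
        END { for (t in c) printf "%d\t%s\n", c[t], t }
    ' "$1" | sort -k1,1nr -k2,2 | head -n "$2"
}

# Example usage (file name is a placeholder):
#   reads_per_target ecoli_assembly.all 20
```

If a handful of assemblies collect most of the reads, those are good candidates to inspect for contamination or shared conserved regions.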
Thank you @pirovc for your recommendations. It looks like this will take some time for me to try out and evaluate properly. I'll make sure to share any updates as soon as I can. I hope the issue I encountered can be helpful for you and other users as well, to avoid similar problems in the future.