
hts_SeqScreener enhancements for bigger references

Open samhunter opened this issue 4 years ago • 0 comments

hts_SeqScreener is meant to filter or identify reads originating from specific source sequences (PhiX by default, but also ribosomal sequences, adapters, etc.).

Is your enhancement request related to a problem? Please describe.

Currently hts_SeqScreener is not optimized for large references. It has had little or no testing on human-sized genomes (~3 Gbp), and at that scale it is expected to work poorly and to be very slow.

Describe the solution you'd like

A number of alternative algorithms and data structures have been designed to speed up similar processes. Read mapping is essentially the same problem:

Minimap2: https://github.com/lh3/minimap2#algo

Minimizer schemes:
https://www.biorxiv.org/content/10.1101/652925v1.full.pdf
https://homolog.us/blogs/bioinfo/2017/10/25/intro-minimizer/
https://pdfs.semanticscholar.org/18a3/3e90b5e6872d33e32c4b9bd6f2fe577be8d6.pdf

But there is also Kraken2: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0

Implementing something similar to what is used in one of these tools could make screening against a human-sized genome possible.

Additional context

samhunter, Jun 09 '20 22:06