sourmash
Add human reference genome to prepared databases
Hi Titus et al., given the recent fiasco related to mapping reads to microbial databases without human references (links at bottom), it might be a good time to create a small human genome database for use with sourmash. A standalone database on the database page would be ideal, so that researchers can include it with the other databases of interest.
Thanks for considering!
social media discussion: https://twitter.com/StevenSalzberg1/status/1686350449069244416
pre-print: https://doi.org/10.1101/2023.07.28.550993
On the "raw" side [^1] there are both GRCh38.p14 and T2T-CHM13v2.0 signatures in wort, would that work?
[^1]: just downloaded the data and calculated a signature, no other pre-processing like repeat masking
Yep! Those should be plenty.
Repo to sketch hg38, including all unmapped chromosomes: https://github.com/ctb/2024-human-sketch
note: decontaminating human WGS samples, https://github.com/sourmash-bio/sourmash/issues/3151
download at: https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/hg38/hg38-entire.sig.zip
https://github.com/sourmash-bio/2024-sketch-animal-genomes
added here - https://github.com/sourmash-bio/sourmash/pull/3422 - should add the t2t ones, too, though.
@ccbaumler suggests adding more animal genomes over in https://github.com/sourmash-bio/sourmash/pull/3422#issuecomment-2525421661:
rat https://www.ncbi.nlm.nih.gov/datasets/taxonomy/10116/
xenopus https://www.ncbi.nlm.nih.gov/datasets/taxonomy/8355/
zebrafish https://www.ncbi.nlm.nih.gov/datasets/taxonomy/7955/
drosophila https://ncbi.nlm.nih.gov/datasets/taxonomy/7227/
c. elegans https://www.ncbi.nlm.nih.gov/datasets/taxonomy/6239/
Rather than doing these piecemeal, I think we should come up with a set of accessions we care about and then use directsketch to get them, so for now I'm punting on that suggestion, but it is definitely the way we want to go!
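As a starting point for that "set of accessions" step, here is a minimal Python sketch that gathers the species suggested above into one CSV worksheet. The taxonomy IDs are copied from the links in this thread. The column names (`name`, `taxid`, `accession`) and the `build_worksheet` helper are hypothetical, and this is not claimed to be the input format directsketch expects. The `accession` column is left blank on purpose, since nobody has picked assemblies yet.

```python
import csv
import io

# Candidate species and NCBI taxonomy IDs, copied from the links in this thread.
CANDIDATES = [
    ("rat", 10116),
    ("xenopus", 8355),
    ("zebrafish", 7955),
    ("drosophila", 7227),
    ("c. elegans", 6239),
]


def build_worksheet(candidates):
    """Return CSV text with one row per species; accession is left blank.

    The column layout is an assumption made for illustration only.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["name", "taxid", "accession"])
    for name, taxid in candidates:
        # Accession is intentionally empty until an assembly is chosen.
        writer.writerow([name, taxid, ""])
    return buf.getvalue()


if __name__ == "__main__":
    print(build_worksheet(CANDIDATES), end="")
```

Once the accession column is filled in, the same file could be reshaped into whatever directsketch actually takes as input.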