
Add human reference genome to prepared databases

Open dportik opened this issue 2 years ago • 5 comments

Hi Titus et al., given the recent fiasco related to mapping reads to microbial databases without human references (links at bottom), it might be a good time to create a small human genome database for use with sourmash. A standalone database on the database page would be ideal, so that researchers can include it with the other databases of interest.

Thanks for considering!

social media discussion: https://twitter.com/StevenSalzberg1/status/1686350449069244416
pre-print: https://doi.org/10.1101/2023.07.28.550993

dportik avatar Aug 16 '23 21:08 dportik

On the "raw" side [^1] there are both GRCh38.p14 and T2T-CHM13v2.0 signatures in wort. Would those work?

[^1]: just downloaded the data and calculated a signature, no other pre-processing like repeat masking

luizirber avatar Aug 17 '23 03:08 luizirber

Yep! Those should be plenty.

dportik avatar Aug 22 '23 17:08 dportik

Repo to sketch hg38, including all unmapped chromosomes: https://github.com/ctb/2024-human-sketch

ctb avatar May 11 '24 15:05 ctb

note: decontaminating human WGS samples, https://github.com/sourmash-bio/sourmash/issues/3151

ctb avatar May 11 '24 15:05 ctb

download at: https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/hg38/hg38-entire.sig.zip

ctb avatar May 11 '24 15:05 ctb
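
For anyone picking up the hg38 download linked above, here is a minimal sketch of fetching the file and checking that it arrived as an intact zip archive before handing it to sourmash. This is an illustration only, not something taken from the linked repos: the helper names `fetch` and `list_zip_members` are hypothetical, and the only value copied from the thread is the URL.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL copied from the comment above; everything else here is illustrative.
HG38_URL = "https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/hg38/hg38-entire.sig.zip"


def fetch(url, dest):
    """Download url to dest unless dest already exists; return dest as a Path."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def list_zip_members(path):
    """Return the member names of a zip archive; raise ValueError if it is not one."""
    path = Path(path)
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a valid zip archive (truncated download?)")
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()


# Usage (performs a network download, so it is left commented out):
#   db = fetch(HG38_URL, "hg38-entire.sig.zip")
#   print(list_zip_members(db))
```

Once the archive checks out, it can presumably be listed next to other databases on a `sourmash gather` command line, provided the query was sketched with a compatible k-mer size and scaled value.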

https://github.com/sourmash-bio/2024-sketch-animal-genomes

ctb avatar Dec 06 '24 18:12 ctb

added here - https://github.com/sourmash-bio/sourmash/pull/3422 - should add the t2t ones, too, though.

ctb avatar Dec 07 '24 17:12 ctb

@ccbaumler suggests adding more animal genomes over in https://github.com/sourmash-bio/sourmash/pull/3422#issuecomment-2525421661:

  1. rat https://www.ncbi.nlm.nih.gov/datasets/taxonomy/10116/

  2. xenopus https://www.ncbi.nlm.nih.gov/datasets/taxonomy/8355/

  3. zebrafish https://www.ncbi.nlm.nih.gov/datasets/taxonomy/7955/

  4. drosophila https://ncbi.nlm.nih.gov/datasets/taxonomy/7227/

  5. c. elegans https://www.ncbi.nlm.nih.gov/datasets/taxonomy/6239/

Rather than doing these piecemeal, I think we should come up with a set of accessions we care about and then use directsketch to get them, so for now I'm punting on that suggestion, but it is definitely the way we want to go!

ctb avatar Dec 08 '24 14:12 ctb
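
One way to start on "a set of accessions we care about" is to keep them in a small CSV that a sketching tool can consume. The sketch below is a rough illustration under stated assumptions: the `accession,name` column layout is my understanding of what directsketch's `gbsketch` command takes as input and should be checked against its README, the helper `write_accession_csv` is hypothetical, and the two RefSeq accessions are what I believe correspond to the GRCh38.p14 and T2T-CHM13v2.0 assemblies mentioned earlier in the thread (verify them at NCBI before use).

```python
import csv
import re

# NCBI assembly accessions have the shape GCF_/GCA_ + 9 digits + ".version".
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")

# Assumed accessions for the two human assemblies named in this thread;
# confirm against NCBI before relying on them.
ASSEMBLIES = [
    ("GCF_000001405.40", "GCF_000001405.40 Homo sapiens GRCh38.p14"),
    ("GCF_009914755.1", "GCF_009914755.1 Homo sapiens T2T-CHM13v2.0"),
]


def write_accession_csv(entries, out_path):
    """Write (accession, name) pairs to a CSV, skipping duplicate accessions.

    Raises ValueError on anything that does not look like an assembly
    accession. Returns the number of rows written.
    """
    seen = set()
    rows = []
    for accession, name in entries:
        if not ACCESSION_RE.match(accession):
            raise ValueError(f"not an assembly accession: {accession!r}")
        if accession in seen:
            continue
        seen.add(accession)
        rows.append({"accession": accession, "name": name})

    # Column names are an assumption about the expected input format.
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["accession", "name"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Usage:
#   write_accession_csv(ASSEMBLIES, "animal-genomes.csv")
```

Keeping the list in one file means the rat, xenopus, zebrafish, drosophila, and C. elegans assemblies suggested above can be appended as rows once specific accessions are chosen, rather than being sketched one at a time.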