missing some organisms with tax annotate
I used tax annotate to get the lineages for my fastmultigather results and noticed that several organisms were missing (10 out of ~4,700 species in my specific results). These include several mammals that have taxonomic information on NCBI RefSeq. Although I can check for missing columns and add this information manually, I would appreciate any thoughts on how some organisms get skipped (and maybe a recommendation for reporting missing entries).
sourmash tax annotate -g ERR1395610.x.entire.csv -t /group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb
Some of the Missing Individuals
- GCA_029281585.3 western lowland gorilla (Gorilla gorilla gorilla)
- GCA_028885655.3 Sumatran orangutan (Pongo abelii)
- GCA_029289425.3 pygmy chimpanzee (Pan paniscus)
- GCA_017312705.2 Penaeus japonicus
- GCA_002575655.3 Aegilops tauschii subsp. strangulata
- GCA_902459505.2 Gaboon caecilian (Geotrypetes seraphini)
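One way to list every skipped match, rather than spotting them by eye, is to compare the accessions in the gather results against the identifiers in the lineages file. The following is a minimal sketch, not part of sourmash. It assumes the lineages are available as a CSV (not the .sqldb) with an `ident` column, that the fastmultigather CSV has a `match_name` column whose first whitespace-separated token is the accession, and the helper names `accession`, `read_rows`, and `missing_idents` are hypothetical.

```python
import csv


def accession(name, keep_version=True):
    """Return the leading accession token of a match name or ident."""
    token = name.split(None, 1)[0] if name.strip() else ""
    # Optionally drop the ".N" version suffix, in case the lineages
    # file stores identifiers without versions (an assumption to check).
    return token if keep_version else token.split(".", 1)[0]


def read_rows(path):
    """Load a CSV file as a list of dicts keyed by column name."""
    with open(path, newline="") as fp:
        return list(csv.DictReader(fp))


def missing_idents(gather_rows, lineage_rows, name_col="match_name",
                   ident_col="ident", keep_version=True):
    """List accessions in the gather rows that have no lineage entry."""
    known = {accession(r[ident_col], keep_version) for r in lineage_rows}
    missing = []
    for row in gather_rows:
        acc = accession(row[name_col], keep_version)
        if acc and acc not in known and acc not in missing:
            missing.append(acc)
    return missing


# Example call (file names here are placeholders):
# gather = read_rows("ERR1395610.x.entire.csv")
# lineages = read_rows("entire-2025-01-21.lineages.csv")
# for acc in missing_idents(gather, lineages):
#     print(acc)
```

Running it once with `keep_version=True` and once with `keep_version=False` would show whether an entry is missing entirely or only listed under a different accession version, which could be useful detail when reporting missing entries.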
per https://github.com/sourmash-bio/sourmash/issues/3504, this code:
https://github.com/sourmash-bio/2025-sourmash-eukaryotic-databases/blob/main/Snakefile#L122
rule lineages_csv:
    input:
        "collections/{NAME}.links.csv",
    output:
        "databases/{NAME}.lineages.csv",
    shell: """
        scripts/taxid-to-lineages.taxonkit.py {input} -o {output}
    """
was used to generate the lineages CSV file.