missing some organisms with tax annotate

Open · bettafische opened this issue 7 months ago · 1 comment

I used tax annotate to get the lineages for my fastmultigather results and noticed that several organisms were missing (10 out of ~4,700 species in my specific results). These include several mammals that have taxonomic information in NCBI RefSeq. Although I can check for missing columns and manually add this information, I would appreciate any thoughts on how some organisms get skipped (and maybe a recommendation for reporting missing entries).

sourmash tax annotate -g ERR1395610.x.entire.csv -t /group/ctbrowngrp5/sourmash-db/entire-2025-01-21/entire-2025-01-21.lineages.sqldb

Some of the Missing Individuals

  1. GCA_029281585.3 western lowland gorilla (Gorilla gorilla gorilla)
  2. GCA_028885655.3 Sumatran orangutan (Pongo abelii)
  3. GCA_029289425.3 pygmy chimpanzee (Pan paniscus)
  4. GCA_017312705.2 Penaeus japonicus
  5. GCA_002575655.3 Aegilops tauschii subsp. strangulata
  6. GCA_902459505.2 Gaboon caecilian (Geotrypetes seraphini)
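
For anyone who wants to list the skipped entries themselves, here is a minimal sketch of the check, not part of sourmash. It assumes the gather results have a match_name column whose first whitespace-delimited token is the accession, and that the lineages are available as a CSV with an ident column; both column names and the helper names (ident_of, missing_idents, load_rows) are assumptions for illustration. Comparing once with and once without the version suffix is only a diagnostic to show whether an accession is absent entirely or present under a different version.

```python
import csv


def ident_of(name, keep_version=True):
    """Return the accession: the first space-delimited token of a name."""
    ident = name.split(" ", 1)[0]
    if not keep_version:
        # Drop the ".N" version suffix, e.g. GCA_029281585.3 -> GCA_029281585
        ident = ident.split(".", 1)[0]
    return ident


def missing_idents(gather_rows, lineage_rows,
                   name_col="match_name",  # assumed gather column name
                   ident_col="ident",      # assumed lineages column name
                   keep_version=True):
    """List accessions from the gather rows that have no lineage row."""
    known = {ident_of(row[ident_col], keep_version) for row in lineage_rows}
    missing = []
    for row in gather_rows:
        ident = ident_of(row[name_col], keep_version)
        if ident not in known and ident not in missing:
            missing.append(ident)
    return missing


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```

A possible use is missing_idents(load_rows("ERR1395610.x.entire.csv"), load_rows(lineages_csv)), where lineages_csv is a placeholder for a lineages CSV export. Running it again with keep_version=False shows which of the missing accessions are found once versions are ignored.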

bettafische · Mar 28 '25 22:03

per https://github.com/sourmash-bio/sourmash/issues/3504, this code:

https://github.com/sourmash-bio/2025-sourmash-eukaryotic-databases/blob/main/Snakefile#L122

rule lineages_csv:
    input:
        "collections/{NAME}.links.csv",
    output:
        "databases/{NAME}.lineages.csv",
    shell: """
        scripts/taxid-to-lineages.taxonkit.py {input} -o {output}
    """

was used to generate the lineages CSV file.

ctb · Mar 29 '25 12:03