
Missing species designation when using `sourmash lca classify` on a Haemophilus influenzae

Open replikation opened this issue 8 months ago • 2 comments

Dear sourmash team,

I have been using your great tool for years now and have stumbled upon some strange behavior.

Issue: The species designation is missing when running `sourmash lca classify` on a Haemophilus influenzae genome.

Example FASTA: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=1355929925&rettype=fasta

I tried a few other H. influenzae strains and saw the same missing species. Other species, such as E. coli and S. aureus, work fine.

Database used: https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-k31.lca.json.gz (the same issue occurs with an older DB)

Version: sourmash 4.8.14

Results:

| ID | status | superkingdom | phylum | class | order | family | genus | species | strain |
|---|---|---|---|---|---|---|---|---|---|
| NZ_CP020010.1 Haemophilus influenzae strain 67P38H1 chromosome, complete genome | found | d__Bacteria | p__Pseudomonadota | c__Gammaproteobacteria | o__Enterobacterales_A | f__Pasteurellaceae | g__Haemophilus | | |
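For anyone who wants to check many genomes for the same symptom, here is a minimal Python sketch that reports which rows of the `sourmash lca classify` output lack a species and what the deepest assigned rank is. It assumes the output is a comma-separated file with the header shown above. The helper names `deepest_rank` and `rows_missing_species` are hypothetical, and the `example` string is the result row from this issue retyped as CSV, not a file produced by sourmash.

```python
import csv
import io

# Rank columns in the order they appear in the classify output header.
RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species", "strain"]


def deepest_rank(row):
    """Return (rank, name) for the most specific non-empty rank, or None."""
    deepest = None
    for rank in RANKS:
        value = (row.get(rank) or "").strip()
        if value:
            deepest = (rank, value)
    return deepest


def rows_missing_species(csv_text):
    """Return the IDs of all rows whose species column is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["ID"] for row in reader
            if not (row.get("species") or "").strip()]


# The result row from this issue, retyped as CSV (assumed layout).
example = (
    "ID,status,superkingdom,phylum,class,order,family,genus,species,strain\n"
    '"NZ_CP020010.1 Haemophilus influenzae strain 67P38H1 chromosome, '
    'complete genome",'
    "found,d__Bacteria,p__Pseudomonadota,c__Gammaproteobacteria,"
    "o__Enterobacterales_A,f__Pasteurellaceae,g__Haemophilus,,\n"
)

if __name__ == "__main__":
    for row in csv.DictReader(io.StringIO(example)):
        print(row["ID"], "->", deepest_rank(row))
```

For the row above, the classification stops at the genus rank, which matches the empty species column reported here.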

Side note: for comparison, these are the results from `sourmash gather` on the same genome:

| column | value |
|---|---|
| intersect_bp | 1840000 |
| f_orig_query | 1 |
| f_match | 1 |
| f_unique_to_query | 1 |
| f_unique_weighted | 1 |
| average_abund | |
| median_abund | |
| std_abund | |
| filename | gtdb-rs214-k31.lca.json.gz |
| name | GCF_002966675.1 Haemophilus influenzae strain=67P56H1, ASM296667v1 |
| md5 | 2f2a17a4bfe161d8e25ecbc0beaffd27 |
| f_match_orig | 1 |
| unique_intersect_bp | 1840000 |
| gather_result_rank | 0 |
| remaining_bp | 0 |
| query_filename | h.influenzae.fasta |
| query_name | |
| query_md5 | e0f34f5b |
| query_bp | 184000 |
| ksize | 31 |
| moltype | DNA |
| scaled | 10000 |
| query_n_hashes | 184 |
| query_abundance | FALSE |
| query_containment_ani | 1 |
| match_containment_ani | 1 |
| average_containment_ani | 1 |
| max_containment_ani | 1 |
| potential_false_negative | FALSE |
| n_unique_weighted_found | |
| sum_weighted_found | 184 |
| total_weighted_hashes | 184 |

replikation, Feb 24 '25 15:02