sourmash
Missing species designation when using `sourmash lca classify` on a Haemophilus influenzae genome
Dear sourmash team,
I have been using your great tool for years and have now come across some strange behavior.
Issue: The species designation is missing when using `sourmash lca classify` on a Haemophilus influenzae genome.
Example FASTA: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=1355929925&rettype=fasta
I tried a few other H. influenzae strains and saw the same missing-species issue. Other species, such as E. coli and S. aureus, work fine.
Database used: https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-k31.lca.json.gz (same issue with an older DB)
Version: sourmash 4.8.14
Results:
| ID | status | superkingdom | phylum | class | order | family | genus | species | strain |
|---|---|---|---|---|---|---|---|---|---|
| NZ_CP020010.1 Haemophilus influenzae strain 67P38H1 chromosome, complete genome | found | d__Bacteria | p__Pseudomonadota | c__Gammaproteobacteria | o__Enterobacterales_A | f__Pasteurellaceae | g__Haemophilus |  |  |
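To check more genomes for the same symptom in bulk, a minimal sketch along these lines could flag affected rows. It assumes the classify output has been saved as CSV with the column names shown in the results table above; `rows_missing_species` and `lowest_assigned_rank` are made-up helper names for illustration and are not part of sourmash.

```python
import csv
import io

# Rank columns as they appear in the results table above, shallowest to deepest.
RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species", "strain"]


def lowest_assigned_rank(row):
    """Return the deepest rank column that has a non-empty value, or None."""
    lowest = None
    for rank in RANKS:
        if (row.get(rank) or "").strip():
            lowest = rank
    return lowest


def rows_missing_species(csv_text):
    """List (ID, deepest assigned rank) for 'found' rows with an empty species."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        species = (row.get("species") or "").strip()
        if row.get("status") == "found" and not species:
            flagged.append((row["ID"], lowest_assigned_rank(row)))
    return flagged


# The single row from the results table above, written as CSV.
example = (
    "ID,status,superkingdom,phylum,class,order,family,genus,species,strain\n"
    '"NZ_CP020010.1 Haemophilus influenzae strain 67P38H1 chromosome, '
    'complete genome",found,d__Bacteria,p__Pseudomonadota,'
    "c__Gammaproteobacteria,o__Enterobacterales_A,f__Pasteurellaceae,"
    "g__Haemophilus,,\n"
)

for genome_id, rank in rows_missing_species(example):
    print(f"{genome_id}: classification stops at {rank}")
```

For the row above, this reports that the classification stops at the genus rank.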
Side info: results when using `sourmash gather`:
| intersect_bp | f_orig_query | f_match | f_unique_to_query | f_unique_weighted | average_abund | median_abund | std_abund | filename | name | md5 | f_match_orig | unique_intersect_bp | gather_result_rank | remaining_bp | query_filename | query_name | query_md5 | query_bp | ksize | moltype | scaled | query_n_hashes | query_abundance | query_containment_ani | match_containment_ani | average_containment_ani | max_containment_ani | potential_false_negative | n_unique_weighted_found | sum_weighted_found | total_weighted_hashes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1840000 | 1 | 1 | 1 | 1 |  |  |  | gtdb-rs214-k31.lca.json.gz | GCF_002966675.1 Haemophilus influenzae strain=67P56H1, ASM296667v1 | 2f2a17a4bfe161d8e25ecbc0beaffd27 | 1 | 1840000 | 0 | 0 | h.influenzae.fasta |  | e0f34f5b | 184000 | 31 | DNA | 10000 | 184 | FALSE | 1 | 1 | 1 | 1 | FALSE | 184 | 184 |  |