sourmash `sourmash tax prepare` fails with `No taxonomic identifiers found.`

Open taylorreiter opened this issue 1 year ago • 9 comments

Command and output pasted below. The lineages CSV is attached and also reproduced below!

sourmash tax prepare --taxonomy-csv inputs/sourmash_databases/cheesegenomes.lineages.csv -o tmp.sqldb

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from 'inputs/sourmash_databases/cheesegenomes.lineages.csv': No taxonomic identifiers found.

cheesegenomes.lineages.csv:

ident,taxid,superkingdom,phylum,class,order,family,genus,species,strain
pcamembertiSAM3_3runs.flye.diamond_microbeProteome922.fs_corrected.pilon,5075,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium camemberti,SAM3_3
pen12.pilon,2720512,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium sp.,12
rs17.pilon,5081,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium sp.,RS-17
geo.pilon,1173061,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Dipodascaceae,Geotrichum,Geotrichum candidum,geo
JBC_canu.pilon,229535,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium nordicum,JBC
JB370.pilon,40374,Eukaryota,Ascomycota,Sordariomycetes,Microascales,Microascaceae,Scopulariopsis,Scopulariopsis sp.,JB370
135e.pilon,45537,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,,Diutina,Diutina catenulata,135e
135B.pilon,4959,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Debaryomycetaceae,Debaryomyces,Debaryomyces hansenii,135B

I can't think what would be causing this... I tried to essentially copy the GenBank lineage format.

Oct 12 '22 16:10 taylorreiter

Probably should have tagged @bluegenes in this!

Oct 12 '22 16:10 taylorreiter

some sort of weird formatting issue that affects the csv module but not pandas.read_csv.

The file is in DOS format but ... weird. Nothing (vi, emacs, Mac OS Numbers) has a problem with it!

python code to reproduce:

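import csv

# Path to the lineages file from the report above.
filename = 'cheesegenomes.lineages.csv'

with open(filename, newline='') as fp:
    r = csv.reader(fp)
    for row in r:
        print(row)  # first row only, i.e. the header
        break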

tl;dr open, save as CSV, try again.

Oct 13 '22 16:10 ctb

ya that's deeply annoying and the solution. I read it into R and wrote it out again and the problems were fixed. Doing so in vim or excel did not fix it. le sigh. thank you for your help!!!!

Oct 13 '22 20:10 taylorreiter

leave this open and I'll add something to the error output listing the headers that WERE found...

Oct 13 '22 20:10 ctb

🪄 🌟 thank you!

Oct 13 '22 20:10 taylorreiter

ah-hah! figured it out:

the file starts with a "byte order mark (BOM)", which marks it as UTF-8 encoded; the csv module reads the BOM as part of the first header name, so `ident` is never matched. See https://stackoverflow.com/questions/50130605/python-2-7-csv-file-read-write-xef-xbb-xbf-code.

I'm not sure what the right move is here but at least I know what it is now!
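
One possible workaround, sketched here rather than proposed as what sourmash should do: Python's standard `utf-8-sig` codec consumes a leading BOM if one is present and otherwise behaves like plain `utf-8`. The helper `read_header` and the tiny stand-in file are hypothetical and only exist to show the difference between the two encodings.

```python
import csv
import os
import tempfile


def read_header(path, encoding):
    """Return the first CSV row of `path`, decoded with `encoding`.

    Hypothetical helper for illustration; not part of sourmash.
    """
    with open(path, newline='', encoding=encoding) as fp:
        return next(csv.reader(fp))


# Write a small stand-in file that starts with the UTF-8 BOM (bytes EF BB BF)
# and uses DOS line endings, like the file in this report.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'\xef\xbb\xbfident,taxid,superkingdom\r\n')
    tmp.write(b'geo.pilon,1173061,Eukaryota\r\n')
    path = tmp.name

plain = read_header(path, 'utf-8')         # BOM stays attached to the first name
stripped = read_header(path, 'utf-8-sig')  # BOM is consumed by the codec
os.remove(path)

print(plain)
print(stripped)
```

With plain `utf-8` the first header comes back as `'\ufeffident'`; with `utf-8-sig` it comes back as `'ident'`. Files without a BOM read the same either way.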

Oct 13 '22 21:10 ctb

PR #2333 adds the following output:

% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'

Note that the error output ("headers are") will be standard across all CSV-loading attempts; this is just an example using the tax summarize command (also new in #2333).
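
For anyone wondering how a check like this can surface the stray character, here is a minimal sketch. It is not the code from #2333: `check_headers` and `IDENT_COLUMNS` are hypothetical names, and treating `ident` as the only accepted identifier column is an assumption made for the example.

```python
import csv
import io

# Assumption for this sketch: 'ident' is the only accepted identifier column.
IDENT_COLUMNS = {"ident"}


def check_headers(fieldnames):
    """Raise ValueError listing the headers found when no identifier column is present.

    Hypothetical helper; not sourmash's actual implementation.
    """
    if not IDENT_COLUMNS.intersection(fieldnames):
        # repr() makes invisible characters such as the BOM visible.
        found = ",".join(repr(name) for name in fieldnames)
        raise ValueError(f"No taxonomic identifiers found; headers are {found}")
    return fieldnames


# Header line as the csv module sees it when the BOM has not been stripped.
text = "\ufeffident,taxid,superkingdom\r\n"
reader = csv.DictReader(io.StringIO(text, newline=""))

try:
    check_headers(reader.fieldnames)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Printing the headers with `repr()` is the useful part: the BOM is invisible in most editors, but it shows up as an escaped character in the error message.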

Oct 15 '22 16:10 ctb

asking question here:

https://twitter.com/ctitusbrown/status/1581666825623855104

Oct 16 '22 15:10 ctb

This Arrow PR adds support for BOM: https://github.com/apache/arrow/pull/11892

Oct 16 '22 16:10 ctb
