sourmash
`sourmash tax prepare` fails with `No taxonomic identifiers found.`
Command and output pasted below. Lineages CSV attached and reproduced below!
sourmash tax prepare --taxonomy-csv inputs/sourmash_databases/cheesegenomes.lineages.csv -o tmp.sqldb
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from 'inputs/sourmash_databases/cheesegenomes.lineages.csv': No taxonomic identifiers found.
ident,taxid,superkingdom,phylum,class,order,family,genus,species,strain
pcamembertiSAM3_3runs.flye.diamond_microbeProteome922.fs_corrected.pilon,5075,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium camemberti,SAM3_3
pen12.pilon,2720512,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium sp.,12
rs17.pilon,5081,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium sp.,RS-17
geo.pilon,1173061,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Dipodascaceae,Geotrichum,Geotrichum candidum,geo
JBC_canu.pilon,229535,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium nordicum,JBC
JB370.pilon,40374,Eukaryota,Ascomycota,Sordariomycetes,Microascales,Microascaceae,Scopulariopsis,Scopulariopsis sp.,JB370
135e.pilon,45537,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,,Diutina,Diutina catenulata,135e
135B.pilon,4959,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Debaryomycetaceae,Debaryomyces,Debaryomyces hansenii,135B
I can't think what would be causing this...I tried to essentially copy the genbank lineage formats.
Probably should have tagged @bluegenes in this!
some sort of weird formatting issue that affects the `csv` module but not `pandas.read_csv`.
The file is in DOS format but ... weird. Nothing (vi, emacs, Mac OS Numbers) has a problem with it!
python code to reproduce:
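import csv

# path taken from the command at the top of this issue
filename = 'inputs/sourmash_databases/cheesegenomes.lineages.csv'

with open(filename, newline='') as fp:
    r = csv.reader(fp)
    for row in r:
        print(row)  # only the header row is needed
        break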
tl;dr open, save as CSV, try again.
ya that's deeply annoying and the solution. I read it into R and wrote it out again and the problems were fixed. Doing so in vim or excel did not fix it. le sigh. thank you for your help!!!!
leave this open and I'll add something to the error output listing the headers that WERE found...
🪄 🌟 thank you!
ah-hah! figured it out:
the `\ufeff` at the start of the first header is the "byte order mark (BOM)", which signals that this file is UTF-8 encoded. See https://stackoverflow.com/questions/50130605/python-2-7-csv-file-read-write-xef-xbb-xbf-code.
I'm not sure what the right move is here but at least I know what it is now!
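One possible move, sketched here as an illustration and not as what sourmash does: Python's `utf-8-sig` codec consumes a leading BOM, while plain `utf-8` leaves it attached to the first header. The file below is a small stand-in written to a temp path (made-up contents apart from the header names), and `read_header` is a hypothetical helper.

```python
import csv
import os
import tempfile


def read_header(path, encoding):
    """Return the first row of a CSV file opened with the given encoding."""
    with open(path, newline='', encoding=encoding) as fp:
        return next(csv.reader(fp))


# Stand-in file: UTF-8 BOM at the start, DOS line endings (assumed, for illustration).
tmp = tempfile.NamedTemporaryFile(suffix='.csv', delete=False)
tmp.write(b'\xef\xbb\xbfident,taxid,superkingdom\r\n'
          b'geo.pilon,1173061,Eukaryota\r\n')
tmp.close()

plain = read_header(tmp.name, 'utf-8')         # BOM stays glued to the first header
stripped = read_header(tmp.name, 'utf-8-sig')  # BOM is consumed by the codec

print(plain)
print(stripped)

os.unlink(tmp.name)
```

With `utf-8-sig` the first column comes back as `ident`, so a lookup by header name would succeed. Files without a BOM read the same way under `utf-8-sig`, which is why it is a common default for CSVs of unknown origin.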
PR #2333 adds the following output:
% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'
Note that the error output ("headers are") will be standard across all CSV-loading attempts; this is just an example using the `tax summarize` command (also new in #2333).
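For anyone curious how this kind of message can be produced, here is a rough sketch. It is not the code from #2333: `find_ident_column` is a hypothetical helper, and the assumption that the identifier column must be named exactly `ident` is made only for this example.

```python
import csv
import io


def find_ident_column(fp):
    """Hypothetical helper: return 'ident' if present, else raise listing the headers found."""
    reader = csv.DictReader(fp)
    headers = reader.fieldnames or []
    if 'ident' in headers:
        return 'ident'
    # repr() makes invisible characters such as a BOM show up in the message
    found = ",".join(repr(h) for h in headers)
    raise ValueError(f"No taxonomic identifiers found; headers are {found}")


# A header row with a leading BOM, as in the file from this issue.
with_bom = io.StringIO('\ufeffident,taxid\r\ngeo.pilon,1173061\r\n')
try:
    find_ident_column(with_bom)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Printing the headers with `repr()` is the important part: the BOM is invisible in most editors, but shows up as `\ufeff` once escaped.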
asking question here:
https://twitter.com/ctitusbrown/status/1581666825623855104
This Arrow PR adds support for BOM: https://github.com/apache/arrow/pull/11892