`sourmash tax prepare` fails with `No taxonomic identifiers found.`

Open taylorreiter opened this issue 1 year ago • 9 comments

Command and output pasted below. The lineages CSV is attached and also reproduced below.

sourmash tax prepare --taxonomy-csv inputs/sourmash_databases/cheesegenomes.lineages.csv -o tmp.sqldb

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from 'inputs/sourmash_databases/cheesegenomes.lineages.csv': No taxonomic identifiers found.

cheesegenomes.lineages.csv:

ident,taxid,superkingdom,phylum,class,order,family,genus,species,strain
pcamembertiSAM3_3runs.flye.diamond_microbeProteome922.fs_corrected.pilon,5075,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium camemberti,SAM3_3
pen12.pilon,2720512,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium sp.,12
rs17.pilon,5081,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium sp.,RS-17
geo.pilon,1173061,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Dipodascaceae,Geotrichum,Geotrichum candidum,geo
JBC_canu.pilon,229535,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium nordicum,JBC
JB370.pilon,40374,Eukaryota,Ascomycota,Sordariomycetes,Microascales,Microascaceae,Scopulariopsis,Scopulariopsis sp.,JB370
135e.pilon,45537,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,,Diutina,Diutina catenulata,135e
135B.pilon,4959,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Debaryomycetaceae,Debaryomyces,Debaryomyces hansenii,135B

I can't think what would be causing this... I tried to essentially copy the GenBank lineage formats.

taylorreiter avatar Oct 12 '22 16:10 taylorreiter

Probably should have tagged @bluegenes in this!

taylorreiter avatar Oct 12 '22 16:10 taylorreiter

This looks like some sort of weird formatting issue that affects Python's csv module but not pandas.read_csv.

[Screenshot, 2022-10-13 9:57:33 AM]

The file is in DOS format but ... weird. Nothing (vi, emacs, Mac OS Numbers) has a problem with it!

python code to reproduce:

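import csv

filename = 'cheesegenomes.lineages.csv'  # path to the lineages CSV
with open(filename, newline='') as fp:
    r = csv.reader(fp)
    for row in r:
        print(row)  # print only the header row
        break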

tl;dr open, save as CSV, try again.

ctb avatar Oct 13 '22 16:10 ctb

ya that's deeply annoying and the solution. I read it into R and wrote it out again and the problems were fixed. Doing so in vim or excel did not fix it. le sigh. thank you for your help!!!!

taylorreiter avatar Oct 13 '22 20:10 taylorreiter

leave this open and I'll add something to the error output listing the headers that WERE found...

ctb avatar Oct 13 '22 20:10 ctb

🪄 🌟 thank you!

taylorreiter avatar Oct 13 '22 20:10 taylorreiter

ah-hah! figured it out:

[Screenshot, 2022-10-13 2:53:25 PM]

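this is the "byte order mark (BOM)", which marks this file as UTF-8 encoded. The csv module reads it as part of the first header, so the column comes back as '\ufeffident' rather than 'ident'. See https://stackoverflow.com/questions/50130605/python-2-7-csv-file-read-write-xef-xbb-xbf-code.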

I'm not sure what the right move is here but at least I know what it is now!
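
One possible option, sketched here purely as an illustration and not as what sourmash actually does: Python's built-in 'utf-8-sig' codec drops a leading BOM if one is present and otherwise behaves like plain UTF-8. The read_header helper and the three-column sample file below are hypothetical, made up only to reproduce the symptom in isolation.

```python
import csv
import os
import tempfile


def read_header(path, encoding):
    """Return the first row of a CSV file opened with the given encoding.

    Hypothetical helper for illustration; not part of sourmash.
    """
    with open(path, newline='', encoding=encoding) as fp:
        return next(csv.reader(fp))


# Write a small CSV that starts with a UTF-8 BOM and uses DOS line endings,
# mimicking the attached lineages file (sample columns are an assumption).
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'\xef\xbb\xbfident,taxid,superkingdom\r\n')
    tmp.write(b'geo.pilon,1173061,Eukaryota\r\n')
    path = tmp.name

plain = read_header(path, 'utf-8')         # BOM stays attached to the first header
stripped = read_header(path, 'utf-8-sig')  # BOM is dropped if present
os.remove(path)

print(plain)     # ['\ufeffident', 'taxid', 'superkingdom']
print(stripped)  # ['ident', 'taxid', 'superkingdom']
```

Since 'utf-8-sig' also reads BOM-less UTF-8 files unchanged, using it when opening would not affect CSVs that already load fine.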

ctb avatar Oct 13 '22 21:10 ctb

PR #2333 adds the following output:

% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'

Note that the error output ("headers are") will be standard across all CSV-loading attempts; this is just an example using the tax summarize command (also new in #2333).

ctb avatar Oct 15 '22 16:10 ctb

asking question here:

https://twitter.com/ctitusbrown/status/1581666825623855104

ctb avatar Oct 16 '22 15:10 ctb

This Arrow PR adds support for BOM: https://github.com/apache/arrow/pull/11892

ctb avatar Oct 16 '22 16:10 ctb