Is there a more robust way to infer alignment file format?

Open gwarmstrong opened this issue 4 years ago • 4 comments

I am a bit wary about inferring the file format like this. Are there more rigorous checks we could do? https://github.com/qiyunzhu/woltka/blob/862cab09e3da201cf8db20b4968849950fcc0fd5/woltka/align.py#L169-L203

gwarmstrong avatar Mar 19 '20 19:03 gwarmstrong

I currently do not have a good idea. Working with those large files is inherently hard. One cannot be certain of the file format without reading the entire file. Therefore, I tend to let the user be responsible for the integrity of the input file format. The documentation states clearly which input alignment file formats are supported.

SAM and BLAST are two very common formats, and one could write (or adopt, if one is available off the shelf) a parser that checks every column of a line. The challenge, however, is that many programs do NOT follow the strict format criteria. For example, BURST tends to append an extra taxonomic string after the last column of a BLAST line. Therefore, the format definition has to be loose.

qiyunzhu avatar Mar 19 '20 20:03 qiyunzhu
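To make the "loose definition" idea above concrete, here is a minimal sketch of a per-line check that validates only the columns that are reliably numeric and tolerates extra trailing columns (such as the taxonomy string BURST appends). It is not Woltka's actual code: the function name `infer_format_loosely` and the returned labels are hypothetical, chosen for illustration only.

```python
def infer_format_loosely(line):
    """Guess the alignment format from a single line.

    Hypothetical helper, not part of Woltka. It checks only a few
    numeric columns so that extra trailing columns do not break it.
    """
    if line.startswith('@'):
        # SAM header lines begin with '@'
        return 'sam'
    fields = line.rstrip('\r\n').split('\t')
    n = len(fields)
    if n == 2:
        # simple two-column map: query <tab> subject
        return 'map'
    # BLAST tabular: 12 standard columns, of which columns 4-10
    # (length, mismatch, gapopen, qstart, qend, sstart, send) are integers
    if n >= 12 and all(f.isdigit() for f in fields[3:10]):
        return 'b6o'
    # SAM alignment: 11 mandatory columns, of which FLAG, POS and MAPQ
    # (columns 2, 4 and 5) are integers
    if n >= 11 and all(f.isdigit() for f in (fields[1], fields[3], fields[4])):
        return 'sam'
    raise ValueError('Cannot determine alignment file format.')
```

Checking several lines instead of only the first would raise confidence further, at the cost of reading more of the file. No check short of a full parse can guarantee that the whole file is well-formed.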

I'm still getting ValueError: Cannot determine alignment file format. when using the centrifuge test data directory. Is something wrong with the first line of the test file? It also does not work on my own centrifuge output (which appears to have headers [first line] identical to those of the test centrifuge files).

Thanks!

raplayer avatar Jun 08 '21 19:06 raplayer

welp, it looks like it is expecting a 2 column input, not raw centrifuge as input... interesting

raplayer avatar Jun 08 '21 20:06 raplayer

Hello @raplayer, thanks for reporting this issue. As you have figured out, Centrifuge raw output is not currently supported as an alignment file format for Woltka. In fact, Woltka was originally designed to work with Centrifuge outputs. You can still find the relevant code in Woltka, but we left it out of the formal release because we haven't tested it rigorously.

qiyunzhu avatar Jun 22 '21 17:06 qiyunzhu
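For anyone hitting the same error, one possible workaround is to reduce the raw Centrifuge table to the two-column query-to-subject layout mentioned above. The following is a rough sketch under assumptions, not a tested or supported path: the function name `centrifuge_to_map` is hypothetical, the column names `readID` and `seqID` are taken from Centrifuge's standard output header, and whether the resulting subject IDs match your Woltka mapping files has to be verified separately.

```python
import csv
import io


def centrifuge_to_map(src, dst, subject_col='seqID'):
    """Write 'readID <tab> subject' lines from a raw Centrifuge table.

    Hypothetical helper, not part of Woltka. `src` and `dst` are open
    text file handles; `subject_col` names the Centrifuge column to use
    as the subject (for example 'seqID' or 'taxID').
    Returns the number of lines written.
    """
    reader = csv.DictReader(src, delimiter='\t')  # first row is the header
    n = 0
    for row in reader:
        dst.write(f"{row['readID']}\t{row[subject_col]}\n")
        n += 1
    return n


# Usage with in-memory data (made-up example row)
raw = io.StringIO(
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches\n"
    "r1\tNC_000913.3\t562\t4225\t0\t80\t80\t1\n"
)
out = io.StringIO()
centrifuge_to_map(raw, out)
```

A read with multiple hits occupies several rows in Centrifuge output, so the same read ID can appear on more than one line of the result.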