woltka
Is there a more robust way to infer alignment file format?
I am a bit wary about inferring the file format like this. Are there more rigorous checks we could do? https://github.com/qiyunzhu/woltka/blob/862cab09e3da201cf8db20b4968849950fcc0fd5/woltka/align.py#L169-L203
I currently do not have a good idea. Working with those large files is naturally hard. One cannot be certain of the file format without reading the entire file. Therefore, I tend to let the user be responsible for the integrity of the input file format. In the documentation I state clearly what the options for input alignment files are.
SAM and BLAST are two very common formats, and one could write (or adopt, if one is available off the shelf) a parser that checks every column of a line. The challenge, however, is that many programs do NOT follow the strict format criteria. For example, BURST tends to add an extra taxonomic string after the last column of a BLAST line. Therefore, the definition has to be loose.
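To make the idea of a "loose" definition concrete, here is a minimal sketch of how a format could be guessed from a single line. It is an illustration only, not necessarily what Woltka does: the function name `guess_align_format` and the format labels `'map'`, `'b6o'` and `'sam'` are assumptions made for this example. It checks only the column count and whether a few key columns are integers, so extra trailing columns (such as BURST's taxonomic string) are tolerated.

```python
def guess_align_format(line):
    """Guess the alignment format from one line of a file.

    Hypothetical helper for illustration; the checks are deliberately loose.
    """
    # SAM header lines start with "@"
    if line.startswith('@'):
        return 'sam'
    row = line.rstrip('\r\n').split('\t')
    # simple two-column query-to-subject map
    if len(row) == 2:
        return 'map'
    # BLAST tabular: 12 standard columns, of which columns 4-10
    # (length, mismatch, gapopen, qstart, qend, sstart, send) are integers;
    # ">= 12" tolerates extra columns appended by some programs
    if len(row) >= 12 and all(row[i].isdigit() for i in range(3, 10)):
        return 'b6o'
    # SAM alignment line: 11 mandatory columns, of which
    # FLAG, POS and MAPQ (columns 2, 4, 5) are integers
    if len(row) >= 11 and all(row[i].isdigit() for i in (1, 3, 4)):
        return 'sam'
    raise ValueError('Cannot determine alignment file format.')
```

A check like this is cheap because it reads one line, but it cannot guarantee that the rest of the file is well-formed, which is the trade-off described above.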
I'm still getting "ValueError: Cannot determine alignment file format." when using the centrifuge test data directory. Is something wrong with the first line of the test file? It also does not work on my own centrifuge output (which appears to have headers [first line] identical to those of the test centrifuge files).
Thanks!
welp, it looks like it is expecting a 2 column input, not raw centrifuge as input... interesting
Hello @raplayer, thanks for reporting this issue. I believe you have figured out that Centrifuge raw output is not currently supported as an alignment file format for Woltka. Actually, in the very beginning, Woltka was designed to work with Centrifuge outputs. You can still find the relevant code in Woltka, but we left it out of the formal release because we haven't tested it rigorously.