sequenceserver
FASTA file detection is unreliable
In makeblastdb.rb, we use the following:
def probably_fasta?(file)
  File.read(file, 1) == '>'
end
In rare cases, a file generated by makeblastdb (e.g. one of the index files) begins with > despite not being a FASTA file.
I think we should simplify and do this detection based on:
- common extensions for FASTA (.cdna, .pep, .cds, .fa, .fasta, .fna, ...)
- SECONDARILY check for '>'
This will also make startup with -m faster.
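A minimal sketch of one reading of this proposal: test the extension first and only read the file when the extension matches. The constant FASTA_EXTENSIONS and the helper fasta_extension? are hypothetical names, and the list holds only the extensions named above; none of this is existing sequenceserver code.

```ruby
# Hypothetical list: only the extensions named in the proposal.
FASTA_EXTENSIONS = %w[.cdna .pep .cds .fa .fasta .fna].freeze

# True if the file name ends in one of the listed extensions
# (compared case-insensitively).
def fasta_extension?(file)
  FASTA_EXTENSIONS.include?(File.extname(file).downcase)
end

# Extension check first; the file is only opened when the extension
# matches, because && short-circuits.
def probably_fasta?(file)
  fasta_extension?(file) && File.read(file, 1) == '>'
end
```

With this ordering, files whose extension is not in the list (such as the makeblastdb index files) are never opened, which is where the startup saving would come from.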
If we have to involve file extensions, why not do the detection based on the file extension alone? Does checking that a file with a .fa extension indeed begins with a > really help much?
Is the blast index file a binary file? Maybe the approach could be to check for text/non-binary files that start with >?
We initially had cases where people would put non-FASTA files in. We still see that, with people putting FASTQ files in. So the '>' check - while a crude shortcut - does retain value.
Checking only non-binary files may indeed be a relevant alternative approach.
If instead we can simply check that the file is non-binary, then we can subsequently revert to our initial test (and checking file extensions is unnecessary).
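A rough sketch of that alternative, assuming a file can be treated as binary when its first bytes contain a NUL byte. The thread does not say how to detect binary files, so this heuristic, the helper probably_binary? and the SAMPLE_SIZE constant are all assumptions; whether every makeblastdb index file contains a NUL byte that early would need checking.

```ruby
# Number of leading bytes to inspect (arbitrary, hypothetical value).
SAMPLE_SIZE = 512

# Assumed heuristic: a NUL byte in the first SAMPLE_SIZE bytes marks the
# file as binary. An empty file yields nil, hence the to_s.
def probably_binary?(file)
  sample = File.binread(file, SAMPLE_SIZE).to_s
  sample.include?("\x00")
end

# The initial '>' test, applied only to files that look like text.
def probably_fasta?(file)
  !probably_binary?(file) && File.read(file, 1) == '>'
end
```

FASTQ files still fail the '>' test because they begin with @, so the original check keeps the value described above.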