FASTA file detection is unreliable

Open yannickwurm opened this issue 3 years ago • 3 comments

In makeblastdb.rb, we use the following:

def probably_fasta?(file)
  File.read(file, 1) == '>'
end

In rare cases, a file generated by makeblastdb (e.g., one of the index files) begins with > despite not being a FASTA file.

I think we should simplify and do this detection based on:

  • common extensions for FASTA (.cdna, .pep, .cds, .fa, .fasta, .fna ...)
  • secondarily, a check for '>'

This will also make startup with -m faster.

yannickwurm avatar Dec 17 '21 09:12 yannickwurm
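
A minimal sketch of the extension-first proposal above, not the project's actual implementation. The constant name FASTA_EXTENSIONS is hypothetical, and the list holds only the extensions named in the issue (the original list is open-ended):

```ruby
# Hypothetical constant: only the extensions named in the issue.
FASTA_EXTENSIONS = %w[.cdna .pep .cds .fa .fasta .fna].freeze

def probably_fasta?(file)
  # Cheap check first: reject files whose extension is not in the list,
  # without opening them.
  return false unless FASTA_EXTENSIONS.include?(File.extname(file).downcase)

  # Secondary check: the first byte must be '>'.
  # File.read returns nil for an empty file, so the comparison is false.
  File.read(file, 1) == '>'
end
```

With this ordering, an index file such as a hypothetical seqs.nhr is rejected on its extension alone, while a FASTQ file misnamed reads.fa is still rejected by the '>' check.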

If we have to involve file extensions, why not do the detection based on the file extension alone? Does checking that a file with .fa extension indeed begins with a > really help much?

Is the blast index file a binary file? Maybe the approach could be to check for text/non-binary files that start with >?

yeban avatar Dec 21 '21 21:12 yeban

We initially had cases where people would put non-FASTA files in, and we still see people putting FASTQ files in. So the '>' check, while a crude shortcut, does retain value.

Checking only non-binary files may indeed be a relevant alternative approach.

yannickwurm avatar Dec 22 '21 11:12 yannickwurm

If instead we can simply check that the file is non-binary, then we can revert to our initial test, and checking file extensions becomes unnecessary.

yannickwurm avatar Dec 22 '21 19:12 yannickwurm
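
A rough sketch of the non-binary check discussed in the last comments, under stated assumptions. The helper name probably_text? and the 1024-byte sample size are hypothetical. The heuristic treats any file with a NUL byte in its first kilobyte as binary; whether makeblastdb index files always contain one there has not been verified here:

```ruby
# Hypothetical helper: treat a file as text if its first `sample_size`
# bytes contain no NUL byte (a common, imperfect heuristic).
def probably_text?(file, sample_size = 1024)
  sample = File.binread(file, sample_size)
  # An empty file yields nil (or an empty string); treat it as not text.
  return false if sample.nil? || sample.empty?

  !sample.include?("\0")
end

def probably_fasta?(file)
  # Keep the original first-character test, but only for non-binary files.
  probably_text?(file) && File.read(file, 1) == '>'
end
```

This keeps the '>' test that filters out FASTQ and other non-FASTA text files, and needs no list of extensions. The cost is reading up to a kilobyte from each candidate file instead of a single byte.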