pyfaidx icon indicating copy to clipboard operation
pyfaidx copied to clipboard

Index failes if more than one empty line

Open openpaul opened this issue 4 years ago • 3 comments

Following the issue here: https://github.com/mdshw5/pyfaidx/issues/159 I found a new bug with a corner case fasta containing two empty lines. I made a minimal (not) working example here (version 0.5.8)

import pyfaidx
import os
print(pyfaidx.__version__) # 0.5.8
fasta = "empty.fa"
fai = "empty.fa.fai"
if os.path.exists(fai):
    os.remove(fai)
    
ok_1 =     ">prot_1\n\n>prot2\n\n"
ok_2 =     ">prot_1\n\n>prot2\nATGC\n\n"
fail_1 =     ">prot_1\n\n>prot2\n\n\n"
fail_2 =     ">prot_1\n\n\n>prot2\nATCG\n"
with open(fasta, "w") as fastafile:
    fastafile.write(fail_2)
    
faa  = pyfaidx.Fasta(fasta)
for seq in faa:
    print("{}: {}".format(seq.name, seq))

In this version of the bug, adding and empty line to an already empty sequence will fail the parser.

RuntimeError                              Traceback (most recent call last)
<ipython-input-50-09c005efce04> in <module>
     14     fastafile.write(fail_2)
     15 
---> 16 faa  = pyfaidx.Fasta(fasta)
     17 for seq in faa:
     18     print("{}: {}".format(seq.name, seq))

[...]/lib/python3.7/site-packages/pyfaidx/__init__.py in __init__(self, filename, default_seq, key_function, as_raw, strict_bounds, read_ahead, mutable, split_char, filt_function, one_based_attributes, read_long_names, duplicate_action, sequence_always_upper, rebuild, build_index)
    997             sequence_always_upper=sequence_always_upper,
    998             rebuild=rebuild,
--> 999             build_index=build_index)
   1000 
   1001         _record_constructor = MutableFastaRecord if self.mutable else FastaRecord


[...]/lib/python3.7/site-packages/pyfaidx/__init__.py in __init__(self, filename, default_seq, key_function, as_raw, strict_bounds, read_ahead, mutable, split_char, duplicate_action, filt_function, one_based_attributes, read_long_names, sequence_always_upper, rebuild, build_index)
    421                             self.indexname, self.filename), RuntimeWarning)
    422                 elif build_index:
--> 423                     self.build_index()
    424                     self.read_fai()
    425                 else:


[...]/lib/python3.7/site-packages/pyfaidx/__init__.py in build_index(self)
    523                         if line[0] == '>':
    524                             valid_entry = check_bad_lines(
--> 525                                 rname, bad_lines, i - 1)
    526                             if valid_entry and i > 0:
    527                                 indexfile.write(


[...]/lib/python3.7/site-packages/pyfaidx/__init__.py in check_bad_lines(rname, bad_lines, i)
   1270     raise RuntimeError("Unhandled exception during fasta indexing at entry " + rname + \
   1271                        "Please report this issue at https://github.com/mdshw5/pyfaidx/issues " + \
-> 1272                        str(bad_lines))
   1273 
   1274 def get_valid_filename(s):

RuntimeError: Unhandled exception during fasta indexing at entry prot_1Please report this issue at https://github.com/mdshw5/pyfaidx/issues [(1, 1), (2, 1)]

Note: When debugging this it sometimes only seems to fail if no .fai file exists already.

openpaul avatar Feb 06 '20 11:02 openpaul

Thanks so much for giving me a reproducible example. I'll have to consider what the expected behavior should be. So far I've been using the behavior of samtools (htslib) as a guide. In this case samtools seems to not do the correct thing:

$ samtools --version
samtools 1.7
Using htslib 1.7
$ samtools faidx fail_2.fa  # ">prot_1\n\n\n>prot2\nATCG\n"
$ cat fail_2.fa.fai
prot_1	0	10	-1	-1
prot2	4	17	4	5

So based on the index that samtools creates, it's getting confused about how many sequence characters and total characters are on each line. However, in this case the math works out correctly for the second entry and this is either purposeful or a happy accident as everything is fine when you fetch the sequence:

$ samtools faidx empty.fa prot_1
>prot_1
$ samtools faidx empty.fa prot2
>prot2
ATCG

Let's see what happens when we use that index with pyfaidx:

>>> faa  = pyfaidx.Fasta(fasta)
>>> faa[0]
FastaRecord("prot_1")
>>> faa[1]  
FastaRecord("prot2")

So far so good. What about getting the actual sequences?

>>> faa[0][:]
>prot_1:1-1

>>> faa[1][:]
>prot2:1-4
ATCG

Other than the questionable naming of the empty sequence slice it looks like the pyfaidx sequence retrieval code handles the samtools index just fine. It's also worth noting that the correct result is returned from the FastaRecord instance, except when reconstituting the "long name", which attempts to extract the entire FASTA record name including intervening white space:

>>> str(faa[0])
''
>>> faa[0].name
'prot_1'
>>> faa[0].long_name
'prot_1\n\n'

I'm guessing samtools has similar logic, aside from the "long name" which is something specific to this package. After looking at recent samtools releases I noticed that since version 1.9 there has been more error checking around empty sequences (https://github.com/samtools/samtools/pull/834), prompted by a similar issue (https://github.com/samtools/samtools/issues/513).

TLDR: I think I need to fix the indexing code in pyfaidx to produce a similar output to samtools in these cases, and should consider adding warnings to the faidx command line script the warn of truncated or empty sequences. @openpaul does this sound reasonable to you?

mdshw5 avatar Feb 06 '20 18:02 mdshw5

I also like the description of indexable FASTA files in the samtools documentation, but it says nothing about empty lines:

In order to be indexed with samtools faidx, a FASTA file must be a text file of the form >name [description...] ATGCATGCATGCATGCATGCATGCATGCAT GCATGCATGCATGCATGCATGCATGCATGC ATGCAT >name [description...] ATGCATGCATGCAT GCATGCATGCATGC [...]

In particular, each reference sequence must be "well-formatted", i.e., all of its sequence lines must be the same length, apart from the final sequence line which may be shorter. (While this sequence line length must be the same within each sequence, it may vary between different reference sequences in the same FASTA file.)

This also means that although the FASTA file may have Unix- or Windows-style or other line termination, the newline characters present must be consistent, at least within each reference sequence.

The samtools implementation uses the first word of the ">" header line text (i.e., up to the first whitespace character, having skipped any initial whitespace after the ">") as the NAME column.

mdshw5 avatar Feb 06 '20 18:02 mdshw5

Its interesting that empty lines are not specified. I would think they should be ignored by the actual script as they are invalid characters but pyfaidx should work as expected even on these weird files. A warning would be nice I think.

Just for clarity: I do not think this is really important as obviously this is an edge case that's just caused by another bioinformatics tool producing the wrong output. Sadly I can not file a bug report with them.

openpaul avatar Feb 07 '20 13:02 openpaul