bedtools icon indicating copy to clipboard operation
bedtools copied to clipboard

maskfasta mangles files with mixed line-endings

Open bartgrantham opened this issue 4 years ago • 1 comments

I noticed this when I had concatenated a fasta with DOS line endings (0x0d 0x0a) to a normal (Unix, 0x0a) style fasta file.

The Unix-style part of the file was unchanged (except for the masking), but the DOS-style had the carriage return (0x0d) characters split up and thrown in haphazardly. I'm not sure if it actually changes the sequence, but it definitely changes the whitespace within the sequence. The .dict that Picard creates shows that it has a different md5 than doing the same operation with the sequence pre-converted to Unix style, but I'm not sure if Picard includes the whitespace characters in the calculation of the md5.

For example, it turned this:

>chrA
GTCTCAGTGGAAGTGCCTGCAGTTGTGGTGGCAGACTGTTCAGTTGCACT
GTCTCAGTGGAAGTGCCTGCAGTTGTGGTGGCAGACTGTTCAGTTGCACT
GTCTCAGTGGAAGTGCCTGCAGTTGTGGTGGCAGACTGTTCAGTTGCACT
>chrB^M
GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCG^M
GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCG^M
GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCG^M

into

>chrA
GTCTCAGTGGNNNNNNNNNNAGTTGTGGTGGCAGACTGTTCAGTTGCACT
GTCTCAGTGGAAGTGCCTGCAGTTGTGGTGGCAGACTGTTCAGTTGCACT
GTCTCAGTGGAAGTGCCTGCAGTTGTGGTGGCAGACTGTTCAGTTGCACT
>chrB^M
GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAA
GGCGTTTCCGTTCTTCTTCG^MGGGCGGCGACCTCGCGGGTTTTCGCTATT
TATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCG^MGGGCGGCG
ACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTC
CGTTCTTCTTCG^M

Note that the ^M characters (0x0d) are now separated from their line-end pairs. That could be a big problem, I have no idea how other tools would react to stray carriage return chars spread throughout the input. I would guess they'd do the obvious thing and just strip all whitespace, but then again I didn't expect bedtools to have this bug, either.

Now you may be wondering: Bart, why the hell do you have MS-DOS line endings in your fasta files?!? Great question, and I asked it, too. I am working entirely within Linux, and it was a bit of a puzzle to track back to how it happened. It turns out, and I had no idea this was the case, that the NCBI search API will serve up FASTA files with MS-DOS line endings. For example:

$ wget 'https://www.ncbi.nlm.nih.gov/search/api/sequence/NC_001416.1/?report=fasta' -O - | hd | tail -3

... wget output ...

0000c300  43 54 54 54 43 43 47 47  54 47 41 54 43 43 47 41  |CTTTCCGGTGATCCGA|
0000c310  43 41 47 47 54 54 41 43  47 0d 0a 0d 0a           |CAGGTTACG....|
0000c31d

So, should maskfasta convert the line-endings to unix, or should it preserve the line endings no matter what the user provides (even if they are mixed)? I'm inclined toward the latter. AFAICT a variety of downstream tools are robust to this situation, but may be sensitive when comparing references in a byte-by-byte fashion, so it's probably best not to edit them at all except for the expected masking.

I did not test if the masking coordinates are incorrectly calculated (ie. counting a 0x0d as a sequence character), but that would seem like a possible bug as well. I also didn't test if this bug appears in other fasta commands in bedtools, I'd guess it does.

bartgrantham avatar Oct 22 '19 03:10 bartgrantham

I just wanted to point out that AFAICT, Picard's md5 calculation is ok with bedtools' usual habit of reformatting input fasta to have 50 character-wide lines, but it definitely produces a different digest with the erroneous behavior described above.

bartgrantham avatar Oct 22 '19 03:10 bartgrantham