seqtk
Ability to process quality (.qual) files
It would be nice if seqtk could process quality (.qual) files. Quality files use the same layout as FASTA files, except that each record holds space-separated quality scores instead of bases. When seqtk processes a quality file, it joins the wrapped lines without any separator, so the last score on one line gets merged with the first score on the next.
[deprekate@anthill ~]$ cat seq.fna
>seq1
CCGAATGGATCATCCCGACTTTCAGGCCGGGATGGCCGGCCTGAAAGGGGACTGGGAACT
CCTCTGCCGCCCCTTGTGCGACCCGGATGCCCCGCGCGGCTGGCTGGGGGTCTGGGCGCT
[deprekate@anthill ~]$ cat seq.qual
>seq1
35 35 50 50 44 44 44 43 44 43 55 55 55 44 50 42 42 42 42 43 52 52 52 52 52
52 42 52 52 44 52 39 39 40 43 43 55 44 52 42 42 42 42 42 52 55 55 55 55 55
The sample command shows the bug: the sampled nucleotide sequence has 120 bases, while the quality output has 119 values. You can see the 5252 in the middle of the quality line, where two scores got merged.
[deprekate@anthill ~]$ seqtk sample seq.fna 1
>seq1
CCGAATGGATCATCCCGACTTTCAGGCCGGGATGGCCGGCCTGAAAGGGGACTGGGAACTCCTCTGCCGCCCCTTGTGCGACCCGGATGCCCCGCGCGGCTGGCTGGGGGTCTGGGCGCT
[deprekate@anthill ~]$ seqtk sample seq.qual 1
>seq1
35 35 50 50 44 44 44 43 44 43 55 55 55 44 50 42 42 42 42 43 52 52 52 52 5252 42 52 52 44 52 39 39 40 43 43 55 44 52 42 42 42 42 42 52 55 55 55 55 55
Would an easy fix be an option to replace newlines with spaces instead of removing them entirely?
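To make the idea concrete, here is a minimal Python sketch of joining wrapped quality lines with a space rather than with nothing. This is not seqtk code; `parse_qual` and `unwrap_qual` are hypothetical helper names, and the sketch assumes scores are whitespace-separated integers under `>` headers.

```python
def parse_qual(lines):
    """Yield (name, scores) for each record in QUAL-format lines.

    Assumes '>' header lines followed by whitespace-separated integer scores,
    possibly wrapped over several lines.
    """
    name, scores = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, scores
            fields = line[1:].split()
            name, scores = (fields[0] if fields else ""), []
        else:
            # Split each line on its own, so the last score of one line
            # can never fuse with the first score of the next.
            scores.extend(int(tok) for tok in line.split())
    if name is not None:
        yield name, scores


def unwrap_qual(lines):
    """Return the records with each record's scores on a single line."""
    return "".join(
        f">{name}\n{' '.join(str(s) for s in scores)}\n"
        for name, scores in parse_qual(lines)
    )
```

With the two wrapped lines from the example above, the boundary comes out as `52 52` instead of `5252`.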
I would just use bioperl/biopython to convert all legacy FASTA+QUAL files to FASTQ.
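For anyone who wants to see what that conversion involves, below is a rough pure-Python sketch that pairs a FASTA file with its QUAL file and emits FASTQ. It is a stand-in for illustration, not the Biopython or BioPerl route itself (Biopython provides `PairedFastaQualIterator` in `Bio.SeqIO.QualityIO` for this). The names `read_records` and `fasta_qual_to_fastq` are hypothetical, and the sketch assumes both files list the same records in the same order and that Phred+33 encoding is wanted.

```python
def read_records(lines):
    """Yield (name, body_lines) for FASTA-style input (FASTA or QUAL)."""
    name, body = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, body
            fields = line[1:].split()
            name, body = (fields[0] if fields else ""), []
        else:
            body.append(line)
    if name is not None:
        yield name, body


def fasta_qual_to_fastq(fasta_lines, qual_lines, offset=33):
    """Return FASTQ text built from paired FASTA and QUAL lines.

    Assumption: records appear in the same order in both inputs. zip() stops
    at the shorter input, so a missing trailing record is not detected here.
    """
    out = []
    pairs = zip(read_records(fasta_lines), read_records(qual_lines))
    for (seq_name, seq_body), (qual_name, qual_body) in pairs:
        if seq_name != qual_name:
            raise ValueError(f"record mismatch: {seq_name} vs {qual_name}")
        seq = "".join(seq_body)
        # Join wrapped QUAL lines with a space before splitting into scores.
        scores = [int(tok) for tok in " ".join(qual_body).split()]
        if len(seq) != len(scores):
            raise ValueError(
                f"{seq_name}: {len(seq)} bases but {len(scores)} scores"
            )
        qual = "".join(chr(score + offset) for score in scores)
        out.append(f"@{seq_name}\n{seq}\n+\n{qual}\n")
    return "".join(out)
```

The length check is the useful part for this issue: a merged score such as 5252 would leave the record one value short and raise an error instead of passing silently.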