seqtk
seqtk copied to clipboard
Ability to exclude reads
It would be useful to have a function to exclude sequences from a file. Basically it would be the exact opposite of subseq
.
seqtk remove in.fq name.lst > out.fq
Only sequences with name NOT in name.lst are extracted.
This can be done in Unix like this:
cat in.fq | paste - - - - | grep -v -f -F name.lst | tr "\t" "\n" > out.fq
See my blog post for more details: http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html
That is a cool trick. I always seem to find new ways to use paste. But as a fix I added a fork, we will see if it gets accepted. It would be valueable, as the highest rated method is to use a QIIME script, which requires installing the whole QIIME package. I just like the ability to do the same/opposite task with the same tool:
seqtk subseq in.fa good.list
seqtk exclude in.fa bad.list
I chose exclude
over remove
since remove
is ambiguous, technically speaking subseq
is removing (it removes the reads in name.lst
from the in.fa
). Which is why I would switch the subseq
flag to extract
, as the term is more apt (the term subseq is more descriptive of splicing a read). But that changes the behavior of the original code, and most developer do not like changing already established functionality.
The traditional way in Unix to do these two types of subsets is to have a single command, but have a -v
flag to "invert" the behaviour:
seqtk subset in.fq good.list > good.fq
seqtk subset in.fq bad.list > bad.fq
And the inverse:
seqtk subset -v in.fq good.list > bad.fq
seqtk subset -v in.fq bad.list > good.fq
It would be good if samtools faidx
had a -v
option.
For fasta: I think a legal fasta is char wrapped, so this is safer until seqtk has a -v
option.
bioawk -cfastx '{printf(">%s\t%s\n", $name, $seq)}' in.fa | \
grep -v -f name.lst | tr "\t" "\n" > out.fa
This can be done in Unix like this:
cat in.fq | paste - - - - | grep -v -f -F name.lst | tr "\t" "\n" > out.fq
See my blog post for more details: http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html
Great solution! Btw I am suggesting an edit to be cat in.fq | paste - - - - | grep -v -F -f name.lst | tr "\t" "\n" > out.fq
since grep -f -F
might throw an error that -F
is treated as a file after -f
.
There is a -M
(mask) argument in seqtk seq
command, but the reads are soft masked, not being removed from the file.