seqtk icon indicating copy to clipboard operation
seqtk copied to clipboard

Ability to exclude reads

Open deprekate opened this issue 9 years ago • 6 comments

It would be useful to have a function to exclude sequences from a file. Basically it would be the exact opposite of subseq.

seqtk remove in.fq name.lst > out.fq

Only sequences with name NOT in name.lst are extracted.

deprekate avatar Aug 13 '15 01:08 deprekate

This can be done in Unix like this:

cat in.fq | paste - - - - | grep -v -f -F name.lst | tr "\t" "\n" > out.fq

See my blog post for more details: http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html

tseemann avatar Feb 13 '16 06:02 tseemann

That is a cool trick. I always seem to find new ways to use paste. But as a fix I added a fork, we will see if it gets accepted. It would be valueable, as the highest rated method is to use a QIIME script, which requires installing the whole QIIME package. I just like the ability to do the same/opposite task with the same tool:

seqtk subseq in.fa good.list
seqtk exclude in.fa bad.list

I chose exclude over remove since remove is ambiguous, technically speaking subseq is removing (it removes the reads in name.lst from the in.fa). Which is why I would switch the subseq flag to extract, as the term is more apt (the term subseq is more descriptive of splicing a read). But that changes the behavior of the original code, and most developer do not like changing already established functionality.

deprekate avatar Feb 15 '16 22:02 deprekate

The traditional way in Unix to do these two types of subsets is to have a single command, but have a -v flag to "invert" the behaviour:

seqtk subset in.fq good.list > good.fq
seqtk subset in.fq bad.list > bad.fq

And the inverse:

seqtk subset -v in.fq good.list > bad.fq
seqtk subset -v in.fq bad.list > good.fq

It would be good if samtools faidx had a -v option.

tseemann avatar Apr 05 '16 22:04 tseemann

For fasta: I think a legal fasta is char wrapped, so this is safer until seqtk has a -v option.

bioawk -cfastx '{printf(">%s\t%s\n", $name, $seq)}' in.fa | \
  grep -v -f name.lst | tr "\t" "\n" > out.fa

conchoecia avatar Sep 20 '18 13:09 conchoecia

This can be done in Unix like this:

cat in.fq | paste - - - - | grep -v -f -F name.lst | tr "\t" "\n" > out.fq

See my blog post for more details: http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html

Great solution! Btw I am suggesting an edit to be cat in.fq | paste - - - - | grep -v -F -f name.lst | tr "\t" "\n" > out.fq since grep -f -F might throw an error that -F is treated as a file after -f.

edwwlui avatar Nov 29 '20 07:11 edwwlui

There is a -M (mask) argument in seqtk seq command, but the reads are soft masked, not being removed from the file.

y9c avatar Jun 23 '21 15:06 y9c