How to remove ambiguities?

Open Biomicrogen opened this issue 9 years ago • 4 comments

I know you can do this with seqtk, but I cannot find the command in the README file. I want to change all the ambiguity codes to "N" in my FASTA file.

Maybe seqtk seq -n N in.fasta > out.fasta??

Thanks! B

Biomicrogen, Sep 10 '15 18:09

Unfortunately, seqtk doesn't have this functionality. Perhaps it is best to write a script to do this.

lh3, Sep 10 '15 19:09

Something like this Perl one-liner:

perl -pe 's/[^AGTC]/N/gi unless m/>/;'  old.fa > new.fa

If a line contains a >, leave it alone; otherwise replace every character other than A, G, T or C (upper or lower case) with N.

The -p option wraps the code in an implicit loop that reads each input line, runs the code on it, and prints the result.

tseemann, Feb 13 '16 06:02

Just a note on the above comment: that Perl one-liner also replaces newline characters with Ns, because [^AGTC] matches the newline at the end of each sequence line. The following small modification seems to fix that:

perl -pe 's/[^AGTC\n]/N/gi unless m/>/;' old.fa > new.fa

MrOlm, Sep 15 '17 16:09

@MrOlm good point! You can also use sed, which strips the trailing newline before matching, so it does not need the \n exception:

% cat seq.fa
>good
AGTCTCTTC
>bad
AGRGAATNC

% sed '/^[^>]/ s/[^AGTC]/N/gi' < seq.fa
>good
AGTCTCTTC
>bad
AGNGAATNC

tseemann, Sep 16 '17 00:09
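
For anyone who prefers a standalone script, as suggested earlier in the thread, here is a minimal Python sketch of the same idea. It is not part of seqtk. The function name mask_ambiguous and the file name mask_fasta.py are made up for illustration. It assumes that header lines start with > and that everything other than A, C, G or T (either case) on a sequence line should become N, which, like the one-liners above, also turns gap characters such as - into N.

```python
import re
import sys

# Assumption: anything other than A, C, G, T (either case) counts as ambiguous.
_AMBIGUOUS = re.compile(r"[^ACGTacgt]")


def mask_ambiguous(lines):
    """Yield FASTA lines with non-ACGT characters on sequence lines replaced by N."""
    for line in lines:
        if line.startswith(">"):
            # Header line: pass through untouched.
            yield line
            continue
        # Split off the line ending so it is never turned into N.
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        yield _AMBIGUOUS.sub("N", body) + ending


# Hypothetical command-line use: python mask_fasta.py old.fa new.fa
if __name__ == "__main__" and len(sys.argv) == 3:
    # newline="" keeps the original line endings (LF or CRLF) as they are.
    with open(sys.argv[1], newline="") as src, \
            open(sys.argv[2], "w", newline="") as dst:
        dst.writelines(mask_ambiguous(src))
```

Unlike the Perl version, which skips any line containing a > anywhere, this sketch only treats a line as a header if it starts with >, matching the sed command.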