seqtk
How to remove ambiguities?
I know you can do this with seqtk, but I cannot find the command in the readme file. I want to change all the ambiguous codes to "N" in my FASTA file.
Maybe seqtk seq -n N in.fasta > out.fasta??
Thanks! B
Unfortunately, seqtk doesn't have this functionality. Perhaps it is best to write a script to do this.
Something like this Perl one-liner:
perl -pe 's/[^AGTC]/N/gi unless m/>/;' old.fa > new.fa
If a line contains a >, leave it alone; otherwise replace all non-AGTC characters with N. The -p option puts an implicit "read + print lines" loop around the code, so it runs once per line of the file.
Just a note on the above comment: that Perl one-liner also replaces newline characters with Ns. The following small modification seems to fix that:
perl -pe 's/[^AGTC\n]/N/gi unless m/>/;' old.fa > new.fa
@MrOlm good point! You can also use sed:
% cat seq.fa
>good
AGTCTCTTC
>bad
AGRGAATNC
% sed '/^[^>]/ s/[^AGTC]/N/gi' < seq.fa
>good
AGTCTCTTC
>bad
AGNGAATNC
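One caveat: the i flag on the s command is a GNU sed extension and is not guaranteed to work in other sed implementations. A possible portable variant, as a sketch, lists both cases explicitly and then checks that nothing unexpected is left. The output name seq_clean.fa and the variable remaining are made up for illustration.

```shell
# Recreate the same two-record example so this block runs on its own.
printf '>good\nAGTCTCTTC\n>bad\nAGRGAATNC\n' > seq.fa

# POSIX-only variant: skip header lines and spell out both cases
# instead of relying on the case-insensitive flag.
sed '/^>/!s/[^AGTCagtc]/N/g' seq.fa > seq_clean.fa

# Sanity check: count sequence lines that still contain anything
# other than A, C, G, T or N (grep -c exits non-zero on a count of 0, hence || true).
remaining=$(grep -v '^>' seq_clean.fa | grep -c '[^ACGTNacgtn]' || true)
echo "sequence lines with unexpected characters: $remaining"
```

Like the sed command above, this only skips lines that start with >, whereas the Perl one-liners skip any line that contains a > anywhere.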