seqtk
seqtk copied to clipboard
rename full header
Hello,
I'd like to use seqtk rename to rename fasta headers from something like ">k141_2 flag=3 multi=4.0678 len=200" to ">sequence1". My command is: "seqtk rename input_file.fa sequence > output_file.fa". However, the program only renames the first space-delimited field of the header, so I get ">sequence1 flag=3 multi=4.0678 len=200". Is it possible for the program to replace the entire header, not just the first field?
Thanks
It's simple, replace all the spaces with some other symbols before renaming:
$ echo -en ">k141_2 flag=3 multi=4.0678 len=200\nactg\n" | sed 's/ /_/g'
>k141_2_flag=3_multi=4.0678_len=200
actg
$ echo -en ">k141_2 flag=3 multi=4.0678 len=200\nactg\n" | sed 's/ /_/g' | seqtk rename - sequence
>sequence1
actg
Thanks for the workaround. Ideally for me, this would be incorporated into the seqtk program somehow since I'm calling it via python subprocess and would like to limit the shell piping.
Another workaround is calling seqtk twice, it may be still faster than a python script.
$ seqtk seq -C seqs.fa > seqs.fa.tmp
# >k141_2
# actg
$ seqtk rename seqs.fa.tmp sequence > result.fa
I see - the seq -C option drops the comments in the header. So the perfect solution (for me) would be having the -C option in the rename function. I think this would make sense as a future enhancement to the rename function, since they are both operations on the headers, not the sequences.
In view of modularization, a subcommand only does it's own task. And complex tasks can be done by piping multiple commands.
If you do want an one-command solution, here's one:
$ echo -en ">k141_2 flag=3 multi=4.0678 len=200\nactg\n" | seqkit replace -p ".+" -r "sequence{nr}"
>sequence1
actg
@nwespe instead of subprocess in python you could just rename the megahit
contigs with biopython or just pure python?
Here's my pseudo-code
in= open_for_read "orig"
out = open_for_write "new"
counter=0
while (line = in.getline)
if line.starts_with ">"
line = ">seq" + counter
counter++
end
out.putline line
end
Or even awk
or bioawk
@tseemann Thanks for the suggestion, I ended up implementing essentially your pseudocode. Piping the subprocess calls with error handling proved to be too complicated.
I'm wondering if there is any trick to retrieve the delimiter after the seq->name.s
. I want to retrieve the full header and the name.s + comment.s
isn't accurate due to the variable delimiter.
I got it worked by removing the + 1
in the following line. But, I don't know the consequences of that edit.
https://github.com/lh3/seqtk/blob/5806831b549e2fcbe5de2f78f5deb9673b75ea9d/kseq.h#L131
Don't know yet the consequences, but it's working fine here. https://github.com/mr-eyes/kseq/