seqtk icon indicating copy to clipboard operation
seqtk copied to clipboard

rename full header

Open nwespe opened this issue 6 years ago • 10 comments

Hello,

I'd like to use seqtk rename to rename fasta headers from something like ">k141_2 flag=3 multi=4.0678 len=200" to ">sequence1". My command is: "seqtk rename input_file.fa sequence > output_file.fa". However, the program only renames the first space-delimited field of the header, so I get ">sequence1 flag=3 multi=4.0678 len=200". Is it possible for the program to replace the entire header, not just the first field?

Thanks

nwespe avatar Jun 11 '18 15:06 nwespe

It's simple, replace all the spaces with some other symbols before renaming:

$ echo -en ">k141_2 flag=3 multi=4.0678 len=200\nactg\n" | sed 's/ /_/g'
>k141_2_flag=3_multi=4.0678_len=200
actg

$ echo -en ">k141_2 flag=3 multi=4.0678 len=200\nactg\n" | sed 's/ /_/g' | seqtk rename - sequence
>sequence1
actg

shenwei356 avatar Jun 11 '18 16:06 shenwei356

Thanks for the workaround. Ideally for me, this would be incorporated into the seqtk program somehow since I'm calling it via python subprocess and would like to limit the shell piping.

nwespe avatar Jun 11 '18 21:06 nwespe

Another workaround is calling seqtk twice, it may be still faster than a python script.

$ seqtk seq -C seqs.fa > seqs.fa.tmp
# >k141_2
# actg

$ seqtk rename seqs.fa.tmp sequence > result.fa

shenwei356 avatar Jun 12 '18 09:06 shenwei356

I see - the seq -C option drops the comments in the header. So the perfect solution (for me) would be having the -C option in the rename function. I think this would make sense as a future enhancement to the rename function, since they are both operations on the headers, not the sequences.

nwespe avatar Jun 12 '18 15:06 nwespe

In view of modularization, a subcommand only does it's own task. And complex tasks can be done by piping multiple commands.

If you do want an one-command solution, here's one:

$ echo -en ">k141_2 flag=3 multi=4.0678 len=200\nactg\n" | seqkit replace -p ".+" -r "sequence{nr}"
>sequence1
actg

shenwei356 avatar Jun 15 '18 14:06 shenwei356

@nwespe instead of subprocess in python you could just rename the megahit contigs with biopython or just pure python?

Here's my pseudo-code

in= open_for_read "orig"
out = open_for_write "new"
counter=0
while (line = in.getline) 
   if line.starts_with ">"
     line = ">seq" + counter
     counter++
  end
  out.putline line
end

Or even awk or bioawk

tseemann avatar Jun 16 '18 11:06 tseemann

@tseemann Thanks for the suggestion, I ended up implementing essentially your pseudocode. Piping the subprocess calls with error handling proved to be too complicated.

nwespe avatar Jul 18 '18 22:07 nwespe

I'm wondering if there is any trick to retrieve the delimiter after the seq->name.s. I want to retrieve the full header and the name.s + comment.s isn't accurate due to the variable delimiter.

mr-eyes avatar Mar 07 '21 02:03 mr-eyes

I got it worked by removing the + 1 in the following line. But, I don't know the consequences of that edit.

https://github.com/lh3/seqtk/blob/5806831b549e2fcbe5de2f78f5deb9673b75ea9d/kseq.h#L131

mr-eyes avatar Mar 07 '21 15:03 mr-eyes

Don't know yet the consequences, but it's working fine here. https://github.com/mr-eyes/kseq/

mr-eyes avatar Mar 07 '21 18:03 mr-eyes