seqkit icon indicating copy to clipboard operation
seqkit copied to clipboard

seqkit amplicon only keeps one amplicon

Open mlosilla opened this issue 4 years ago • 5 comments

Hi,

In version 0.15.0 seqkit amplicon only keeps one amplicon (the largest) per primer pair. I don't always care about the largest amplicon.

Would it be possible to implement the feature of keeping all valid amplicons? A bed file with the start and end bases (columns 2 and 3) of all valid pairs would be fantastic. That would permit me to inspect the amplicons and keep the one I want (in my case, the smallest amplicon over a certain threshold).

Thanks

mlosilla avatar Feb 22 '21 23:02 mlosilla

Right, PCR could produce all combinations of the forward and backward primers. We should output them too.

shenwei356 avatar Feb 23 '21 00:02 shenwei356

yeah exactly.

In my case, the positions in the bed file would be enough. I would choose the amplicon I want and use the positions as trimming points for my fastq reads--but the sequences of the PCR products may be useful for other users.

Thank you for considering it!

mlosilla avatar Feb 23 '21 14:02 mlosilla

In my case, the positions in the bed file would be enough.

Oh, you can use seqkit loate first, which outputs BED format.

shenwei356 avatar Mar 01 '21 03:03 shenwei356

期待这个功能早日上线

zjhzxjm avatar Nov 03 '22 01:11 zjhzxjm

Hi, I have the same issue here. Outputting the largest amplicon only was misleading in my case, as I was looking for an amplicon within repititive RNA genes of an eukaryotic genome. Instead of predicting a correct PCR amplicon of size around 400bp, seqkit amplicon produced an amplicon of ~50 Kb which does not make any sense in my case.

As a solution, I see that outputting all valid amplicons is the best option, or you may add an option expected_amplicon_size to give this length a priority or to make an upper limit for the amplicon prediction.

Here is a toy example

echo -ne ">seq\nAGTACCTTGGTAGGAGTTTCCTGCTAATGATAAGAATGATATTGGACTAAGTAATGTTGCAAATATAGAAACTGAT\n" | seqkit amplicon -F GGTAGG -R ATCAG
echo  "AGTACCTTGGTAGGAGTTTCCTGCTAATGATAAGAATGATATTGGACTAAGTAATGTTGCAAATATAGAAACTGAT" > seq 
echo -ne ">seq\n" > multi_seq.fa 
for seq in {1..10}; do cat seq >> multi_seq.fa; done 
cat multi_seq.fa | seqkit amplicon -F GGTAGG -R ATCAG | seqkit stats 
  • just seen that others have raised the same issue as well #384

MostafaYA avatar Apr 03 '24 12:04 MostafaYA