phyx
Request to add feature to generate consensus sequences
Hi,
I would like to request two more features for the pxconsq function in phyx:
- A flag (or option) that lets the user choose the preferred gap symbol, such as "-" or "N", or produce no gaps at all in the consensus sequence (like the strict major rule consensus in Geneious; see the examples below).
- A user-defined consensus threshold. In an alignment, the consensus base called for a given column depends on this threshold value. See explained here.
I think these features would be useful when assembling target enrichment data, where people want a consensus sequence for each gene to use as a reference. With them, the direct pxconsq output would meet that need (the current output contains too many Ns).
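To make the two requested options concrete, here is a minimal Python sketch of a per-column consensus with a user-defined threshold and a configurable gap symbol. It is only an illustration under stated assumptions, not how pxconsq or Geneious actually work: the function name `consensus`, its parameters, the rule of ignoring gaps when counting, and the rule of including tied bases are all hypothetical choices.

```python
from collections import Counter

# IUPAC nucleotide codes keyed by the set of bases each one represents.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(seqs, threshold=0.5, gap_symbol="-"):
    """Hypothetical threshold consensus (an assumption, not pxconsq's algorithm).

    Per column, gaps and non-ACGT characters are ignored. Bases are added from
    most to least frequent until their combined share of the counted bases
    reaches `threshold`; bases tied with the last one added are included too.
    Columns with no counted bases emit `gap_symbol` (use "" to drop them).
    """
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    if not seqs:
        return ""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    out = []
    for column in zip(*seqs):
        counts = Counter(b for b in (c.upper() for c in column) if b in "ACGT")
        total = sum(counts.values())
        if total == 0:
            out.append(gap_symbol)  # all-gap column
            continue
        chosen, covered, last = set(), 0, None
        for base, n in counts.most_common():
            # Stop once the threshold is met, unless this base ties the last one.
            if covered / total >= threshold and n != last:
                break
            chosen.add(base)
            covered += n
            last = n
        out.append(IUPAC[frozenset(chosen)])
    return "".join(out)


if __name__ == "__main__":
    toy = ["AAC-", "AGC-", "ATT-", "A-G-"]  # made-up alignment, not g4471
    print(consensus(toy))                   # default: 50% threshold, "-" for gaps
    print(consensus(toy, gap_symbol="N"))   # mark all-gap columns with N
    print(consensus(toy, gap_symbol=""))    # drop all-gap columns entirely
    print(consensus(toy, threshold=1.0))    # strict: every observed base counts
```

Under these assumptions, a lower threshold produces fewer ambiguity codes, and setting `gap_symbol=""` gives an ungapped consensus similar in spirit to the Geneious output shown further down.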
As an example, I used one gene, "g4471", from Angiosperms353_targetSequences.fasta to generate a strict consensus sequence.
- The alignment looks like this (I added quotes "" to escape markdown formatting):
cat genes_mafft/g353_alignment/g4471_mafft.fasta
">AJFN-4471" "aatgttatacaggatgaagagaaactgaatactgcaaactccgattggatgcggaaatac aaaggctcaagtaagcttatgctccaacctaggagcaccgaggaggtttcacagatactt aaatattgtaattcgagacatcttgctgttgtcgtatgcgaagcaggatgcatattggaa aacttgatttcattcctagataatgaaggatttattatgccgttagatttgggtgcaaaa gggagttgtcaaattggtggaaatgtttcaacaaatgctgggggtttgcgccttgtccgt tatggatcacttcacgggaacgtacttggtctcgaagctgtttta---gcaaatggtact gttgttgacatgcttgggactttacgaaaagataatactgggtatgacctgaagcacttg tttataggaagtgaaggatctttgggattgataactaagatttccatacttacccctcca aagttatcttcagtaaatctagcttttcttgcttgtaaagattattacagttgccagaaa cttctatttgaagccaagaggaaacttggggaaattttgtctgcatttgagtttctggat gctcaatcactggatctggtcctgaaacatctagaaggtgctcggaatccattacctccc tcac---tacacaacttctatattctgattgagacaacaggcagtgatga------atct aatgac------------------------------------------------------ "------------------------------------------------------------" ...SKIP... ">TVSH-4471" "------------------------------------------------------------" "---------------------------------------------gtttctcagattctt" "aaatattgtaactccagaaacttggctgttgttgtatgtgaagctgggtgcatattggaa aatataatgtcattcctggacaatgaaggatttattatgccactagacttaggtgcaaaa gggagttgccagattggtggaaatgtttcaactaatgctggaggtttgcgtcttgttcgc tatggatcgcttcatggaagtgtacttggtatggaagctgttcta---gcagatggtact gtacttgacatgcttaagaccttgcgcaaagataatactggctatgatttgaaacatctg tttataggaagtgaaggttccttgggcattgttactaagatttcaatacttaccccacca aagttgtcttcagtaaatgtggcttttcttgcttgcaaagactatatcagctgccagaaa ttgctgcaggaggcaaaaaggaagcttggggagattttatctgcatttgaatttatggat gtccagtctatgaatttggttttaaaacacatggaaggtgcacgaaatccacta---cca tcat---tgcataacttttatgttttgattgagacaacaggcagtgatga------atct tctgacaaacaaaaactggaagcatttcttcttggctccatggagaatgaattgatatct gatggtgttcttgcacaagacataaaccaagcatcatctttttggcttctacgtgagggt" ">VUSY-4471 aaagtaattcaggatgaagagagactgcttactgcaaatatggattggatgcggaaatac aaaggctcaagtaagcttctgctccaacctaggagcactgaggaggtttcgcagattctt aaatactgtaattccagatgcctggctgttgttgtatgtgaggcaggatgcatattggaa 
aacctggtttctttccttgataatgaaggatttatcatgccactagacttgggtgcaaaa ggaagctgccaaattggtggaaatgtctcaactaatgctggtgggttgcgcttggtccgt tatggatcacttcatgggaatgtacttggtcttgaagctgtttta---gcaaatggtacc gtgcttgacattcttggaactttacgcaaagacaatactggatatgacttaaagcatttg tttataggaagtgaaggatccttgggaattgtgactaaggtctccatacttacccctccg aagctatcatcggtgaatctagcttttcttgcttgtaaagattatttcagctgccagaat cttctattggaagccaagaggaagcttggggaaattctatctgcatttgaatttttggat agccactcaatggatctggttctgaatcatctagaaggtgctcgaaatccattacctccc tcaa---tgcacaacttttatgttctgattgagacaacagggagtgatga------atcc tatgacaaagagaagcttgaggccttcctacttcattcaatggaaggtggtttgatatct gatggtgttcttgcacaagacataaatcaagcatcatcattttggcggattcgtgaggga" ">XFJG-4471 aatgttattcaagatgaagataggttgctggctgcaaatgtggattggatggggaaatat aaaggttctagccagcttttgctcttgccaaaaactactgaagaggtgtctaaaattctc caatactgcaattccaggcgcttggctgttgtcatttgcgaagctgg---------tgac aacctaaattcattcttagcaaatgaagggtttataatgccacttgatttgggagcaaaa ggaagctgtcaaattggtggaaacatatcaacaaatgctggaggtttgcacttcatacgt tacggatcactgcatggaaatattcttggccttgaagttgtctta---gctaatggaact gttcttgatatgcttactactttacgtaaagacaatacaggatatgacttgaagcattta ttcattggaagtgaaggtacattgggcattgtcacgaaggtctcaatactcacgcctcct aagctagtatcaaataacatcgcgtttcttgcttgtaaagacttttcaagttgtcagaaa ttactattggaggccaagagaggcttaggcgatgttatttctgcatttgaatttatggat agccattctatggatatggttttaaatcacttagagggcgtccgcaaccctttacctcca tcat---tatacaatttttatgttcttattgagacaaccagtagcgatga------atca tatgacaaagctaagcttgaagccttcttgttaagttacatggaagatggtctcatatca gatggtgttatagctcaggacatgaaccaagcttcttctttttggcgaatccgcgagggt"
- If I use pxconsq, the output consensus looks like this:
pxconsq -s genes_mafft/g353_alignment/g4471_mafft.fasta
">consensus NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGSRAVYRTNYTNGGYMTNGARGYWGTYHYRNNNSCHRAYGGNRHNVTNVTBGAYATKNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNKSHHTKRMHNTGGYHHTRHVHHAHHTRGANGGHSYNMRNRAYCCHBTRNNNBYHKYRNNNNNNNNNAAHTTYTATRTYBTRATYGAGACVACNNNNRGYRVHGANNNNNNNWCNHHTGAYNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN"
- By contrast, the 50% Major Rule consensus sequence generated by Geneious looks like this:
cat g353_consensus/g4471.fasta
">g4471_mafft_consensus_sequence AATGTRATTCAGGATGAAGABAGACTGBHDRCTGCAAATACRGATTGGATGCGTAAATACAAAGGCTCAAGTAAGCTTYTGCTCCAACCTAGGAGCACTGARGAGGTTTCTCAGATTCTTAAATACTGTAATTCYAGACGCTTGGCTGTTGTTGTATGTGAAGCAGGATGCATATTGGAAAATYTGGTTTCTTTCCTGGAYAATSAAGGATTTATTATGCCACTDGACTTRGGTGCAAAAGGAAGCTGCCAAATTGGTGGAAATGTTTCAACTAATGCTGGTGGTTTGCGCYTTGTCCGTTATGGATCACTTCATGGAAATGTACTTGGTCTTGAAGCTGTTTTAGCAAATGGTACTGTGCTTGACATGCTTGGGACTTTACGYAAAGATAATACTGGRTATGACTTGAAGCATTTGTTTATAGGAAGTGAAGGATCMTTGGGAATTGTMACTAAGGTTTCMATACTTACYCCTCCRAAGCTATCTTCAGTWAATSTWGCTTTTCTTGCWTGTAAAGATTATTTCAGCTGCCAGAAACTTCTATTGGAAGCCAAGAGGAARCTTGGRGAGATTCTMTCTGCATTTGAATTTTTGGATARCCADTCAATGGATYTGGTTCTGAATCATTTAGAAGGTGTTCGRAATCCATTACCTCCMTCAMTGCACAACTTTTATGTTCTGATTGAGACAACAGGCAGTGATGAATCTTATGACAAAGAGAAGCTTGAAGCYTTCCTACTTCGCTCAATGGAAGGTGGTTTGATATCTGATGGTGTTATTGCACAAGACATAAACCAAGCATCATCATTTTGGCGWATWCGTGAGGGT"
Please let me know if you have questions.
Thanks!
Miao
Kewl. Thanks for putting this together.
Hey @Cactusolo can you send me the complete file so I can match these expectations exactly? phylo dot jwb at gmail dot com
@josephwb done.