
Request to add feature to generate consensus sequences

Open Cactusolo opened this issue 5 years ago • 3 comments

Hi,

I would like to request more features for the pxconsq function in phyx:

  1. A flag (or option) that lets the user choose the preferred symbol for gaps, such as "-" or "N", or emit no gap characters in the consensus sequence at all (like the strict majority-rule consensus in Geneious; see the examples below).

  2. A user-defined consensus threshold. For example, in an alignment, the consensus base in a given column depends on a threshold value, as explained here.
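As a rough illustration of what these two options could mean, here is a minimal Python sketch. It is not how pxconsq or Geneious is implemented; the function name `threshold_consensus` and its `threshold` and `gap_symbol` parameters are hypothetical. It assumes gaps and non-ACGT characters are ignored when counting, and that bases are added in order of decreasing frequency until the threshold fraction is reached.

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def threshold_consensus(seqs, threshold=0.5, gap_symbol="-"):
    """Hypothetical threshold consensus over equal-length aligned sequences.

    gap_symbol is written for columns with no A/C/G/T at all;
    pass "" to drop such columns from the output.
    """
    if not seqs or len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    out = []
    for column in zip(*seqs):
        # Count only unambiguous bases; gaps and other symbols are skipped.
        counts = Counter(b for b in (c.upper() for c in column) if b in "ACGT")
        total = sum(counts.values())
        if total == 0:
            out.append(gap_symbol)
            continue
        ranked = counts.most_common()
        chosen, covered = set(), 0
        for i, (base, n) in enumerate(ranked):
            chosen.add(base)
            covered += n
            # Keep going if the next base is tied with this one, so that
            # equally frequent bases are always included together.
            tie = i + 1 < len(ranked) and ranked[i + 1][1] == n
            if covered / total >= threshold and not tie:
                break
        out.append(IUPAC[frozenset(chosen)])
    return "".join(out)


aln = ["ACGT-", "ACGA-", "AC-A-", "ATGC-"]
print(threshold_consensus(aln))                  # default: 50% threshold, "-" for gaps
print(threshold_consensus(aln, gap_symbol="N"))  # gap-only columns written as N
print(threshold_consensus(aln, gap_symbol=""))   # gap-only columns dropped
print(threshold_consensus(aln, threshold=1.0))   # every observed base must be covered
```

With a threshold of 1.0, any column with more than one observed base becomes an ambiguity code, while lower thresholds let the most frequent base win.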

I think these features would be useful when assembling target enrichment data, where people want a consensus sequence for each gene as a reference. With them, the direct pxconsq output would meet that need (the current output contains too many Ns).

To illustrate, I used one gene, "g4471", from Angiosperms353_targetSequences.fasta to generate a strict consensus sequence.

  • The alignment looks like this (I added quotes "" to escape markdown formatting):

cat genes_mafft/g353_alignment/g4471_mafft.fasta

">AJFN-4471" "aatgttatacaggatgaagagaaactgaatactgcaaactccgattggatgcggaaatac aaaggctcaagtaagcttatgctccaacctaggagcaccgaggaggtttcacagatactt aaatattgtaattcgagacatcttgctgttgtcgtatgcgaagcaggatgcatattggaa aacttgatttcattcctagataatgaaggatttattatgccgttagatttgggtgcaaaa gggagttgtcaaattggtggaaatgtttcaacaaatgctgggggtttgcgccttgtccgt tatggatcacttcacgggaacgtacttggtctcgaagctgtttta---gcaaatggtact gttgttgacatgcttgggactttacgaaaagataatactgggtatgacctgaagcacttg tttataggaagtgaaggatctttgggattgataactaagatttccatacttacccctcca aagttatcttcagtaaatctagcttttcttgcttgtaaagattattacagttgccagaaa cttctatttgaagccaagaggaaacttggggaaattttgtctgcatttgagtttctggat gctcaatcactggatctggtcctgaaacatctagaaggtgctcggaatccattacctccc tcac---tacacaacttctatattctgattgagacaacaggcagtgatga------atct aatgac------------------------------------------------------ "------------------------------------------------------------" ...SKIP... ">TVSH-4471" "------------------------------------------------------------" "---------------------------------------------gtttctcagattctt" "aaatattgtaactccagaaacttggctgttgttgtatgtgaagctgggtgcatattggaa aatataatgtcattcctggacaatgaaggatttattatgccactagacttaggtgcaaaa gggagttgccagattggtggaaatgtttcaactaatgctggaggtttgcgtcttgttcgc tatggatcgcttcatggaagtgtacttggtatggaagctgttcta---gcagatggtact gtacttgacatgcttaagaccttgcgcaaagataatactggctatgatttgaaacatctg tttataggaagtgaaggttccttgggcattgttactaagatttcaatacttaccccacca aagttgtcttcagtaaatgtggcttttcttgcttgcaaagactatatcagctgccagaaa ttgctgcaggaggcaaaaaggaagcttggggagattttatctgcatttgaatttatggat gtccagtctatgaatttggttttaaaacacatggaaggtgcacgaaatccacta---cca tcat---tgcataacttttatgttttgattgagacaacaggcagtgatga------atct tctgacaaacaaaaactggaagcatttcttcttggctccatggagaatgaattgatatct gatggtgttcttgcacaagacataaaccaagcatcatctttttggcttctacgtgagggt" ">VUSY-4471 aaagtaattcaggatgaagagagactgcttactgcaaatatggattggatgcggaaatac aaaggctcaagtaagcttctgctccaacctaggagcactgaggaggtttcgcagattctt aaatactgtaattccagatgcctggctgttgttgtatgtgaggcaggatgcatattggaa 
aacctggtttctttccttgataatgaaggatttatcatgccactagacttgggtgcaaaa ggaagctgccaaattggtggaaatgtctcaactaatgctggtgggttgcgcttggtccgt tatggatcacttcatgggaatgtacttggtcttgaagctgtttta---gcaaatggtacc gtgcttgacattcttggaactttacgcaaagacaatactggatatgacttaaagcatttg tttataggaagtgaaggatccttgggaattgtgactaaggtctccatacttacccctccg aagctatcatcggtgaatctagcttttcttgcttgtaaagattatttcagctgccagaat cttctattggaagccaagaggaagcttggggaaattctatctgcatttgaatttttggat agccactcaatggatctggttctgaatcatctagaaggtgctcgaaatccattacctccc tcaa---tgcacaacttttatgttctgattgagacaacagggagtgatga------atcc tatgacaaagagaagcttgaggccttcctacttcattcaatggaaggtggtttgatatct gatggtgttcttgcacaagacataaatcaagcatcatcattttggcggattcgtgaggga" ">XFJG-4471 aatgttattcaagatgaagataggttgctggctgcaaatgtggattggatggggaaatat aaaggttctagccagcttttgctcttgccaaaaactactgaagaggtgtctaaaattctc caatactgcaattccaggcgcttggctgttgtcatttgcgaagctgg---------tgac aacctaaattcattcttagcaaatgaagggtttataatgccacttgatttgggagcaaaa ggaagctgtcaaattggtggaaacatatcaacaaatgctggaggtttgcacttcatacgt tacggatcactgcatggaaatattcttggccttgaagttgtctta---gctaatggaact gttcttgatatgcttactactttacgtaaagacaatacaggatatgacttgaagcattta ttcattggaagtgaaggtacattgggcattgtcacgaaggtctcaatactcacgcctcct aagctagtatcaaataacatcgcgtttcttgcttgtaaagacttttcaagttgtcagaaa ttactattggaggccaagagaggcttaggcgatgttatttctgcatttgaatttatggat agccattctatggatatggttttaaatcacttagagggcgtccgcaaccctttacctcca tcat---tatacaatttttatgttcttattgagacaaccagtagcgatga------atca tatgacaaagctaagcttgaagccttcttgttaagttacatggaagatggtctcatatca gatggtgttatagctcaggacatgaaccaagcttcttctttttggcgaatccgcgagggt"

  • If I use pxconsq, the output consensus looks like this:

pxconsq -s genes_mafft/g353_alignment/g4471_mafft.fasta

">consensus NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGSRAVYRTNYTNGGYMTNGARGYWGTYHYRNNNSCHRAYGGNRHNVTNVTBGAYATKNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNKSHHTKRMHNTGGYHHTRHVHHAHHTRGANGGHSYNMRNRAYCCHBTRNNNBYHKYRNNNNNNNNNAAHTTYTATRTYBTRATYGAGACVACNNNNRGYRVHGANNNNNNNWCNHHTGAYNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN"

  • By contrast, the 50% majority-rule consensus sequence generated by Geneious looks like this:

cat g353_consensus/g4471.fasta

">g4471_mafft_consensus_sequence AATGTRATTCAGGATGAAGABAGACTGBHDRCTGCAAATACRGATTGGATGCGTAAATACAAAGGCTCAAGTAAGCTTYTGCTCCAACCTAGGAGCACTGARGAGGTTTCTCAGATTCTTAAATACTGTAATTCYAGACGCTTGGCTGTTGTTGTATGTGAAGCAGGATGCATATTGGAAAATYTGGTTTCTTTCCTGGAYAATSAAGGATTTATTATGCCACTDGACTTRGGTGCAAAAGGAAGCTGCCAAATTGGTGGAAATGTTTCAACTAATGCTGGTGGTTTGCGCYTTGTCCGTTATGGATCACTTCATGGAAATGTACTTGGTCTTGAAGCTGTTTTAGCAAATGGTACTGTGCTTGACATGCTTGGGACTTTACGYAAAGATAATACTGGRTATGACTTGAAGCATTTGTTTATAGGAAGTGAAGGATCMTTGGGAATTGTMACTAAGGTTTCMATACTTACYCCTCCRAAGCTATCTTCAGTWAATSTWGCTTTTCTTGCWTGTAAAGATTATTTCAGCTGCCAGAAACTTCTATTGGAAGCCAAGAGGAARCTTGGRGAGATTCTMTCTGCATTTGAATTTTTGGATARCCADTCAATGGATYTGGTTCTGAATCATTTAGAAGGTGTTCGRAATCCATTACCTCCMTCAMTGCACAACTTTTATGTTCTGATTGAGACAACAGGCAGTGATGAATCTTATGACAAAGAGAAGCTTGAAGCYTTCCTACTTCGCTCAATGGAAGGTGGTTTGATATCTGATGGTGTTATTGCACAAGACATAAACCAAGCATCATCATTTTGGCGWATWCGTGAGGGT"

Please let me know if you have questions.

Thanks!

Miao

Cactusolo, Jun 18 '19 03:06

Kewl. Thanks for putting this together.

josephwb, Jun 18 '19 16:06

Hey @Cactusolo can you send me the complete file so I can match these expectations exactly? phylo dot jwb at gmail dot com

josephwb, Jun 18 '19 17:06

@josephwb done.

Cactusolo, Jun 19 '19 03:06