
Replace gmap with minimap2 in align_contigs_against_genome

Open nadiadavidson opened this issue 2 years ago • 2 comments

Many false positives seem to result from poor alignment of contigs to the genome, which leads to bad annotation, e.g. k49_1134199 0 chr21 44220603 3 281S16M6810N5M2435N6M6199N13M50029N16M1I6M32I32M21I4M4911N3M16052N8M8047N10M2202N2I11M1I25M3I22M6D11M750N10M1163N4M1809N3M1264N7M24156N4M6825N5M3920N2M12899N7M4539N3M7819N8M10199N7M1D1M4453N8M5667N2M11730N3M55386N4M402730N2M6445N3I6M15389N1M13514N9M1I47M2N4I126M15D1279M * 0 0 CAGCGCTCCTGGCCCCCCGAAGTCCCAGAGCTGCTGACCCCCACCCCAGCTGCATCAGAGAGCCTGTCTGGGGCCAAGGTTGCCAGAGATTTCTGAAGACACAGCTTGTTCCTTGTTCTTGGCTGGTGGGTGCACAAGGACTTCTGGAAGGGATTTAGACGGGGCTGAGTGCTAGGATTAAAGTGGGGATGGGAGTACGGCAACAGAAAAACCTGGGAGCTAGCAATGCACCCAGCCCTTGACTGTGCCCTGGTGGACAGCCGAGCTGTGGCTCTAGCGTGAGCCAGTGCCTTCCTGTCCCTGCCAAGGGTGAGGCCAGAGTTGGCCCCGAGGCTAATGTTTCAGTGGGTGAGATTAGGTCGGCCGTACAGAGGCCGGTGGGCTCCCTGACATCCCTTCCAGGCAACCTGAAAGCACTGAAATAGCTTATGGCCCTGTGCCAGGGACCTTGGCCCAAGCTGCTGACCTCCAGGGTGGGGAGGGAGCTACCCCCAGGAGAAGAGTCACTCAGACAGCAGTATGAGCAAGCCAGCCAGCAGCTCCGTGCCTGCACCCAGCTCAGGGGAATCCCAGGGGGTTCAGATGCCCAGGAAGGAAAAGGGGACAGCGCTACTGCTATGGAATGAGACCACCACTTCTCCTGTTGTCCTTCCCAGCTTCTCCCCAACCTCCCCTTTTCCCTAGTTTATAAGACAGGAGAAAAGGGAGAAAGCAAAAAGCTGGAAAGAAACAGAAGTAAGATAAATAGCTAGACGACCTTGGCGCCACCACCTGGCCCTGGTGGTTAAAATGATAATAATATTAACCCCTGACCAAAACGACTGGTGTTATCTGTAAATCCCAGACATTGTGTGAGAAAGCACCGTAAAACTTTTTGTCCTATTAGCTGATGTGTGTAGCCCCCAGTCACGTTCCTCACGCTTACTTGATCTATTATGACCCTTTCACGTGGACCCCTTAGAGTTGTAAGCTCTTAAAAGGGCTAGGAATTTCTTTTTCGGGGAGCTCGGCTCTTAAGACGCAAGTCTGCTGACACTCCTGGCCAAATAAAGCCCTTCCTTCTTTAACCGAGTGTCTGAGGAATTCTGTCTGCGGCTTGTCCGGCTACAACGGTGCTGGAGCCCAGACTCTCAGGGAAAGGAACCCGAGCCGTCAGAAAACCATCTGATTCCAGGCTGGGGCAAGGGACATGGAGATGGGCCTGCAGCATCATGTTGCTCCAGAAAGCAAGAAAGTGCTCAGAACGGTAGAACGGGGATGCATGGACAGGACACGCAGCCAGACCTAGCGGATTTGAGCATCTCGGGGAAGAAAGGACAGCCACAGATCATGCACTACTGAACAAAATAAAACTGTGGGTCACGCTGATGAGAGAGAGGCTGCAGAGAAGGAGAGACCCTTCCTTAGGTTGGCAGCCGTGAGTGGCAGGCGGGGACCAGCACGGCACCAATCTGCAGCCATCGCAGTGATGGCGGCTTCAGGCGGGGACCTCCGCGGATGCTGAGCCTGCGGGTGCGATTTGATGAGGGCAGAACCTCACCAGCCCACAGTGGCTGCGAGGGGAT
CATGCAGCGGGATGGGGAGGCCGGGGGGATGCCGTCTCAGCAGAGCCGTCCACGCTGACCTCATCAAGACTGGGACGGGGCCACAGCAGTGCCTCTCATGGGCACTTAGGACACCGTCACTGAGGGGCTCCTGCCAAAGCACACCTGAGTCCAGGCAGAGGAAACTCCAGACAAGACCCCCGAGGGTCATGCTACAAAGCTGCTCTCCTGACTTCCTCAGAAACGCCCAAGGACAGGAAAGACAAAGAAAGCTGAGGACTTGTCCAGATTCAAGAAGCCCAAGGAGACGGCTGAGCGTAGGGCGAGCCTGGGTGAGGAGATTCAGAGCGTTAGACGGCTGAGCGCAGTGTGTGAACCTGGGTTAGGAGATTTGGGGCCTGAGATGGCTGAGTGCAGGGTGAGCCTGAGTGAGGAGATTCTGAGCCTGAGACAGCTGAGCACAGGGTGAGCCTGGGTGACAAAATCCACCAGGAAAATATGCTCACGAAGACATCATTGGGACAACCAATAAAATATGCGT * MD:Z:35AG4G1T8AC19C1C1CG3CC1GCT2CC2A7G24T4TCT3A2CCTC2GCT1A1T1T6C1T2TGAGGG2C1^GGGACA1CA1G48G17^G4A1G43C2C3C3TT7A1CC33A14T29C3T18C16C1A6T5TT^ATTATTATTATTAAC13T19T11A3A6TT1C8C3T2G1C8CA4A10A5C3A3G2CT6C1CA9T23C313C14A784 NH:i:1 HI:i:1 NM:i:175 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU

The read should align to chr21:43268915-43270392 and chr2:231884096-231893280.
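To put numbers on how fragmented the reported alignment is, the CIGAR string can be tallied per operation. The following is a minimal Python sketch and not part of MINTIE: the function name `summarise_cigar` is made up for illustration, and `EXAMPLE_PREFIX` holds only the first few operations of the CIGAR reported above rather than the full string.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Assumption: only the leading operations of the CIGAR from the record above,
# used here to keep the example short.
EXAMPLE_PREFIX = "281S16M6810N5M2435N6M6199N13M50029N16M"


def summarise_cigar(cigar):
    """Return (total bases per operation, number of N operations).

    N marks a skipped reference region, i.e. an inferred intron.
    """
    totals = Counter()
    n_skips = 0
    for length, op in CIGAR_OP.findall(cigar):
        totals[op] += int(length)
        if op == "N":
            n_skips += 1
    return totals, n_skips


if __name__ == "__main__":
    totals, n_skips = summarise_cigar(EXAMPLE_PREFIX)
    print("soft-clipped bases:", totals["S"])
    print("matched bases:", totals["M"])
    print("skipped reference bases:", totals["N"])
    print("number of skipped regions:", n_skips)
```

On these leading operations alone, the alignment soft-clips 281 bases and then chains five blocks of 5 to 16 matched bases across four skipped regions totalling roughly 65 kb, which is not what a clean spliced alignment of a contig looks like.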

Replacing gmap with minimap2 in the following stage could be a simple improvement (though it would require a fair amount of validation work). Keen to hear your thoughts, @mcmero.

align_contigs_against_genome = {
    def sample_name = branch.name
    output.dir = sample_name
    produce('aligned_contigs_against_genome.sam'){
        exec """
        $gmap -D $gmap_refdir -d $gmap_genome -f samse -t $threads -x $min_gap --max-intronlength-ends=500000 -n 0 $input.fasta > $output
        """, "align_contigs_against_genome"
    }
}
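For a quick trial outside the bpipe pipeline, the corresponding minimap2 call could be assembled roughly as below. This is a sketch under assumptions, not a tested replacement for the stage: `-a` (SAM output), `-x splice` (spliced alignment preset) and `-t` (threads) are standard minimap2 options, while the function names and the `genome.fa` / `contigs.fasta` paths are hypothetical. Note that minimap2 takes a reference FASTA (or a prebuilt `.mmi` index) rather than a GMAP database, so `$gmap_refdir` and `$gmap_genome` have no direct counterpart.

```python
import shlex
import subprocess


def minimap2_splice_command(genome_fasta, contigs_fasta, threads=4):
    """Build the argv list for a spliced minimap2 alignment that writes SAM to stdout.

    Hypothetical helper for illustration; paths are supplied by the caller.
    """
    return [
        "minimap2",
        "-a",              # emit SAM instead of the default PAF
        "-x", "splice",    # preset for spliced alignment
        "-t", str(threads),
        genome_fasta,      # reference FASTA or .mmi index
        contigs_fasta,     # assembled contigs to align
    ]


def run_alignment(genome_fasta, contigs_fasta, out_sam, threads=4):
    """Run minimap2 and write its SAM output to out_sam (requires minimap2 on PATH)."""
    cmd = minimap2_splice_command(genome_fasta, contigs_fasta, threads)
    with open(out_sam, "w") as sam:
        subprocess.run(cmd, stdout=sam, check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch works without minimap2.
    print(shlex.join(minimap2_splice_command("genome.fa", "contigs.fasta", threads=8)))
```

Whether the gmap-specific settings in the stage (`-x $min_gap`, `--max-intronlength-ends=500000`, `-n 0`) need minimap2 equivalents is part of the validation work and is not addressed by this sketch.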

nadiadavidson, Aug 10 '23 02:08

Thanks Nadia, I think replacing GMAP with minimap2 could work nicely to improve the contig alignments. Happy to implement in a separate branch. As you say, validation would be a lot of work, so would need to discuss further.

mcmero, Aug 21 '23 04:08

Thanks Marek, we'll have a go on a small dataset and then chat with you more if it looks promising. No need to make a separate branch at this stage.

nadiadavidson, Aug 22 '23 03:08