abPOA user specifiable seeds

Open benedictpaten opened this issue 2 years ago • 6 comments

Hi @yangao07, I've been experimenting a little with the seeding in abPOA and am wondering whether it would be possible to add an option for users to provide alignment seeds. My issue is that, for more divergent sequences, minimizers are not ideal for anchoring. I have had more luck using maximal unique matches (MUMs) with a chaining process more like that in the original MUMmer program. Looking forward, I also see a time when we will want to anchor the alignments on unique markers to facilitate the alignment of highly repetitive sequences (e.g. satellite arrays). I'm interested in your perspective on this.
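For readers unfamiliar with MUM-style anchoring, here is a minimal Python sketch of the idea, not the MUMmer algorithm itself and not anything abPOA provides. It takes k-mers that occur exactly once in each sequence and extends each shared one in both directions, so it only recovers the maximal unique matches that contain at least one such k-mer. The function names and the default `k` are assumptions made for illustration.

```python
from collections import Counter


def unique_kmer_positions(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


def mum_like_anchors(a, b, k=4):
    """Return sorted (a_start, b_start, length) exact matches.

    Each match contains a k-mer unique in both a and b, so the whole match
    is unique in both, and it is extended until it cannot grow further.
    """
    unique_a = unique_kmer_positions(a, k)
    unique_b = unique_kmer_positions(b, k)
    anchors = set()
    for kmer, i in unique_a.items():
        j = unique_b.get(kmer)
        if j is None:
            continue
        # Extend to the left while the preceding bases agree.
        while i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            i -= 1
            j -= 1
        # Extend to the right from the new start.
        n = 0
        while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
            n += 1
        # Seeds inside the same match collapse to one anchor via the set.
        anchors.add((i, j, n))
    return sorted(anchors)
```

For example, `mum_like_anchors("GATTACAGG", "CCGATTACATT", k=4)` returns `[(0, 2, 7)]`: the shared run `GATTACA` starts at position 0 in the first sequence and position 2 in the second.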

benedictpaten avatar Mar 24 '22 17:03 benedictpaten

Yes, theoretically, abPOA could take any type of seeding and chaining result to guide the POA process. I chose minimizers simply for speed. Using a more mature seeding method (MUMs) is definitely preferable for divergent sequences.
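To make "chaining result" concrete, below is a small Python sketch of collinear chaining over pairwise anchors. It is a plain quadratic dynamic program that picks the non-overlapping, collinear subset with the most anchored bases. It is an illustration only: it is not abPOA's internal chaining and not MUMmer's, and the `(a_start, b_start, length)` anchor tuple is an assumed representation.

```python
def chain_anchors(anchors):
    """Return the heaviest collinear chain of (a_start, b_start, length) anchors.

    An anchor may follow another only if it starts at or after the other's
    end in both sequences. Weight is the total anchored length.
    """
    anchors = sorted(anchors)
    if not anchors:
        return []
    best = [n for _, _, n in anchors]  # best chain weight ending at each anchor
    prev = [-1] * len(anchors)         # back-pointers for traceback
    for i, (ai, bi, ni) in enumerate(anchors):
        for j in range(i):
            aj, bj, nj = anchors[j]
            if aj + nj <= ai and bj + nj <= bi and best[j] + ni > best[i]:
                best[i] = best[j] + ni
                prev[i] = j
    # Trace back from the anchor that ends the heaviest chain.
    i = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]
```

Given `[(0, 0, 5), (10, 3, 4), (6, 20, 10), (20, 30, 6)]`, the anchor `(10, 3, 4)` is dropped because it is not collinear with the rest, and the chain is `[(0, 0, 5), (6, 20, 10), (20, 30, 6)]`.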

I think adding an option to take MUM seeds/anchors as input is much easier than implementing MUM finding inside abPOA directly. The only concern is that we need a well-defined input format.

yangao07 avatar Mar 25 '22 09:03 yangao07

Hi Yan,

That is great news. As a strawman, I'd suggest using PAF format to supply a set of pairwise anchors. Or would you prefer the anchors to span multiple sequences?

benedictpaten avatar Mar 25 '22 15:03 benedictpaten

PAF format is nice. To feed abPOA, we only need to record in the PAF file which sequence each anchor comes from. Requiring anchors across multiple sequences may be too stringent and could lead to too few seeds. I think pairwise should be just fine. Specifically, we just need the anchors between every two adjacent sequences. The order could be the input order or the order determined by a progressive guide tree (as you already know).
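As a rough illustration of what "pairwise anchors in PAF, kept only for adjacent sequences" could look like on the producer side, here is a Python sketch. It reads the standard PAF mandatory columns (query name, query start/end, strand, target name, target start/end) and filters records to pairs that are neighbours in a given order. The `Anchor` type and both function names are hypothetical; this is not an input format abPOA has committed to.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    query: str
    q_start: int
    q_end: int
    strand: str
    target: str
    t_start: int
    t_end: int


def parse_paf_anchors(lines):
    """Parse PAF lines (12 mandatory tab-separated columns) into Anchor records."""
    anchors = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        f = line.split("\t")
        if len(f) < 12:
            raise ValueError(f"expected at least 12 PAF columns, got {len(f)}")
        # Columns: 1 qname, 3 qstart, 4 qend, 5 strand, 6 tname, 8 tstart, 9 tend.
        anchors.append(Anchor(f[0], int(f[2]), int(f[3]), f[4],
                              f[5], int(f[7]), int(f[8])))
    return anchors


def anchors_for_adjacent_pairs(anchors, order):
    """Keep only anchors whose two sequences are neighbours in `order`."""
    adjacent = {frozenset(pair) for pair in zip(order, order[1:])}
    return [a for a in anchors if frozenset((a.query, a.target)) in adjacent]
```

With `order = ["s1", "s2", "s3"]`, an anchor between `s2` and `s1` is kept, while one between `s3` and `s1` is discarded because those two sequences are not adjacent in the order.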

yangao07 avatar Mar 28 '22 03:03 yangao07

Yes, if you can create a function for this, then we can definitely use it. If you prefer to create some kind of object to define the seeds, we can also work with that. Thanks,

Benedict

benedictpaten avatar Mar 28 '22 15:03 benedictpaten

I think for Cactus, it's important to have an API to pass the anchors in via a struct (as opposed to FILE*). Whether that struct is PAF-based or not is less important.

Also, if we are going to keep using abPOA's progressive ordering, then we'd need an API to get that (if it's not already there) before computing the MUM anchors. Something like:

[abpoa] get_progressive_order(sequences)
[cactus] compute_mum_anchors(sequences, order)
[abpoa] get_msa(sequences, anchors)
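The first two steps of that flow can be sketched in Python with in-memory records rather than a file. Everything here is a hypothetical stand-in: `PairAnchor`, `get_progressive_order`, and `compute_adjacent_anchors` are illustrative names, not abPOA or Cactus functions, and the ordering function simply returns the input order where a real one would consult a guide tree. The third step, the MSA call that consumes the anchors, is left out because its interface is exactly what is being discussed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PairAnchor:
    """One exact match between two sequences (hypothetical struct layout)."""
    i_pos: int   # start in the earlier sequence of the pair
    j_pos: int   # start in the later sequence of the pair
    length: int


AnchorFinder = Callable[[str, str], List[PairAnchor]]


def get_progressive_order(sequences: List[str]) -> List[int]:
    """Placeholder for step 1: return sequence indices in input order."""
    return list(range(len(sequences)))


def compute_adjacent_anchors(
    sequences: List[str], order: List[int], find_anchors: AnchorFinder
) -> Dict[Tuple[int, int], List[PairAnchor]]:
    """Step 2: run the caller's anchor finder on each adjacent pair in `order`."""
    return {
        (i, j): find_anchors(sequences[i], sequences[j])
        for i, j in zip(order, order[1:])
    }
```

The resulting dictionary, keyed by adjacent index pairs, is the kind of object that could be handed to the aligner directly, with no intermediate file.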

thanks!

glennhickey avatar Mar 28 '22 16:03 glennhickey

Yes, totally agree, Glenn.

benedictpaten avatar Mar 28 '22 20:03 benedictpaten