abPOA icon indicating copy to clipboard operation
abPOA copied to clipboard

GFA output

Open ekg opened this issue 3 years ago • 15 comments

Have you considered adding GFA output to abPOA?

I am likely to make a patch to do this, but I wouldn't mind if you beat me to it. 🙂

ekg avatar Sep 19 '20 05:09 ekg

Thanks for the suggestion! That should not be very hard to do, I will give it a try!

yangao07 avatar Sep 19 '20 08:09 yangao07

Ideally it would match what's in spoa. It records the walks made by the sequences through the graph as P lines. Should be straightforward if you're making the MSA.

On Sat, Sep 19, 2020, 10:25 Yan Gao [email protected] wrote:

Thanks for the suggestion! That should not be very hard to do, I will give it a try!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/yangao07/abPOA/issues/2#issuecomment-695183408, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEIFKW53XKERRIENQITSGRTHLANCNFSM4RS2AGKQ .

ekg avatar Sep 19 '20 11:09 ekg

Hi Erik,

I just updated the abPOA repo, please try out the GFA format output with parameter '-r4'. Let me know if this works for you!

Yan

yangao07 avatar Sep 22 '20 03:09 yangao07

Hey! This looks great. Here's one of the graphs I made with it.

abpoa ~/HLA-zoo/seqs/DRB1-3123.fa -r 4 >DRB1-3123.abpoa.gfa

In Bandage:

graph

This is the view of the graph given by odgi viz:

odgi build -g DRB1-3123.abpoa.gfa -o - | odgi unchop -i - -o - | odgi sort -i - -o - >DRB1-3123.abpoa.og
odgi viz -i DRB1-3123.abpoa.og -b -w 20 -y 500 -o DRB1-3123.abpoa.og.viz.png

DRB1-3123 abpoa og viz

And this will show if there are any inversions in the graph.

odgi viz -i DRB1-3123.abpoa.og -b -w 20 -y 500 -o DRB1-3123.abpoa.og.viz.inv.png -z

DRB1-3123 abpoa og viz inv

I was surprised to not see any, because one of the sequences in the set is inverted relative to the others.

edyeet -X -n 20 -N -n 10 DRB1-3123.fa DRB1-3123.fa
...
gi|345525392:5000-18402 13403   0       13403   -       gi|568815567:3779003-3792415    13413   2       13404   13391   13403   33      id:f:0.999552   ma:i:13391      mm:i:1  ni:i:0  nd:i:10ns:i:11 ed:i:22 al:i:13413      se:f:0.0016402  cg:Z:4894=10D6889=1X1608=11I

This test is not possible to complete using spoa, or I would show a direct comparison. You can see the orientation in the pggb output though:

pggb -i DRB1-3123.fa -s 3000 -p 70 -a 70 -n 10 -t 16 -v -l -k 8

DRB1-3123 fa pggb-s3000-p70-n10-a70-K16-k8-w10000-j5000-W0-e100 smooth og viz inv

So it would seem that when you write the sequence in the GFA, you'd want to indicate that it's inverting if it aligns in the reverse orientation. I patched spoa to support this as well, so it shouldn't be hard to fix here.

By the way, the direct abPOA output is much nicer than pggb here. In pggb, the smoothxg step can't handle very large graphs due to the memory requirements of the exact alignment calculation in spoa. Swapping in abPOA should resolve that to some extent.

This is the input if you'd like to test: https://github.com/ekg/HLA-zoo/blob/master/seqs/DRB1-3123.fa.

ekg avatar Sep 22 '20 11:09 ekg

Hi @ekg ,

I added a minimizer-based seeding in the latest version of abPOA (v1.2.0). It can reduce the memory usage for very long input sequence, e.g this DRB1. Also, it should produce similar or even better alignment results.

Yan

yangao07 avatar May 15 '21 10:05 yangao07

That sounds really useful. Limitations on the length of the POA length cause a lot of problems. What kind of procedure are you using for the seeding?

On Sat, May 15, 2021, 12:29 Yan Gao @.***> wrote:

Hi @ekg https://github.com/ekg ,

I added a minimizer-based seeding in the latest version of abPOA ((v1.2.0)[ https://github.com/yangao07/abPOA/releases/tag/v1.2.0]). It can reduce the memory usage for very long input sequence, e.g this DRB1.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/yangao07/abPOA/issues/2#issuecomment-841636487, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEIA64UG5VSF6O3N5PTTNZEHZANCNFSM4RS2AGKQ .

ekg avatar May 15 '21 20:05 ekg

The seeding step is actually very simple. I collected all the minimizer hits between two sequences based on the input order. Then, I applied a similar chaining strategy as used in minimap2 to find all the co-linear chains, but not allowing large gaps. Lastly, all the non-overlapping co-linear chains are chained together to produce the final chaining result, which guides the POA in the next step.

yangao07 avatar May 16 '21 05:05 yangao07

How do you choose the reference sequence for the seeding?

On Sun, May 16, 2021, 07:43 Yan Gao @.***> wrote:

The seeding step is actually very simple. I collected all the minimizer hits between two sequences based on the input order. Then, I applied a similar chaining strategy as used in minimap2 to find all the co-linear chains, but not allowing large gaps. Lastly, all the non-overlapping co-linear chains are chained together to produce the final chaining result, which guides the POA in the next step.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/yangao07/abPOA/issues/2#issuecomment-841772027, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEIMDUQRKTIA5YAP6V3TN5LRJANCNFSM4RS2AGKQ .

ekg avatar May 16 '21 09:05 ekg

Based on the input order, the i-th sequence is the reference for the (i+1)-th.

yangao07 avatar May 16 '21 12:05 yangao07

That's very pragmatic. Other strategies might be more sensitive, but could have much higher performance costs. If we can order the input sequences as they would sit in a neighbor joining tree, we might be able to get that to perform as well as possible.

On Sun, May 16, 2021 at 2:05 PM Yan Gao @.***> wrote:

Based on the input order, the i-th sequence is the reference for the (i+1)-th.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/yangao07/abPOA/issues/2#issuecomment-841808176, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEONFZV2X56LUOM45CDTN6YIDANCNFSM4RS2AGKQ .

ekg avatar May 16 '21 13:05 ekg

What are the longest sequences you've tested this on?

On Sun, May 16, 2021 at 3:44 PM Erik Garrison @.***> wrote:

That's very pragmatic. Other strategies might be more sensitive, but could have much higher performance costs. If we can order the input sequences as they would sit in a neighbor joining tree, we might be able to get that to perform as well as possible.

On Sun, May 16, 2021 at 2:05 PM Yan Gao @.***> wrote:

Based on the input order, the i-th sequence is the reference for the (i+1)-th.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/yangao07/abPOA/issues/2#issuecomment-841808176, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEONFZV2X56LUOM45CDTN6YIDANCNFSM4RS2AGKQ .

ekg avatar May 16 '21 13:05 ekg

The longest one is actually just this DRB1.

yangao07 avatar May 17 '21 01:05 yangao07

I will share some slices of a human chromosome that are from 10kb to 1mbp.

On Mon, May 17, 2021, 03:44 Yan Gao @.***> wrote:

The longest one is actually just this DRB1.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/yangao07/abPOA/issues/2#issuecomment-841921734, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQENLQABILD7Q2DPKXFTTOBYIBANCNFSM4RS2AGKQ .

ekg avatar May 17 '21 06:05 ekg

That will be great. Curious what will happen if we feed abPOA with that long sequences.

yangao07 avatar May 17 '21 07:05 yangao07

Just updated to v1.2.1. Removes a redundant and very time-consuming sorting step.

yangao07 avatar May 18 '21 08:05 yangao07