GraphAligner

feature wishlist

Open ekg opened this issue 6 years ago • 12 comments

Here are some things that I'd love to see in GraphAligner:

  • Non-GAM textual output (GAF?) to simplify downstream processing of alignments
  • Independent aligner library (to embed in other tools)
  • Seeding based on minimizers from paths embedded in the graph, possibly via a GBWT index
  • Optimizations for short reads and chromosome scale contigs
  • Building independently from conda (with or without GAM/.vg support)

I'd love to help you implement these. Please let me know how I can help.

ekg avatar Nov 04 '19 09:11 ekg

  • Non-GAM textual output (GAF?) to simplify downstream processing of alignments

This sounds totally reasonable.

  • Independent aligner library (to embed in other tools)

I'd like to do this as well. It will take some refactoring so I'm not sure about the timeline.

  • Seeding based on minimizers from paths embedded in the graph, possibly via a GBWT index

This needs some way of picking the paths. Embedding the paths in the graph would work, it just needs to get that information from reading the GFA to building the minimizer index. Once we have the paths as vectors of node IDs the current minimizer code could be used with some changes.

  • Optimizations for short reads and chromosome scale contigs

This needs the path based indexing first. The condition for clipping the alignment will need some work as well. A good start would be to just get an example graph and dataset and look at how exactly it breaks for short reads.

  • Building independently from conda (with or without GAM/.vg support)

Could you give an example use case for this?

maickrau avatar Nov 05 '19 12:11 maickrau

Non-GAM textual output (GAF?) to simplify downstream processing of alignments

This sounds totally reasonable.

I saw you just released this. Nice! You're the second implementation. I'll add output for this to vg map and we can have three.

In the future, I'd like to work on a binary version of this, hopefully with more flexibility than SAM/BAM (the crisis over long CIGAR strings comes to mind) and ideally considering support for graph to graph alignment (this requires the query and targets to both be paths through graphs).

There are some remaining points of confusion for me about GAF. For instance, how do we encode structural variation, such as where we break an alignment mid-node and restart it far away? I think we might have to use the hard clip operators in the CIGAR strings to skip over parts of reference sequence nodes, listing them in order. This could produce very weird results when the nodes are extremely long. This is a little easier to deal with in the model where each node has its own alignment description, including a start position and orientation on the node. It's not too late for us to go down this route, and I think it has significant benefits for whole genome alignments. CC @lh3

Seeding based on minimizers from paths embedded in the graph, possibly via a GBWT index

This needs some way of picking the paths. Embedding the paths in the graph would work, it just needs to get that information from reading the GFA to building the minimizer index. Once we have the paths as vectors of node IDs the current minimizer code could be used with some changes.

Right. Realistically, this would have to happen over paths embedded in the graph. The GBWT could be used externally to support that. Also, @jltsiren has a tool that generates a covering path set of the graph. That could be useful when you don't have paths handy, to allow clustering to run across nodes.
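To make the idea above concrete, here is a minimal sketch of turning embedded GFA paths into vectors of node IDs and picking minimizers along the spelled path sequence. It is not GraphAligner's minimizer code: the function names (`parse_gfa`, `spell_path`, `minimizers`) are hypothetical, the graph is assumed to be GFA1 with blunt (zero-overlap) links, and minimizers are chosen by plain lexicographic order rather than a hash.

```python
from typing import Dict, Iterable, List, Tuple

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def parse_gfa(lines: Iterable[str]):
    """Collect segment sequences (S lines) and embedded paths (P lines) from GFA1 text."""
    segments: Dict[str, str] = {}
    paths: Dict[str, List[Tuple[str, str]]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "P":
            # Path steps look like "1+,2-,3+": a node ID followed by its orientation.
            paths[fields[1]] = [(step[:-1], step[-1]) for step in fields[2].split(",")]
    return segments, paths


def spell_path(segments: Dict[str, str], steps: List[Tuple[str, str]]) -> str:
    """Concatenate node sequences along a path (assumes zero-overlap links)."""
    return "".join(
        segments[node] if orient == "+" else reverse_complement(segments[node])
        for node, orient in steps
    )


def minimizers(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Return (position, k-mer) for the smallest k-mer in each window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Ties are broken by position so the choice is deterministic.
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)
```

A real index would also need to map each minimizer position back to a node and offset, which this sketch leaves out.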

Building independently from conda (with or without GAM/.vg support)

Could you give an example use case for this?

It would probably simplify packaging and future support. I've been learning about GNU Guix, which is an interesting alternative to conda that seems somewhat more future-resistant.

ekg avatar Nov 22 '19 14:11 ekg

For instance, how do we encode structural variation, such as where we break an alignment mid-node and restart it far away?

Like in SAM and PAF, you create two or more lines for one query sequence.
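As an illustration of the multi-line convention, the sketch below reads GAF records and groups them by query name so that the pieces of a split alignment can be inspected together. The helper names (`GafRecord`, `parse_gaf_line`, `group_by_query`) are hypothetical; the column layout follows the GAF description (query name, length, start, end, strand, path, path length, path start, path end, then match counts and mapping quality), and only oriented-segment paths such as `>1<2` are handled.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class GafRecord(NamedTuple):
    query: str
    query_len: int
    query_start: int
    query_end: int
    path: List[Tuple[str, str]]  # (segment name, ">" or "<")
    path_start: int
    path_end: int


# Matches oriented steps such as ">12" or "<7"; a stable path name
# without an orientation prefix would yield an empty step list.
STEP = re.compile(r"([<>])([^<>]+)")


def parse_gaf_line(line: str) -> GafRecord:
    f = line.rstrip("\n").split("\t")
    path = [(name, orient) for orient, name in STEP.findall(f[5])]
    return GafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), path, int(f[7]), int(f[8]))


def group_by_query(lines: Iterable[str]) -> Dict[str, List[GafRecord]]:
    """Group records by query name, ordered by start position on the query."""
    groups: Dict[str, List[GafRecord]] = defaultdict(list)
    for line in lines:
        if line.strip():
            rec = parse_gaf_line(line)
            groups[rec.query].append(rec)
    for recs in groups.values():
        recs.sort(key=lambda r: r.query_start)
    return dict(groups)
```

With this layout, a read spanning a structural variant shows up as two or more records under the same key, and the relationship between the pieces has to be inferred from their query and path coordinates.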

lh3 avatar Nov 24 '19 00:11 lh3

That works, but it encourages information loss and ambiguity. I would prefer that SVs discovered directly by alignment be described in a single line.

A whole genome aligner will make such alignments. Where the alignment is broken at SVs it might make sense to emit multiple lines, but that's not how all methods work.

There is a difference between a collection of alignments between sequences and graphs and an optimal alignment fully covering a sequence or graph query. Breaking up the alignment into fragments makes it difficult to communicate this.

ekg avatar Nov 24 '19 09:11 ekg

Building independently from conda (with or without GAM/.vg support)

Could you give an example use case for this?

Can I add that conda is sometimes incompatible with existing installations of other software? In particular, conda does not provide functional linking for the pybind11 C++ library, which is required for Shasta to build properly. So GraphAligner and Shasta cannot easily coexist on the same machine.

And for anyone wanting to contribute to this project, building with conda is not ideal.

rlorigro avatar Mar 23 '20 18:03 rlorigro

We have a guix package model to build it. I think it is in my guix-genomics repo.

ekg avatar Mar 23 '20 20:03 ekg

@maickrau I believe...

Optimizations for [...] chromosome scale contigs

... is what we just discussed as "should be possible to align whole genomes w/o chopping them up into smaller pieces first"

ptrebert avatar Apr 29 '21 14:04 ptrebert

  • Seeding based on minimizers from paths embedded in the graph, possibly via a GBWT index

This needs some way of picking the paths. Embedding the paths in the graph would work, it just needs to get that information from reading the GFA to building the minimizer index. Once we have the paths as vectors of node IDs the current minimizer code could be used with some changes.

In its current form, does GraphAligner use the path information embedded in GFA files to assist in read alignment?

hgibling avatar Sep 18 '22 20:09 hgibling

The current version does not use the embedded paths.

maickrau avatar Sep 19 '22 05:09 maickrau

@maickrau Thank you for developing this awesome aligner! I wonder if there is a way to generate multiple hits besides the best alignment. If not, is it possible to include that in GraphAligner? It will be very helpful for some studies.

Best, Jianjun

Kinggerm avatar Feb 03 '23 04:02 Kinggerm

@Kinggerm GraphAligner already provides this feature. Secondary alignments are output by setting the parameter --multimap-score-fraction.
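For reference, a small sketch of assembling such a command from Python. Only `--multimap-score-fraction` comes from this thread; the other flags (`-g`, `-f`, `-a`, `-x vg`) are taken from the GraphAligner README as I recall it and should be checked against `GraphAligner --help`. The helper name, file names, and the value 0.9 are placeholders.

```python
from typing import List


def graphaligner_command(graph: str, reads: str, out_gaf: str,
                         score_fraction: float) -> List[str]:
    """Build an argument list for a GraphAligner run that also reports secondary alignments."""
    return [
        "GraphAligner",
        "-g", graph,      # input graph (GFA)
        "-f", reads,      # input reads
        "-a", out_gaf,    # output alignments
        "-x", "vg",       # parameter preset (assumed from the README)
        "--multimap-score-fraction", str(score_fraction),
    ]


cmd = graphaligner_command("graph.gfa", "reads.fa", "aln.gaf", 0.9)
# To actually run it (requires GraphAligner on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```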

danydoerr avatar Feb 03 '23 09:02 danydoerr

@Kinggerm GraphAligner already provides this feature. Secondary alignments are output by setting the parameter --multimap-score-fraction.

Uh, I should have known. I'm sorry that I misunderstood it. Thank you very much for the comment.

Kinggerm avatar Feb 03 '23 14:02 Kinggerm