BandageNG icon indicating copy to clipboard operation
BandageNG copied to clipboard

Support larger GFAs

Open asl opened this issue 2 years ago • 11 comments

Large genome / metagenome could generate very huge assembly graph. We need to see how to load them properly.

Few ideas straight from the head:

  • Switch to GFA parser from https://github.com/lh3/gfatools Our experience show that it's pretty efficient.
  • Some additional indexing could be done, so we won't need to keep the whole graph in the memory

asl avatar Sep 22 '21 04:09 asl

The main problem is storage of the graph in memory. I would suggest switching to a library based on succinct data structures. Libhandlegraph and odgi both have suitable implementations. These libraries provide parsing that's plenty efficient. They also support path queries.

On Wed, Sep 22, 2021, 06:15 Anton Korobeynikov @.***> wrote:

Large genome / metagenome could generate very huge assembly graph. We need to see how to load them properly.

Few ideas straight from the head:

  • Switch to GFA parser from https://github.com/lh3/gfatools Our experience show that it's pretty efficient.
  • Some additional indexing could be done, so we won't need to keep the whole graph in the memory

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/asl/Bandage/issues/5, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQENVNMFQH4ULGKS3X7LUDFJ6VANCNFSM5EQK2WCA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

ekg avatar Sep 22 '21 06:09 ekg

Well, surprisingly enough, the internal storage of gfatools is quite efficient (as it's essentially the parsed GFA) - we do use gfatools for GFA parsing in SPAdes is it works quite well. One step at a time :)

asl avatar Sep 22 '21 06:09 asl

It's not just about putting it in memory, but about providing efficient random access to its parts. The paths are particularly ruinous in this regard. You can have fast path iteration if you've just loaded the GFA into memory (you can iterate over the steps in a path), but it's ultra expensive to find all the paths at a given node.

A good interface for this is gbwtgraph: https://github.com/jltsiren/gbwtgraph into Bandage. That typically uses less memory than the GFA, and it'll provide random access, path iteration, and path pattern matching queries (e.g. subpath frequency).

That said, any dependency-light way to get the needed access patterns is good!

On Wed, Sep 22, 2021 at 8:10 AM Anton Korobeynikov @.***> wrote:

Well, surprisingly enough, the internal storage of gfatools is quite efficient (as it's essentially the parsed GFA) - we do use gfatools for GFA parsing in SPAdes is it works quite well. One step at a time :)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/asl/Bandage/issues/5#issuecomment-924616093, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEOVYACU4GZFKRKEULDUDFXM3ANCNFSM5EQK2WCA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

ekg avatar Sep 22 '21 08:09 ekg

@ekg It seems that gbwtgraph is not quite suitable as a generic graph interface. Especially the following treatment of GFAs:

Overlaps, containments, and tags are ignored.
Links are induced by the paths; the tool ignores L-lines.

is certainly pangenome-centric and certainly are not suitable for generic GFAs including e.g. assembly graphs.

asl avatar Feb 09 '22 08:02 asl

I see, you're totally right!

Alternatives are odgi and libhandlegraph. The PackedGraph model is especially efficient.

ekg avatar Feb 09 '22 13:02 ekg

Sorry, specifically you'll want libbdsg: https://github.com/vgteam/libbdsg. libhandlegraph is an abstract interface in C++, not an implementation.

ekg avatar Feb 09 '22 13:02 ekg

Adding overlap lengths to the link lines would be the missing part, otherwise I think it should be more directly applicable to generic GFAs that gbwtgraph.

ekg avatar Feb 09 '22 13:02 ekg

#17 Replaced old regexp-based GFA parser with one from GFA tools. Significant speedup was achieved on this step

asl avatar Mar 30 '22 08:03 asl

https://github.com/asl/Bandage/pull/33 Added stream-oriented GFA parser with almost zero copies in the most cases. As a result of all recent changes the memory consumption and runtime dropped 5x-10x depending on the graph

asl avatar May 31 '22 17:05 asl

For an example of a library that uses vg's bidirected sequence graph (bdsg) library, and also maintains overlap data, you could check out https://github.com/vgteam/GetBlunted

bdsg has PackedGraph and HashGraph implementations.

rlorigro avatar Jun 02 '22 18:06 rlorigro

I doubt Bandage is ready for huge change of internal representation due to ways how's everything is written and done inside (though patches are surely welcome). That said, the recent changes made quite significant improvements in memory consumption and many large (meta-)genome graphs could be loaded on a laptop.

asl avatar Jun 02 '22 18:06 asl