BandageNG
BandageNG copied to clipboard
Support larger GFAs
Large genome / metagenome could generate very huge assembly graph. We need to see how to load them properly.
Few ideas straight from the head:
- Switch to GFA parser from https://github.com/lh3/gfatools Our experience show that it's pretty efficient.
- Some additional indexing could be done, so we won't need to keep the whole graph in the memory
The main problem is storage of the graph in memory. I would suggest switching to a library based on succinct data structures. Libhandlegraph and odgi both have suitable implementations. These libraries provide parsing that's plenty efficient. They also support path queries.
On Wed, Sep 22, 2021, 06:15 Anton Korobeynikov @.***> wrote:
Large genome / metagenome could generate very huge assembly graph. We need to see how to load them properly.
Few ideas straight from the head:
- Switch to GFA parser from https://github.com/lh3/gfatools Our experience show that it's pretty efficient.
- Some additional indexing could be done, so we won't need to keep the whole graph in the memory
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/asl/Bandage/issues/5, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQENVNMFQH4ULGKS3X7LUDFJ6VANCNFSM5EQK2WCA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
Well, surprisingly enough, the internal storage of gfatools is quite efficient (as it's essentially the parsed GFA) - we do use gfatools for GFA parsing in SPAdes is it works quite well. One step at a time :)
It's not just about putting it in memory, but about providing efficient random access to its parts. The paths are particularly ruinous in this regard. You can have fast path iteration if you've just loaded the GFA into memory (you can iterate over the steps in a path), but it's ultra expensive to find all the paths at a given node.
A good interface for this is gbwtgraph: https://github.com/jltsiren/gbwtgraph into Bandage. That typically uses less memory than the GFA, and it'll provide random access, path iteration, and path pattern matching queries (e.g. subpath frequency).
That said, any dependency-light way to get the needed access patterns is good!
On Wed, Sep 22, 2021 at 8:10 AM Anton Korobeynikov @.***> wrote:
Well, surprisingly enough, the internal storage of gfatools is quite efficient (as it's essentially the parsed GFA) - we do use gfatools for GFA parsing in SPAdes is it works quite well. One step at a time :)
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/asl/Bandage/issues/5#issuecomment-924616093, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEOVYACU4GZFKRKEULDUDFXM3ANCNFSM5EQK2WCA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
@ekg It seems that gbwtgraph is not quite suitable as a generic graph interface. Especially the following treatment of GFAs:
Overlaps, containments, and tags are ignored.
Links are induced by the paths; the tool ignores L-lines.
is certainly pangenome-centric and certainly are not suitable for generic GFAs including e.g. assembly graphs.
I see, you're totally right!
Alternatives are odgi and libhandlegraph. The PackedGraph model is especially efficient.
Sorry, specifically you'll want libbdsg: https://github.com/vgteam/libbdsg. libhandlegraph is an abstract interface in C++, not an implementation.
Adding overlap lengths to the link lines would be the missing part, otherwise I think it should be more directly applicable to generic GFAs that gbwtgraph.
#17 Replaced old regexp-based GFA parser with one from GFA tools. Significant speedup was achieved on this step
https://github.com/asl/Bandage/pull/33 Added stream-oriented GFA parser with almost zero copies in the most cases. As a result of all recent changes the memory consumption and runtime dropped 5x-10x depending on the graph
For an example of a library that uses vg's bidirected sequence graph (bdsg) library, and also maintains overlap data, you could check out https://github.com/vgteam/GetBlunted
bdsg has PackedGraph and HashGraph implementations.
I doubt Bandage is ready for huge change of internal representation due to ways how's everything is written and done inside (though patches are surely welcome). That said, the recent changes made quite significant improvements in memory consumption and many large (meta-)genome graphs could be loaded on a laptop.