libhandlegraph icon indicating copy to clipboard operation
libhandlegraph copied to clipboard

External iterators: for language bindings

Open JervenBolleman opened this issue 5 years ago • 12 comments

The current handlegraph api, uses the internal iterator approach (e.g. for_each_edge etc.) This is very inconvenient when trying to use this code from python. The only way to move data from inside the iteratee to the outside is by running the iteration in a different thread than the major python code and pass out the required information using a shared blocking queue. As you can imagine this is a pain to program.

The python approach, like modern java is to use generators/streams instead. This works well in these languages because memory ownership is well defined. For C++ I imagine this approach was taken to make it easier to keep track of who owns what memory. Unfortunately, the push approach inherent in this solution is a pain to use from languages which like to use a pull model.

Could we have standard C++ iterators for each iteratee method?

JervenBolleman avatar Oct 29 '19 07:10 JervenBolleman

+1

subwaystation avatar Oct 29 '19 10:10 subwaystation

e.g. in SpOdgi we do the following

def handles(self):
        nodeId = self.odgi.min_node_id()
        
        maxNodeId = self.odgi.max_node_id()
        while (nodeId <= maxNodeId):
            if(self.odgi.has_node(nodeId)):
                nodeId=nodeId+1 
                yield self.odgi.get_handle(nodeId-1)

Which is the only way to iterate over all nodes in lazy way. But less than ideal performance wise.

JervenBolleman avatar Nov 21 '19 17:11 JervenBolleman

There is for_each_handle that does the same thing. Is that not working?

On Thu, Nov 21, 2019, 18:59 JervenBolleman [email protected] wrote:

e.g. in SpOdgi we do the following

def handles(self): nodeId = self.odgi.min_node_id()

    maxNodeId = self.odgi.max_node_id()
    while (nodeId <= maxNodeId):
        if(self.odgi.has_node(nodeId)):
            nodeId=nodeId+1
            yield self.odgi.get_handle(nodeId-1)

Which is the only way to iterate over all nodes in lazy way. But less than ideal performance wise.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/libhandlegraph/issues/33?email_source=notifications&email_token=AABDQEPMWJ5WWIVXTP4EG4TQU3D7JA5CNFSM4JGEY6WKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEE3DV5I#issuecomment-557202165, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEIFRUK5CBW72HTVK6TQU3D7JANCNFSM4JGEY6WA .

ekg avatar Nov 21 '19 18:11 ekg

@ekg no because you can't jump yield from within the lamda generator style, at least not in python.

So you end up having to either collect all the nodes so you can iterate over them python style at humongous memory costs. Or one has a silly iterator implementation like above.

JervenBolleman avatar Nov 21 '19 18:11 JervenBolleman

So I came by today to open this same issue again :) it would be really nice to have std::iterator for edges,handles and paths. I think these iterators can be basic forward_only types

JervenBolleman avatar Jun 23 '22 08:06 JervenBolleman

Would it be sufficient if we implemented something like the scan_path method for edges?

https://github.com/vgteam/libhandlegraph/blob/master/src/include/handlegraph/path_handle_graph.hpp#L161-L164

jeizenga avatar Jun 23 '22 16:06 jeizenga

Yes, I really think so :100:

On Thu, Jun 23, 2022, 18:04 Jordan Eizenga @.***> wrote:

Would it be sufficient if we implemented something like the scan_path method for edges?

https://github.com/vgteam/libhandlegraph/blob/master/src/include/handlegraph/path_handle_graph.hpp#L161-L164

— Reply to this email directly, view it on GitHub https://github.com/vgteam/libhandlegraph/issues/33#issuecomment-1164600552, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHQYFO4YQYKIY4ZTGQBEKDVQSDKRANCNFSM4JGEY6WA . You are receiving this because you authored the thread.Message ID: @.***>

JervenBolleman avatar Jun 23 '22 16:06 JervenBolleman

It's a tricky thing, because we haven't exposed an object that has knowledge about where it is in the container the way an iterator does (which the exception of step_handle_t). One easy option would be to copy the handles into a vector and then iterate over that. I think the biggest downside is that it would be pretty memory inefficient for any for-each loop that looped over a sizeable fraction of the graph (e.g. for_each_handle(), which does all nodes).

Thoughts?

jeizenga avatar Jun 27 '22 21:06 jeizenga

I know. Copying and then iterating is not an option beyond test cases. I really need something new here, sorry. I have one more horrid alternative and that is to spin up a thread and use a small shared queue to load items in a buffer and iterate over that. Still we can't use raii there and will need to free the copied objects.

JervenBolleman avatar Jun 28 '22 05:06 JervenBolleman

In the case of handles (and other stuff, probably), xg and PackedGraph store them in arrays. So a get_ith_handle() type method would be possible and efficienit (as would, by extension, supporting iterators that count i internally). I don't think you could get random access from HashGraph without some re-engineering, but I guess it could in theory expose its STL iterator (or some kind of wrapper for it).

glennhickey avatar Jun 28 '22 15:06 glennhickey

I just need a STL style iterator, no need for fast random access.

JervenBolleman avatar Jun 28 '22 19:06 JervenBolleman

I made a quick prototype of what a hack-y vector-based implementation could look like. Again, I have some concerns about efficiency for iterating over the entire graph, but this would probably work fine for traversing edges:

https://github.com/vgteam/libhandlegraph/blob/for-each/src/include/handlegraph/handle_graph.hpp#L256-L322

jeizenga avatar Jun 28 '22 20:06 jeizenga