cassovary
cassovary copied to clipboard
Create labeled graph in com.twitter.cassovary.graph
Right now, the graph data structure can not have labels on nodes or edges. Allow arbitrary labels in a new class. As a special case, allow only one int to be associated with each edge and each node.
Hi, I'd be interested in this feature. I'm looking for options to implement large graph traversals in an RDF setting which requires that edges have labels, however cassovary not being able to handle this is a deal breaker.
How much effort would it require to develop this feature? I'm assuming I'd be the one working on it unless there are other plans. To be honest, I haven't worked with scala before, however I think I can pick it up pretty easily.
Hi schwarzmx, I submitted a pull request, which includes support for annotating each node with a label, two days ago. Feel free to try it out and comment on whether it meets your needs. I'm likely to follow it up with several more labeling features and examples soon. https://github.com/twitter/cassovary/pull/39
Great! I hadn't noticed. I will check it out and let you know how it goes.
Thanks, Fernando
Fernando, I thought a little more about your interest in edge labels. Edge labels are not implemented in my pull request and they're not on my current development horizon. If the project admins like my refactoring of parsing code (remains to be seen), then here's how I might suggest implementing edge labels. Similar to what I did for vertex reading, you could pass a new EdgeReaderFactory to AdjacencyListGraphReaders' OneShardReader's constructor, and refactor OneShardReader's next() method to pass lines.next.trim to an EdgeReader. A new LabeledEdgeReader class could handle reading an optional label identifier after each listed edge id. A SimpleEdgeReader class could be used as the default, to preserve existing unlabeled edge behavior. Additional code would be needed to store the edge labels, for both the shared array and non-shared array case. One way would be to store them would be in parallel arrays to the neighbor ids. I think project admins will consider it important that edge label support doesn't impose (more than negligible) extra memory overhead when it's not used.
David, thanks for following up, I appreciate it. I'll see what the project admins decide on this and then see what I can do for the edge labels.
I still need to look at Dave's pull request, but his idea of implementing edge labels as parallel arrays to neighbor ids sounds good.
On Nov 4, 2013, at 8:27 PM, Fernando Hernandez [email protected] wrote:
David, thanks for following up, I appreciate it. I'll see what the project admins decide on this and then see what I can do for the edge labels.
— Reply to this email directly or view it on GitHub.
Hi guys, what about this issue?
Hi, sorry for not coming back to this. I had to stop working on this for other reasons. I don't think I will be working on it in the near future.
I need to think more about this but it seems like this will require a reorganization of the code, to define separate types for vertex and edge and then a graph in terms of those types. This will make it flexible to define labeled vertices and edges. One will need to be careful though so as to be space efficient in the specialized case of unlabeled edges (the case as of now, which makes Cassovary space efficient).
I don't think making types of nodes and edges very general is a good direction. I've used scala-graph.org and it becomes slow for 20K nodes and is hard to learn if you don't know scala type system very well (try writing a method that takes a undirected graph with nodes of given type and returns same graph weighted with nodes of the same type and random weights or extending a graph).
Maybe as with the nodeRenumberer that can take Long or String, we can have edgeRenumberer Int <-> Long/String and use parallel edges that keep only ints? This limits the number of different edge labels to ~2*MAX_INT, but I think in most applications it is not a problem.
What you are saying makes sense in that making it general is likely to make it much less efficient and less readable/extendable. However, I am also interested in enabling richer functionality by users (such as @schwarzmx above). Another use case that comes up often in my experience is to add a weight to every edge (ideally weight is of type 'double', but the type could be specialized to int/short/byte). Currently this is not possible in Cassovary.
I like the general idea of encoding nodes and edges in ids and allowing richer data to be stored indexed by those ids. To start with, what do you think about two separate changes:
- Allowing a facility for nodes to carry arbitrary data by storing that data indexed by internal node ids. This is separate from allowing external node ids to be Long or String which is purely for the convenience of the user.
- Allowing a facility for long/int/short/byte -- this should be possible using specialized -- edge ids to be stored in a parallel array to neighbor node ids. These edge ids could be either directly used (such as for keeping weights) or be indexed into separate collection of edge data.