
Speed up label count performance

Open rjurney opened this issue 7 years ago • 13 comments

Problem: Counting all vertices/edges by label is slow!

Users frequently need to count all nodes/edges of a type by label. This could be to verify data loading, to get acquainted with the data, or as part of a calculation. It is a frequent and essential operation, and yet it is incredibly slow on graphs of any real scale, because all vertices/edges must be traversed to do the counts.
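
For example, a typical check after a bulk load looks like this (the label 'person' is just a placeholder):

g = graph.traversal()
g.V().hasLabel('person').count()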

Solution: Keep label counts as global metadata

To solve this problem and greatly improve the user experience, I propose doing what many OLAP databases do: keep track of counts for entire tables/labels as data is written. I am willing to do the implementation; I just need pointers on how this might work.

Questions

Along these lines, I have a few questions:

  • What metadata is currently stored about a graph globally and how is it accessed?
  • How might I update this global state for each query involving addEdge/addVertex and any other call that impacts label counts?
  • How much of this is Gremlin and how much of it is JanusGraph?

Thanks!

rjurney avatar Feb 10 '18 00:02 rjurney

Hi, it's a good idea. For me, you can do it in four different ways:

  1. Keep track in a 'master table' as you proposed
  2. Use a fake property holding the label, add a mixed index on it, and implement the optimisation proposed in https://github.com/JanusGraph/janusgraph/issues/874
  3. Use an index as proposed in https://github.com/JanusGraph/janusgraph/issues/283 and add the optimisation proposed in https://github.com/JanusGraph/janusgraph/issues/874
  4. Query the backend directly, for example with CQL for Cassandra.

In my view, 1) will add too much lock latency, and for 4) I do not know if it is possible.

With the fake property, you can do it with the current code. To get your counts you can either query the search backend directly or use g.V().has('fakeLabel', 'label1').count(). If you want to speed up the Gremlin query, you can implement https://github.com/JanusGraph/janusgraph/issues/874.
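
A rough sketch of the fake-property setup via the management API (the names 'fakeLabel', 'byFakeLabel', and the 'search' index backend are only placeholders, nothing that exists today):

mgmt = graph.openManagement()
fakeLabel = mgmt.makePropertyKey('fakeLabel').dataType(String.class).make()
mgmt.buildIndex('byFakeLabel', Vertex.class).addKey(fakeLabel).buildMixedIndex('search')
mgmt.commit()

// every write mirrors the vertex label into the property
g.addV('person').property('fakeLabel', 'person').iterate()

// counting by "label" then becomes an indexed lookup
g.V().has('fakeLabel', 'person').count()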

If you do not want to use a fake property, you should implement https://github.com/JanusGraph/janusgraph/issues/283. A correct implementation of https://github.com/JanusGraph/janusgraph/issues/283 is not that easy, but with this feature JanusGraph can optimize many other queries.

davidclement90 avatar Feb 11 '18 15:02 davidclement90

With automatic schema generation, invisible schema vertices are also created in the graph. Could we add a count property to the schema vertex for the labels we want to count?

amcp avatar Feb 11 '18 15:02 amcp

I can understand the frustration on this one @rjurney. As you've probably seen, the recommended approach is to use TinkerPop OLAP to run any "analytical" queries, including counts, against large portions of a graph or all of it. That's all well and good, but it's another piece of machinery to get up and running unless your graph is small enough that you can start Spark in standalone mode. Janus does have its own graph computer, called Fulgora, that you could try out for counts:

import org.janusgraph.graphdb.olap.computer.*
a = graph.traversal().withComputer(FulgoraGraphComputer.class)
a.V().count()

With all that said, I think this will be a tricky one to solve when it comes to the distributed backend options. I'd imagine a Berkeley-specific implementation wouldn't be too bad, and the counts could be made accurate because they could be rolled into Berkeley's ACID transactions and everything happens locally. On the Cassandra/Scylla side, unless we introduce more locking, which would effectively serialize writes on a per-label basis, we could maybe do something with counters. But from what I've heard about counters, I don't think anyone could trust the result to be more than approximately correct, which doesn't help much with your data load verification use case.

As far as the mixed index approaches go @davidclement90, do you know how well Elasticsearch/Solr handles the incredibly low-selectivity indexes this case would lead to?

twilmes avatar Feb 11 '18 16:02 twilmes

At indexing time, Elasticsearch/Solr/Lucene do not care about low-cardinality fields. For example, when you add a new Person vertex, to the inverted index it is just another document with the value Person in the label field.

At search time, the biggest issue is conjunctions: "The way conjunctions work is by iterating over matches from the most selective clause and verifying whether other required clauses match too. This means that we need two operations to be fast in order for conjunctions to be fast as well:

  • iterating over all matches,
  • verifying whether a particular document matches.

" (https://www.elastic.co/blog/better-query-planning-for-range-queries-in-elasticsearch)

Let's take these two queries:

  1. All Person vertices => no conjunction, so OK
  2. All Person vertices whose name is Smith => a conjunction, so Elasticsearch/Solr/Lucene will search for all Smith documents (I assume that Smith is more selective than Person) and then check whether each doc has Person as its label. Strings use doc_values, and as you can see in the benchmark in https://www.elastic.co/blog/better-query-planning-for-range-queries-in-elasticsearch, doc_values performance is constant. (Both queries are sketched in Gremlin below.)
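
In Gremlin terms, with the fake-property idea from above (property and value names are only examples), the two queries look like:

// 1. all Person vertices: a single indexed term, no conjunction
g.V().has('fakeLabel', 'Person').count()

// 2. Person vertices named Smith: a conjunction of two clauses
g.V().has('fakeLabel', 'Person').has('name', 'Smith').count()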

davidclement90 avatar Feb 11 '18 19:02 davidclement90

I found a solution for my project. It needs testing, but it may not be too costly relative to the size of our app.

We have indexes, including one that will be useful for counting (on a property called "objectType", which is how we want to count our vertices), but a very large number of vertices (hundreds of millions). Counting with a has() filter, or with Fulgora, times out on very large objectType categories.

I will create a "janus_meta" table. Each row will represent a category of vertex we want to count, with a single value (the count). I will store the added counts in a static POJO. Then, at the end of the process, I will update the values in "janus_meta" by scanning the table, updating the numbers, and writing the update to HBase.

This is a temporary solution which, if it works, will allow us to count the vertices added to the graph after this update is installed.
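
A minimal sketch of the in-memory accumulation part, in Groovy (class, method, and table names are only illustrative; the actual HBase scan and write are left out):

import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.LongAdder

class LabelCounters {
    // one counter per objectType value, shared by the loading threads
    static final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>()

    // called alongside every addVertex for the given objectType
    static void increment(String objectType) {
        counts.computeIfAbsent(objectType, { k -> new LongAdder() }).increment()
    }
}

// at the end of the load, read counts.get(type).sum() for each type
// and add it to the value already stored in the matching "janus_meta" row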

CPorrot avatar Jun 29 '18 12:06 CPorrot

We are using Elasticsearch as the index storage. Basically, for counting we use a specific index. An example:

janusGraph.indexQuery("my_index", "v.\"my_property\":>=0").vertexTotals().intValue();
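
For anyone trying this: it assumes a mixed index on the search backend along these lines (the index, property, and backend names here are placeholders):

mgmt = graph.openManagement()
myProperty = mgmt.makePropertyKey('my_property').dataType(Integer.class).make()
mgmt.buildIndex('my_index', Vertex.class).addKey(myProperty).buildMixedIndex('search')
mgmt.commit()

The total then comes back from the index backend rather than from iterating over the vertices, which is why it returns quickly.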

porunov avatar Jun 29 '18 18:06 porunov

Thanks for the tip. In the case of my project, since our graph only uses composite indexes, we only use a storage backend, HBase, at the moment.

CPorrot avatar Jul 02 '18 08:07 CPorrot

@twilmes For me,

import org.janusgraph.graphdb.olap.computer.*
a = graph.traversal().withComputer(FulgoraGraphComputer.class)
a.V().count()

works even slower than g.V().count()

canbax avatar Jan 31 '22 08:01 canbax

@canbax How much slower is it? Is it significantly slower, or is the difference minimal?

Btw there is a problem with this approach, which will be fixed in the 1.0.0 release (https://github.com/JanusGraph/janusgraph/pull/2965).

Right now my suggestion is to use SparkGraphComputer rather than FulgoraGraphComputer.

Here is an example using Cassandra:

graph = GraphFactory.open("conf/hadoop-graph/read-cql.properties")
g = graph.traversal().withComputer(SparkGraphComputer.class)
g.V().count()

li-boxuan avatar Feb 01 '22 05:02 li-boxuan

@li-boxuan I didn't measure the time. I executed it from the Gremlin shell and just waited for it. It seems like SparkGraphComputer is also slower: it takes 47 seconds to count 197045 vertices.

With

graph = JanusGraphFactory.open('conf/janusgraph-cql-es.properties')
g = graph.traversal()
g.V().count()

it takes 11.5 seconds.

My problem is not just counting by type. I want to count the number of results for EVERY query. This is necessary to know how much data is in the backend. I want to tell the user, "I'm showing you 10, but there are X in the backend".

canbax avatar Feb 01 '22 05:02 canbax

@canbax SparkGraphComputer launches a Spark instance that has overhead. It is more suitable for OLAP queries rather than OLTP queries.

In your case, you might want to build JanusGraph by yourself on top of https://github.com/JanusGraph/janusgraph/pull/2965.

Then you can continue doing

a = graph.traversal().withComputer() // equivalent to a = graph.traversal().withComputer(FulgoraGraphComputer.class)
a.V().count()

li-boxuan avatar Feb 01 '22 14:02 li-boxuan

@li-boxuan Thank you. a = graph.traversal().withComputer() was still slower. I have stopped searching for an answer; instead, I used the limit step in my Gremlin queries. I think there is no solution for my case because I want to get the count for all my queries, and my queries are dynamically generated with multiple conditions.
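
For reference, this is the kind of thing I mean (the property name and value are made up):

// show only the first page instead of counting everything
g.V().has('type', 'person').limit(10).valueMap()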

canbax avatar Feb 02 '22 08:02 canbax

@canbax graph.traversal().withComputer() is equivalent to graph.traversal().withComputer(FulgoraGraphComputer.class), so if you are using an officially released JanusGraph version, then the problem mentioned in #2965 exists. I think it is still worth trying by building JanusGraph on your own on top of #2965. It's up to you but in case you need help with that, let me know.

li-boxuan avatar Feb 02 '22 13:02 li-boxuan