arangodb-tinkerpop-provider icon indicating copy to clipboard operation
arangodb-tinkerpop-provider copied to clipboard

Properties are not reflected inside Collections

Open a-marcel opened this issue 5 years ago • 24 comments

Hello,

if i add a Vertex like mentioned in the readme:

Vertex v2 = g.addV("software").property("name", "lop").property("lang", "java").next();

this properties are not visible inside the collection "graph_software", only inside the collection "graph_ELEMENT-PROPERTIES".

Marcel

a-marcel avatar Jan 04 '19 11:01 a-marcel

Hi,

This is by design in the current version. The decision was made based on the design of the Tinkerpop API, in which properties are treated almost the same as nodes (i.e. properties can have properties). When you add this to how VelociPack serialises classes into the DB, this nested JSON structures become a bit harder to handle in order to query and retrieve properties. having properties stored in a separate collection makes handling vertex properties and properties' properties a lot easier.

arcanefoam avatar Jan 04 '19 14:01 arcanefoam

I am currently thinking on how we could store the vertex properties in the vertex document (as well as the properties' properties) and make it work.

arcanefoam avatar Jan 04 '19 14:01 arcanefoam

For me it's fine and i understand that it's not easy to store the properties -> properties. I just was surprised to see an "{}" inside the collection. But this will make it a little bit complicated to find vertex/documents with the web UI if you just use the graph support from ArangoDB.

OrientDB as an example show the properties inside their Document Database, but even there i don't know if they support nested Properties.

Marcel

a-marcel avatar Jan 04 '19 14:01 a-marcel

Yes, I agree this makes it harder to "look" at the DB and know if Tinkerpop is behaving as expected. That said, I was not considering this case when developing this, i.e. I expected users to stay in the Tinkerpop world. For looking for properties from the Arango UI you would need to do it via AQL queries, using the graph api:

for e, n in graph....

I could add some examples to the wiki for users that want to double check their DB.

arcanefoam avatar Jan 04 '19 14:01 arcanefoam

Perhaps, we could store the Vertex properties in the collection and then use the separate node approach for properties properties (probably using an edge label to indicate the vertex's property the property belongs to). This will solve the nested properties issue, allow us to use indexes and make it easier to see vertex properties in the DB.

arcanefoam avatar Jan 04 '19 14:01 arcanefoam

Yes, your're right. I'll stay inside the thinkerpop world - but you have such a nice UI ... why not use this for debugging / finding usage.

a-marcel avatar Jan 04 '19 15:01 a-marcel

For debugging I would strongly recommend using the Tinkerpop console. The driver works with it, so you can test your Tinkerpop queries there before adding them to your Java implementation.

arcanefoam avatar Jan 04 '19 15:01 arcanefoam

@arcanefoam Have you thought the final solution of how to store the properties inside the Vertex/Edge item directly? Also, what kind of functionality must the properties have to support TinkerPop API other than that subproperties are supported? I was thinking I might help here as well, because we heavily use ArangoDB, however for new UI we are developing for our app, we would leverage more of the Graph API (such as Tinkerpop Gremlin language) rather than still repeating AQL queries. We will have UI for creating Graph queries by user. Our current implementation have all properties of an edge (e.g. startDate, endDate) or properties of a vertex at the documents itseld. We usually have around 20-40 collections with pretty dense graph and I am a bit worried that adding this extra ELEMENT-PROPERTIES collection can make the performance a little worse not to mention that we would have to rewrite all AQL queries (e.g. for some cron jobs, import data jobs etc...). The Graph API we want to use is just for Front End, the rest modules using the Arango will still use the standard AQL and API (e.g. Java Driver using transactions..).

fdominik avatar Jan 20 '19 19:01 fdominik

So far this is what I have envisioned:

Perhaps, we could store the Vertex properties in the collection and then use the separate node approach for properties properties (probably using an edge label to indicate the vertex's property the property belongs to). This will solve the nested properties issue, allow us to use indexes and make it easier to see vertex properties in the DB.

The tricky part of this approach is that Tinkerpop expects perfect type match when set->store->retrieve, that is, if I set g.V(100).property("size", 1.0), when you read the property it is expecting to see a Double. However, the way arango works it actually returns an Integer. Technically there is no difference in the value, but this is something that gave me a lot of headaches when running the Tinkerpop test suite (originally I stored properties and properties' properties in the same document).

One option is to provide our own VelociPack serializer/deserializer that adds this type information for each attribute, e.g. an extra attribute size_type (or something). When the user selects the new config flag to reuse existing Graph data this type storage/retrieval could be skipped.

That being said, my main concern with this implementation is to make as fully compliant to Tinkerpop as possible. So although your scenario is legit, any solution will have to also work with the Tinkerpop test suite.

arcanefoam avatar Jan 20 '19 20:01 arcanefoam

Hm I was thinking more of storing the property similar like the GraphSON structure. Because Arango also uses "JSON", so we could be inspired by GraphSON and use kind of "our own system" fields...

[{
	"_key": "p2",
	"_id": "person/p2",
	"name": {
		"@Value": "Dominik",
		"@Type": "java.lang.String"
	},
	"age": [{
            "id": {
                "@type": "java.lang.Integer",
                "@value": 1
            },
            "value": {
                "@type": "java.lang.Integer",
                "@value": 29
            }
        }]
}]

More is at http://tinkerpop.apache.org/docs/3.3.3/dev/io/#_vertex_3 What would you think of such?

fdominik avatar Jan 20 '19 22:01 fdominik

That is an interesting idea, I am guessing you are already using something similar in your app? The only question that remains is if this way we can still use properties, e.g. name, age, as indexes. Would need to double check if Arango works with this structure, cause Tinkerpop supports indexes and moving the properties back to the node/edge document should also consider providing support for this (currently not supported). So to your question:

Also, what kind of functionality must the properties have to support TinkerPop API other than that subproperties are supported?

We would definitely want properties to be indexable.

arcanefoam avatar Jan 20 '19 23:01 arcanefoam

So I tried the indexing... I created a FULLTEXT index in ArangoDB using the field: name.@value

and then using AQL I was able to use the fulltext index:

for a IN FULLTEXT(person, "name.@value", "Dominik")
return a

Is this what it is supposed to do or is there something more I am missing? How do I test a property on property, is it something like:

{"address":{
 "street":"A",
 "country":{
 "code": "cz",
 "name ": "Czech Republic"
}
}}

Because if it is something like address.country.code, it is very easy to add the index, you just add address.country.code.@value in the index definition. Or how would the properties be represented? Would they be in the same object? I am probably missing the Gremlin interpretation of the property on property... I can't figure out which Step would I need to call in order to create an object with nested properties + what Step would I need to retrieve that particular property and do some filtering on it.

fdominik avatar Jan 28 '19 03:01 fdominik

Hi, Indeed it is easy to create indexes in Arango, and nested properties are also supported.

To answer your question

I can't figure out which Step would I need to call in order to create an object with nested properties

When you add properties, e.g. via g.V().outE('knows').addV().property('name','nothing') the API of the property method actually accepts a varg of "keyValues": default GraphTraversal<S,E> property(Object key, Object value, Object... keyValues). These keyValues should be an even number of arguments that represent name-value pairs of property's property, e.g.

g.V().outE('knows').addV().property('name','nothing', 'size', 10, 'empty', false)

adds the size and empty properties to the 'name' property.

  • what Step would I need to retrieve that particular property and do some filtering on it.

Since properties' properties work very similar to vertex properties, once you retrieve a property, you can use the has filters on it, e.g.

g.V(1).properties('name').has('size', 10)...

In both cases its more of a problem on how we represent vertex properties and properties' properties in the Java implementation. Currently both are represented as individual nodes in the ELEMENT_PROPERTIES collection (linked via ELEMENT_HAS_PROPERTY edges). This design does not allow properties to be indexed. Ideally both vertex properties and properties properties should be stored in the vertex, so they are serialised into a single ArangoDB document and then indexable. The design consideration here is how to store them in the vertex in Java. First thing that comes to mind is a Map of : entries.... but then we would also probably need a map of :<JavaType> and another of :Mapname:value for properties properties... and then if we want to keep the properties properties types too....

Plus, we have to decide in either a structure "simple" enough so the default velocipack serialisation/deserialisation works for us or something a bit more efficient at runtime but for which we would need a custom velocipak serializer (e.g. so that the documents in the DB look like GraphSON).

arcanefoam avatar Jan 28 '19 10:01 arcanefoam

So I have created a new branch in my fork of this repo: https://github.com/fdominik/arangodb-tinkerpop-provider/commits/feature_PropertiesInsideElements where I published the draft. It is not working yet, it should serve more for the discussion here, because writing a code in a comment isn't really convenient :) So I've a ArangoTestvertex which gets serialized by Vpack to this structure:

{
	"@properties": {
		"name": {
			"@properties": {
				"@type": "String",
				"@value": "John"
			},
			"_id": "1"
		},
		"age": {
			"@properties2": [{
				"age": {
					"@properties": {
						"@type": "Integer",
						"@value": 29
					},
					"_id": "1"
				}
			},
			{
				"age": {
					"@properties": {
						"@type": "Integer",
						"@value": 18
					},
					"_id": "2"
				}
			}]
		}
	},
	"_id": "P\/1",
	"_key": "1",
	"_rev": "AXZ"
}

So it isn't the final structure, but as an example it can serve. Take a look at VPackTestOfNestedProperties Test Class, which has the serializer from POJO to VPack and then transforming to JSON using VPackParser...

fdominik avatar Jan 30 '19 01:01 fdominik

Hi, I've been working on this too and came with a somewhat similar idea. What started to trouble me after the fact, is that this is not really compatible with an all and any existing document structures. I think for the most part we will find people using a simple JSON structure to save their data (this stems of the fact that they will simply let velocipack do the standard serialization (In this case I can imagine address was serialized from a java class, as well as each hobby):

{
  "_id" : "myusers/3456789",
  "_key" : "3456789",
  "_rev" : "14253647",
  "firstName" : "John",
  "lastName" : "Doe",
  "address" : {
    "street" : "Road To Nowhere 1",
    "city" : "Gotham"
  },
  "hobbies" : [
    {name: "swimming", howFavorite: 10},
    {name: "biking", howFavorite: 6},
    {name: "programming", howFavorite: 4}
  ]
}

Lets imagine what we would need to store just the name information, so it works with Tinkerpop:

{
  "_id" : "myusers/3456789",
  "_key" : "3456789",
  "_rev" : "14253647",
  "firstName" : {
      "value": "John",
      "cardinality": "single",
      "type": "lang.java.String",
      "properties": [ { } ] 
  },
  "lastName" : {
      "value": "Dow",
      "cardinality": "single",
      "type": "lang.java.String",
      "properties": [ { } ]
  }
}

Things get trickier with attributes that have multiple values because for each value you can have a different set of nested properties. Lets simplify the hobbies and just keep the hobby name as value and use the howFavorite as a nested property:

{
  ...
  "hobbies" : {
      "value": ["swimming", "biking", "programming"],
      "cardinality": "list",
      "type": ["String", "String", "String"],
      "properties": [
          {"howFavorite": {
              "value": 10,
              "type": "Integer"}
          },
          {"howFavorite": {
              "value": 6,
              "type": "Integer"}
          },
          {"howFavorite": {
              "value": 10,
              "type": "Integer"},
           "since":  {
              "value": 1996,
              "type": "Integer"}
          },
      ]
  ]
}

For persistence I am using position to match types and nested properties. The point I am trying to make is that this complex structures are going to be very likely different to what most people have in their existing documents, making the reuse of existing graphs very difficult.

As I see it we have two approaches.

  1. Use our own complex structure, which makes reusing existing collections impossible. In this case, we could allow (somehow?) for users to provide their own velocipack serializers so that they can reuse what ever structure they have in their documents.

  2. Rethink our structure so the Tinkerpop metadata is stored separately in the document. This way, we can read a simple JSON and infer the missing information, e.g. use whatever types velocipack chooses, for attributes with many values we can pick the list cardinality, etc. Note that in this basic JSON structure vertex properties' properties can not exist (i.e. addresses and hobbies in the previous example would be stores as object values). When the "reuse existing collections" flag is set, we could also disable storing the Tinkerpop metadata, so we do not add "noise" to existing data.

I am more inclined towards 2, as it makes it easier for users to reuse their existing data. Moving the Tinkerpop metadata to a separate partition would look like:

{
  "_id" : "myusers/3456789",
  "_key" : "3456789",
  "_rev" : "14253647",
  "firstName" : "John",
  "lastName" : "Doe",
  "address" : {
    "street" : "Road To Nowhere 1",
    "city" : "Gotham"
  },
  "hobbies" : [
    {name: "swimming", howFavorite: 10},
    {name: "biking", howFavorite: 6},
    {name: "programming", howFavorite: 4}
  ]
  "!tinkerpop": {
    "firstName" : {
      "cardinality": "single",
      "type": "lang.java.String",
      "properties": [ { } ] 
      },
      "lastName" : {
        "cardinality": "single",
        "type": "lang.java.String",
        "properties": [ { } ]
      },
      "address" : {
        "cardinality": "single",
        "type": "Object",
        "properties": [ { } ]
      },
      "hobbies" : {
        "cardinality": "list",
        "type": ["String", "String", "String"],
        "properties": [
          { },
          { },
          {"since":  {
              "value": 1996,
              "type": "Integer"}
          },
      ]
}

Note that I have used an exclamation mark to prefix the tinkerpop section because "@" is used in AQL and could cause issues when building the queries we use for accessing the DB.

arcanefoam avatar Jan 30 '19 10:01 arcanefoam

Wow, the option 2 is indeed much better. I was worried the same as you are, that we would build still a new structure which would not fit the existing documents. The option you thought of seems more reasonable...

In the mean time, there is one more topic to discuss. Today I have did some more research on how ArangoDB tinkerpop driver creates the AQL queries and just found something, maybe a little, disappointing? E.g. when I do: g.V().has("person"), which is just give me "all persons", it first gets ALL vertices from ALL collections and these get filtered later in Gremlin to only persons.

And this goes even further. If you want properties of properties and you would have 5 persons, these following AQL queries would be executed separately:

  1. get all queries = 1 query
  2. for each person found (let's assume 5 persons found), get it's properties = 5 queries
  3. for each matched property, get nested properties = (if each 5 person has the property) 5 queries.

Making total of 11 queries just to do something like: g.V().properties("name").has(size, 10)

I was really negatively surprised by this inefficiency, because it can be written in a single and FAST performing AQL.

So our aim would have to be also to rewrite the querying a little... I didn't take a look on other queries if e.g. edges are there, I was now studying only the properties part of it... For example if the query starts with g.V().has("something"), we should

fdominik avatar Jan 30 '19 10:01 fdominik

Can you please open a separate bug for discussing query optimisation? That way we can keep discussions more organised (I will delete these posts as soon as the new bug is open).

arcanefoam avatar Jan 30 '19 10:01 arcanefoam

You can find an initial attempt at this on branch feature/38. There are a lot of things to consider...

  1. Since properties are now part of the vertex, doing g.V() will result in all the vertices and its properties being loaded into memory... the never-ending performance vs. memory wars...
  2. I am using streams to workaround casting... however this makes the properties() returned Iterators not suitable for removal while iterating...
  3. Since all vertex information is loaded into memory we need to implement some caching so we can synchronise. E.g. (from VertexPropertyTests.shouldRemoveMultiPropertiesWhenVerticesAreRemoved test)
final Vertex marko = graph.addVertex("name", "marko", "name", "okram");
for (int i = 0; i < 100; i++) {
  marko.property(VertexProperty.Cardinality.list, "name", "Remove-" + String.valueOf(i));
}
// Remove all "name" properties that start with "Remove-"
g.V().properties("name").has(T.value, P.test((a, b) -> ((String) a).startsWith((String) b), "Remove-")).forEachRemaining(Property::remove);
assertEquals(2, IteratorUtils.count(marko.properties("name")));

In this case g.V() goes to the DB and effectively loads and modifies a new copy of the marko vertex. Since marko was in memory the assertion fails because that copy is stale. So g.V() should do some smart caching.... which brings me to issue 1.... which is tricky to implement.

A complete implementation of this new approach to properties opens some cans of worms and might take a while to get right, IMHO.

arcanefoam avatar Feb 01 '19 09:02 arcanefoam

Amazing work Horacio. I was taking a look on it. Some things work, but still some bugs could be found. I would be interested if I should create a new issue for these or write as a comment here... And also, I am completely lost now in the error...

E.g. I did a quick test, creating 2 entities using:

Vertex v2 = g.addV("person").property("name", "lop").property("lang", "java").next();
Vertex v3 = g.addV("person").property("name", "nothing", "size", 10, "empty", false).next();

And then I searching them using:

g.V().hasLabel("person").properties("name").has("nothing").toList();

This one doesn't work and fails (even if I try to get the simpler vertex 'lop'). However if I try to get them using .next();, then it works. The StacKTraces:

com.arangodb.ArangoDBException: com.arangodb.velocypack.exception.VPackParserException: com.arangodb.velocypack.exception.VPackValueTypeException: Expecting type OBJECT

	at com.arangodb.internal.util.ArangoDeserializerImpl.deserialize(ArangoDeserializerImpl.java:59)
	at com.arangodb.internal.util.DefaultArangoSerialization.deserialize(DefaultArangoSerialization.java:58)
	at com.arangodb.internal.cursor.ArangoCursorIterator.deserialize(ArangoCursorIterator.java:79)
	at com.arangodb.internal.cursor.ArangoCursorIterator.next(ArangoCursorIterator.java:75)
	at com.arangodb.internal.cursor.ArangoCursorImpl.next(ArangoCursorImpl.java:112)
	at com.arangodb.tinkerpop.gremlin.client.ArangoDBIterator.next(ArangoDBIterator.java:56)
Caused by: com.arangodb.velocypack.exception.VPackParserException: com.arangodb.velocypack.exception.VPackValueTypeException: Expecting type OBJECT
	at com.arangodb.velocypack.VPack.deserialize(VPack.java:417)
	at com.arangodb.internal.util.ArangoDeserializerImpl.deserialize(ArangoDeserializerImpl.java:55)
	... 26 more
Caused by: com.arangodb.velocypack.exception.VPackValueTypeException: Expecting type OBJECT
	at com.arangodb.velocypack.VPackSlice.objectIterator(VPackSlice.java:772)
	at com.arangodb.tinkerpop.gremlin.client.ArangoDBVertexVPack.deserialize(ArangoDBVertexVPack.java:201)
	at com.arangodb.tinkerpop.gremlin.client.ArangoDBVertexVPack.deserialize(ArangoDBVertexVPack.java:102)
	at com.arangodb.velocypack.VPack.getValue(VPack.java:578)
	at com.arangodb.velocypack.VPack.deserialize(VPack.java:415)

I was trying to find out what is wrong, but really wasn't successfull. I found out that it fails in VPackSlice class at line 677:

if ((long)index >= n) {
        throw new IndexOutOfBoundsException();
      }

But I don't have a clue what the method getNthOffset(int index) should do 😄

fdominik avatar Feb 06 '19 00:02 fdominik

Regarding your issue of g.V() going to the memory with all the vertices and properties. This is connected to more generic performance issue of how the ArangoDB Tinkerpop implementation now creates the AQL. I've been thinking of it for a while...

I would suggest that we do more like bottom-up parser, which could e.g. take a look on the whole query and see what are the second-level operators (V() is the first level) and if they somehow filter. If they do, the V() wouldn't load all vertices + properties, but rather apply these second-level operators immediately and thus reducing the amount of returning vertices.

However if after V() there is no operator which eliminates the vertices, but some other (maybe some like .toList()) than it would proceed with loading all vertices.

Of course, that best approach would be to create a Grammar for Tinkerpop gremlin and Syntax parser of it, which would then be able to create AQL queries much easily.. But Gremlin is quite complex language and creating the grammar would take some significant time. I found some old scholar article on this: https://arxiv.org/pdf/1508.03843.pdf

fdominik avatar Feb 06 '19 00:02 fdominik

Amazing work Horacio. I was taking a look on it. Some things work, but still some bugs could be found.

I have been running the test suite more thoroughly and been fixes some bugs here and there. I will let you know when it is at some decent stage for more testing.

I am a bit lost at how to divide this into more meaningful bugs so for the time being I will continue to address the g.V() issues as part of the aftermath of the changes introduced by #38.

One thing you need to consider is that as you mentioned, we have little power on how the gremlin statements get implemented, i.e. we can not (easily) change the hasLabel() implementation. What we need to focus on is that the root of all gremlin queries is an iterator -> g.V() returns an iterator and then each of its elements gets processed by the "pipe" of your statement. So what we need to do is make sure those iterators are very smart :O.

arcanefoam avatar Feb 06 '19 10:02 arcanefoam

Thanks for all your work on this driver, is this enhancement still going ahead?

tkimmm avatar Nov 08 '20 09:11 tkimmm

Hi @tkimmm, shortly after work started on this feature I changed jobs and I can no longer spend work ours working on the driver. The change also meant I found my self with less free time. However, I am starting to get organized with my OSS development and I expect that I will be able to commit some of my own time to completing this feature.

I am also happy to give pointers if anyone really wants this and would like to take over the feature development. I think 90% of the work is done.

arcanefoam avatar Nov 18 '20 13:11 arcanefoam

I'd love to help but unfortunately I am not a Java developer. I can understand, life happens. At this point I'm using python to insert objects into the database but still need to configure the driver to recognise existing collections, in the .properties file. Haven't seen much success at the moment unfortunately.

tkimmm avatar Nov 24 '20 14:11 tkimmm