Best practices for indices in Titan
My understanding is that Titan only supports a single global collection of indices ("key indices") as opposed to a set of named manual indices. I noticed an example of how to use these in bulbs (https://gist.github.com/espeed/3938820), but what is the best practice when a property has the same name in more than one model?
For example, if I have two models, Customer and Business, both of which have a property telephone_num, ideally I'd like to be able to query only the customers: g.build_proxy(Customer).index.lookup(telephone_num="0234928743") without having to do any subsequent type filtering. I've tried the example above with a 'name' property that is shared between my models. If a customer and a business have the same name, the result contains a mixture of customer and business objects.
I could prefix the property names with the model name, but I'd like to avoid that if possible.
I'm reluctant to significantly modify bulbs myself to do this but am open to contributing if it won't be too time consuming. How about some python magic that overrides create(), save(), index.lookup() etc. to prepend every model property name with the model's name(e.g. customer:telephone_num, business:telephone_num) but hides these details from the query and model definition level? My guess is that you'd prefer to do this inside the titan client class. Also, given that key indices have to be created upfront, a script that inspects the models and creates all relevant key indexes might be a good addition.
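To make the prefixing idea a bit more concrete, here is a rough sketch in plain Python with no bulbs dependency. The helper names (to_index_keys, from_index_keys) and the ':' separator are made up for illustration and are not part of bulbs; a real version would have to hook into create(), save() and index.lookup().

```python
def to_index_keys(model_name, props, sep=":"):
    """Prefix each property name with the lower-cased model name."""
    prefix = model_name.lower() + sep
    return {prefix + key: value for key, value in props.items()}


def from_index_keys(model_name, stored, sep=":"):
    """Strip the model prefix again; keys without the prefix are left untouched."""
    prefix = model_name.lower() + sep
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in stored.items()
    }


# What would be written to the graph vs. what the model layer would see.
stored = to_index_keys("Customer", {"telephone_num": "0234928743"})
print(stored)
print(from_index_keys("Customer", stored))
```

The point is that the model definition and the query would still say telephone_num, and only the stored key would carry the prefix.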
In the meantime, as a quick fix, do you have any suggestions about how to progress with using bulbs and titan in their current state?
I realise you've already suggested the prefix-key idea before (https://groups.google.com/forum/#!msg/gremlin-users/msZRvK_9bAk/K4TqIRLUG3YJ). I'm curious to know whether you're still considering it or have decided to abandon it.
Do you think bulbs should stop you from creating separate model-specific index proxies when using Titan (e.g. with g.build_proxy(MyModel) or g.add_proxy("my_models", MyModel)), since they don't really mean anything? Given that g.build_proxy(Customer).index.lookup(name='myname') and g.build_proxy(Business).index.lookup(name='myname') return the same thing, it might be a good idea to prohibit creating these separate index proxies when using Titan, because it's a little misleading.
Funny thing, I was wondering the same thing this weekend.
I figured out that building an index on keys that have a prefix seems quite a good idea:
- It looks very much like a good old SQL index on a table. And in most cases you can only use one index at a time for a query, so g.V("Customer_name", "Joe") seems a good way to start.
- You can start with all your nodes having regular properties. If you figure out you need an index, it's not too late: just run a script to duplicate the original "name" property into a "Customer_name" property for each Customer node.
- You can probably handle that at the OGM level. Actually, it's a patch I am working on for bulbs (based on the indexed=True property).
Hope it helps,
Cheers,
Thanks for the response adurieux, I think I'm going to go for using prefixes like 'customer_name' on my model properties.
As for my earlier comment about model proxies being unnecessary in the context of Titan, at the very least they do seem to be necessary for creating model instances. Doing a lookup (with graph.vertices.index.lookup(customer_name='blah')) returns a collection of Customer instances if they have been created with graph.build_proxy(Customer).create(customer_name='blah'), but returns a collection of Vertex instances when they have been created with graph.vertices.create(customer_name='blah').
On the other hand, I don't see any difference between graph.build_proxy(Customer).index.lookup(customer_name='blah') and graph.vertices.index.lookup(customer_name='blah') because they will both return a mixture of types if multiple models contain a property named 'customer_name'. Maybe the former should perform some filtering based on type?
Example:
cust = graph.vertices.create(customer_name='mycust')
graph.build_proxy(Customer).create(customer_name='mycust')
# returns a Customer and a Vertex
retrieved_custs = list(graph.vertices.index.lookup(customer_name='mycust'))
# same as above
retrieved_custs2 = list(graph.build_proxy(Customer).index.lookup(customer_name='mycust'))
# Business doesn't have this property, but the lookup still goes ahead.
# This also returns a list with a Customer and a Vertex.
retrieved_custs3 = list(graph.build_proxy(Business).index.lookup(customer_name='mycust'))
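Until the proxy does something like this itself, the type filtering can be done by hand on the client side. A minimal sketch, using stand-in classes rather than the real bulbs ones (filter_by_model is a hypothetical helper, not a bulbs function):

```python
class Vertex(object):
    """Stand-in for a generic vertex (not the real bulbs class)."""

    def __init__(self, **props):
        self.__dict__.update(props)


class Customer(Vertex):
    element_type = "customer"


class Business(Vertex):
    element_type = "business"


def filter_by_model(results, model):
    """Keep only the results that are instances of `model`."""
    return [r for r in results if isinstance(r, model)]


# Mimics the mixed result of the lookups above: one Customer, one plain Vertex.
mixed = [Customer(customer_name="mycust"), Vertex(customer_name="mycust")]
only_customers = filter_by_model(mixed, Customer)
print(len(only_customers))
```

This still fetches everything the key index returns and throws away the rest, so it only fixes the result type, not the cost of the lookup.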
I ended up using unique names for all properties.
To create the indexes, originally I thought it would be sufficient to call graph.createKeyIndex('property_name') for each property before adding any data to the graph and then using bulbs as normal. This didn't work for me. So instead I defined the types like this:
g.makeType().name("job_name").dataType(String.class).indexed(Vertex.class).unique(Direction.OUT).makePropertyKey()
Where "job_name" is a property on one of my bulbs model classes:
class Job(Node):
    job_name = String(nullable=False, indexed=True)  # not sure indexed=True is necessary here
I'm going to write a simple script that inspects my model files for all properties and create an indexed type for each.
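Roughly what I have in mind for that script, as a sketch only. The Property, String and Integer classes below are stand-ins so the snippet runs on its own (they are not the bulbs classes), the java_type mapping is my assumption, and key_index_statements is a made-up name. It just prints Groovy lines of the same shape as the makeType() call above, to be pasted into the Gremlin console.

```python
class Property(object):
    """Stand-in for a bulbs property type (not the real class)."""
    java_type = "Object"

    def __init__(self, nullable=True, indexed=False):
        self.nullable = nullable
        self.indexed = indexed


class String(Property):
    java_type = "String"


class Integer(Property):
    java_type = "Integer"


class Job(object):
    """Stand-in model; a real one would subclass the bulbs Node."""
    job_name = String(nullable=False, indexed=True)
    priority = Integer()


def key_index_statements(model):
    """Return one Groovy makeType() line per Property attribute on `model`."""
    lines = []
    for name, attr in sorted(vars(model).items()):
        if isinstance(attr, Property):
            lines.append(
                'g.makeType().name("%s").dataType(%s.class)'
                ".indexed(Vertex.class).unique(Direction.OUT).makePropertyKey()"
                % (name, attr.java_type)
            )
    return lines


for line in key_index_statements(Job):
    print(line)
```

It only looks at attributes defined directly on the class, so properties inherited from a base model would need extra handling.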
unique(Direction.OUT) seems to be required for it to work. As far as I know, it specifies that a vertex can have at most one value for this property; this used to be called functional() in Titan 0.2. It isn't the same as unique=True in bulbs, which I think is more like unique(Direction.IN). In vs. out uniqueness is briefly explained here: https://groups.google.com/forum/#!msg/aureliusgraphs/7dqadA3si6U/t89ORli0ZdMJ
The cleanest, most future-proof way may be to use Gremlin constructs such as...
g.V(name,value).has('element_type','person')
g.V('element_type','person').has(name,value)
But .has() involves iterating over a collection and testing each one.
I don't know how future-proof my approach is, but the indices I create beforehand in Groovy seem to work in bulbs. So I can call:
g.people.index.lookup(person_name='john'), and this seems to use the index if I've already defined one called 'person_name'.
There is also g.V(name,value).group(), which is new, and I think it's close to compound keys.