janusgraph
janusgraph copied to clipboard
Support String-type custom vertex ID
Though #147 is merged, we still need to call graph.getIDManager().toVertexId(custom_id) to explicitly convert a custom vertex ID to a valid Janusgraph vertex ID. I feel it's unfriendly for the end users. For example, when I use gremlinpython library with Janus, I cannot get the IDManager instance from the Python script.
Looking at https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-differences.html, I wonder if it's possible to have a similar String-type custom vertex ID as Neptune, where users can provide any string as the vertex ID, and JanusGraph will automatically hash the string and convert it to a Long type internal ID. We don't even need to expose the Long type internal ID to the end users.
Are there any blockers preventing us to do so?
I would like to notice that this is what stops me from migrating to JanusGraph.
Are there any updates on this issue? Has anybody looked into this any further? We would really like to be able to use UUID-s, for example.
Same here. I've been using Janusgraph for local development with deployment to aws neptune. Neptune supports only String id values.
This does not seem to be too complicated. I am planning to work on this recently.
It is not technically impossible - in fact, a working implementation is easy. However, it might not work as you expected.
JanusGraph expects the long id to be within the range [1, 36028797018963968), if you are using the default setting for cluster.max-partitions. 36028797018963968 looks like a very large number, but if you consider the mapping from alphanumeric to numeric, this is not that large.
Let's say we want to use digits and lower-case alphabets as custom vertex id. That is 10 + 26 = 36 characters. Now let's consider converting this alphanumeric string to a number. This has to be one-to-one mapping to avoid duplicates. One handy way is to consider the dictionary of the 36 alphanumeric characters as a new base. Then our goal is to convert a "number" from base36 to base10.
For example, let's say, our dictionary is: "0123456789abcdefghijklmnopqrstuvwxyz". Then for a custom vertexId "vid", the converted long value would be:
id = 31 * 36 * 36 + 18 * 36 + 13 = 40837.
We can see how a short 3-character string gets converted to a long value with 5 digits.
If our custom vertexId is "myvid98134", the converted long value would be 597172750407467008 which is already close to the upper bound. If our custom vertexId is "myvid981340", then we will exceed the upper bound thus get rejected.
As you can see, this is very different from Amazon neptune which seemingly allows a long string id. JanusGraph, however, due to its design, cannot support a long string id.
Btw, If anyone is interested, it is just one-liner code:
long id = graph.getIDManager().toVertexId(LongEncoding.decode("myvid001"))
We can integrate this into JanusGraph if there is still interest, after understanding the technical limitation of the possible string id support.
@li-boxuan why can't JanusGraph be redesigned to work like Amazon neptune with long string ids?
What were the initial design decisions to choose long ids instead of long string ids?
Can't it work with cassandra and support long string ids?
I am not creator of JanusGraph (or Titan) so I don't know how and why the initial design decisions were made so.
It does not seem to be impossible to redesign it while maintaining backward compatibility (at least at a first glance) but it would need quite some efforts. For example, if you look at https://docs.janusgraph.org/advanced-topics/data-model/#individual-edge-layout, you can see the adjacent ids are stored using the difference between current vertex id and adjacent vertex id:
JanusGraph does not store the actual vertex id but the difference to the id of the vertex that owns this adjacency list
so, if you have string as vertexId, you will need to store the full adjacent vertex id here.
That is just my first impression though. Not sure if there will be any blocker when real implementation starts. If anyone is interested to do a POC, I am happy to review and help.
I wonder if it's possible to have a similar String-type custom vertex ID as Neptune, where users can provide any string as the vertex ID, and JanusGraph will automatically hash the string and convert it to a Long type internal ID.
Normally hashing have collisions. So, I don't think that's the right approach.
Then our goal is to convert a "number" from base36 to base10.
Yap, as @li-boxuan showed, that's not a good approach because we will exceed the limits with this approach very quickly.
Notice that ID must be unique. Also, usually ID determines where a vertex is stored (i.e. on which node in the cluster).
I think we could simply store a bi-directional map in a storage with string vertex id -> long vertex id. Internally JanusGraph will always work with long vertex id but on the API level we can simply convert all string vertex id to long vertex id before starting to work with the vertex. Thus, we will be able to have the same amount of string vertex ids as we have long vertex ids.
Potentially, if the user wants to return string vertex id for vertices instead of long vertex id for vertices, we could have some configuration flag (probably transaction wise configuration flag) which could be enabled if we want to return string vertex id for vertices.
That said, I'm mostly concerned about consistency of such approach. When two different JanusGraph nodes assign different long vertex ids to the same string vertex id before storing a new vertex in a graph, they could overwrite each other and thus, we would have kind of a zombie vertex after that (it will be actually accessible but only using long vertex id and not string vertex id). Thus, we either need to use locking for insertions with string vertex id enabled or just let user be aware of this problem so that they never try to create the same vertex with the same string id.
What are the difference between implementations that use long ids and long string ids (without any converting)? That it can't be used here? Or were long ids used initially because it seemed just 'easier' to implement it like that?
What long ids do in JanusGraph that long string ids can't?
For example Cassandra uses string ids and is doing just fine. Can the logic be borrowed from there for example?
I personally didn't research that and don't know the initial thoughts behind using Long for ids. What I can do is just assuming.
Many databases have different limits on the keys (including Cassandra with max Key length of 65535). Earlier, some databases would even support only integer keys (like MySQL for example). Mostly, that was enforced for performance considerations and not to simply limit users on their use cases.
JanusGraph by it's nature works with a variety of databases and saying using naturally Strings as vertex ids would definitely impact some of the databases.
Probably, initial authors didn't want to use String data type for vertex id because it's length can be up to 2^32 which wouldn't be supported by many underlying databases.
So, probably, back that time when TitanDB was created, authors decided to proceed with Long because it was a de-facto standard for keys that time (Long was excessively used) and Long datatype has quite a big capacity to store a big range of ids which would most likely be sufficient for most use cases.
As you may know, initially there was (and there is) a default IDs distribution mechanism which can theoretically store up to 2^59 vertices.
2^59 Long IDs without any overhead (if just counting 8 bytes per id) would weight more than 4.611 Exabytes of data (that's just IDs without any data at all).
Back to string ids. I don't think blindly replacing long ids with string ids would be a good approach but adding mapping between long id and string id could actually be sufficient for many users as it will allow to use string ids with the length of 2^32 (i.e. instead of limiting the length of such ids we will just limit the amount of such ids to 2^59).
That said, the above are just my thoughts. I didn't look into the implementation details and I didn't thought about all the possible problems. Nevertheless, I'm happy to discuss the ideas and I will be happy to review someone's work if someone decides to implement this feature.
Are there any updates on this issue? I am also very interested in using uuid's instead of long.
I too am blocked by this issue and will also be using string ids in production.
@ryou90 , @darrenhull this feature isn't in my list of priorities but someone can potentially implement this feature. If you describe your usecases and why you need UUID or String ids as vertex Ids instead of Long ids it could drive some interest in the community. Is it to simplify migration process or something else? I wouldn't generally recommend using vertex ids directly for CQL folks as they are responsible for placement of vertices in the cluster. If you blindly assign vertex ids to vertices without any specific strategy you may end up having disbanded cluster as some partitions would be more loaded than other. That said, if you have specific strategy then it definitely makes sense but on the other hand it's not than clear why UUID or Strings are used and what's problem with using Longs. That said, it's probably off-topic.
@porunov I'll try to start from the very beginning so I can highlight the issues we faced with JG due to the lack of UUID / String support.
-
JanusGraph is advertising itself as a "highly available" database - what this means is you should be able to do all tasks with zero downtime. This, however, is not true with JG, because changing some of the actually important configuration properties require a FULL cluster shutdown for settings to be applied. (I am mentioning this because of
2.) -
JanusGraph will
allocatea range ofids(i.e.[0-1001], [1002, 2003]etc), depending on theids.block-sizeconfiguration. Each instance in the cluster will allocate a range with the same amount. -
Once you start dealing with real data and production scenarios, you will start having performance issues. This will lead you to the wonderful and detailed blog post of Expero - JanusGraph nuts and bolts which will guide you to increase the
ids.block-size. That's great you will think - just flip a switch and all will be fine. But the more you increase theblock-sizethe more end up with wasted IDs. That's so because fine-tuning some of the options require a FULL CLUSTER SHUTDOWN. So if you have a cluster of 32 instances andids.block-size = 1mfor best performance you will potentially have a wast of 32M ids when you need to restart cluster to apply a change ( point 1.). Of course the exact amount of wasted IDs would be unknown since you would have used a portion of but the rest will be unrecoverable forever. In addition to this - fine-tunning properties sometimes requires multiple restarts in a row to find the right balance which adds more wasted ids. (TL;DR - a node gets a range of IDs and this range can be used only by that node, but if you restart the node -- it will get a new range block, making the old one "wasted", because you cannot reuse nor re-cycle the range.) -
Also, depending on the configuration - the number of available IDs per node decreases exponentially:
- JG will flip the first bit to
1Lso that the Long becomes unsigned and only positive numbers are used. (-1 bitreserved). - You should search for
USERVERTEX_PADDING_BITWIDTHin the code - that also reserves another-3 bitsreserved cluster.max-partitionsis part of the "reserved" JG bits, so ifcluster.max-partitions=32==>-5 bitsreserved
18,446,744,073,709,552,000 - 64bit 1,152,921,504,606,847,000 - 60bit - assuming 4 bits are reserved via org.janusgraph.graphdb.idmanagement.IDManager, check 4.1 and 4.2 above 36,028,797,018,963,970 - 55bit - `60 - 5` bits => allows for 16 instances 18,014,398,509,481,984 - 54bit - `60 - 6` bits => allows for 32 instances 9,007,199,254,740,992 - 53bit - `60 - 7` bits => allows for 64 instances 4,503,599,627,370,495 - 52bit - `60 - 8` bits => allows for 128 instancesThese numbers look big and they really are. But depending on the "id waste" generated, they could exponentially decrease.
- JG will flip the first bit to
-
From what I've heard on the community meetings hosted by Expero and the JG team regarding the question "what if you need to increase the cluster above
max-partitions-- you would need to start with a fresh cluster and re-import your data (cluster.max-partitions). This does not in any way work with the term "High Availability" and is one more reason why the current implementation isn't working for us.
The above points should illustrate the issue(s) and why this feature is actually important.
If there was String or UUID support instead of Longs then entire id allocation can be completely avoided. The performance would be significantly better since the JG instance will not be doing ID collision checks when the pool is exhausted. You also wouldn't need to do cluster shutdowns which will improve high availability.
Have a look at the RFC-4122 - A Universally Unique IDentifier (UUID) URN Namespace which was used in the Apollo Network Computing System. In essence it's a 128bit string that has the following parts:
UUID = time-low "-" time-mid "-" time-high-and-version "-" clock-seq-and-reserved clock-seq-low "-" node
time-low = 4hexOctet (collections of 8 bits) = 4 * 8 = 32 bits
time-mid = 2hexOctet (collections of 8 bits) = 2 * 8 = 16 bits
time-high-and-version = 2hexOctet (collections of 8 bits) = 2 * 8 = 16 bits
clock-seq-and-reserved = hexOctet (collections of 8 bits) = 1 * 8 = 8 bits
clock-seq-low = hexOctet (collections of 8 bits) = 1 * 8 = 8 bits
node = 6hexOctet (collections of 8 bits) = 6 * 8 = 48 bits
hexOctet = hexDigit hexDigit
hexDigit = "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9" /
"a" / "b" / "c" / "d" / "e" / "f" /
"A" / "B" / "C" / "D" / "E" / "F"
What I'm proposing as an idea:
- JG should migrate to using UUIDs instead of Longs internally. I completely understand this is going to be an
Epictask that needs to be evaluated, discussed and decided upon. It will probably bring backwards incompatible changes. But the benefits would be much more significant.
The benefits of having UUID/String instead of Long:
- If there was UUID support in JG -- it would be possible to completely avoid ID allocation as long as the
node idis unique for each node. This sounds stupid, but every time an instance needs to allocate an id range -- it can take between 100ms and 2seconds for this to actually "happen" (sigh...ids.authority.wait-time, ). During this time the JG instance can be receiving many insert requests which will eventually end up in a deadlock. (been there, done that!) - It will also increase the high availability of JG - you wouldn't need to increase the
ids.block-sizeso no need to shutdown the entire cluster to "fix performance issues". :) - You won't waste IDs even if you need to restart the entire cluster.
- You could use the
nodesegment to set additional bits similarly to how JG does it right now with Longs so the placement of vertexes / edges could be fine tuned? - UUIDs should allow you to have as many nodes as you'd like in a cluster without having deal with
cluster.max-partitions - The above will make JG more attractive to companies that want to use JG as their main db (not just doing analytics)
The drawbacks:
- Some storage backends might not be able to support it.
- Potentially backwards incompatible change.
- Will need consensus among the JG core devs and community.
Thanks @steve-todorov for your feedback and suggestions.
To add some thoughts into discussion. There is a similar ID generation technique to UUID which takes 64 bits instead of 128 bits called Twitter Snowflakes. There are quite a lot of implementations of this algorithm in open source. Quickly googling Java examples brings me to: https://github.com/callicoder/java-snowflake https://github.com/phxql/snowflake-id
I also developed similar id generator tool in 2017 which was slightly more configurable and can generate ids from 1 bit to 192 bits: https://github.com/JBlur/id-generator
This is something which can be used already with JanusGraph (using custom ids assignment) but it will typically allow you to generate ids in distributed fashion for ~69 years only. After that it won't guarantee protection from collisions. Also, this algorithm doesn't allow you to include custom data into it other than instance number.
Back to UUID / String ids discussion I think it might makes sense to allow JanusGraph to accept arbitrary ids as TinkerGraph does. We probably could think of supporting several vertex ID modes (like the default with Long ids and allocation process or custom IDs with arbitrary data). If arbitrary IDs are supported, I think it might help the community to use JanusGraph easier.
I have made a workable implementation at: https://github.com/li-boxuan/janusgraph/commits/feature/string-vertex-id
A few things to notice:
- It allows users to set a custom string-type vertex ID. The vertex ID must only use ASCII characters. Moreover, the usage of dash character '-' is prohibited (because it is a reserved marker in JanusGraph to maintain backward compatibility). This means you cannot directly use UUID until you replace the dash character with something else. This is not finalized; in another POC (https://github.com/li-boxuan/janusgraph/commits/feature/string-vertex-id-allow-any-char), dash characters are allowed, but the overhead will be higher (because we use base64 encoding).
- There is no limitation on the length of the ID (as long as the backend storage allows).
- Property and edge ids are still auto-generated, but it shouldn't be hard to make them customizable as well (if there is enough interest). Then the usage of the ID allocation service will be greatly reduced - it will only be needed for generating IDs for meta-data like schema elements (maybe in the future we can also work on this and make ID allocation service completely optional).
- This functionality works for the in-memory store, Cassandra, and HBase. It does not work for BerkeleyDB.
Any suggestions, feedback, and code review are welcome. If anyone wants to experiment with the feature, clone my branch and build JanusGraph by yourself. At the same time, I will start splitting the work into several smaller commits and creating PRs.
I would like to notice that this is what stops me from migrating to JanusGraph.
Same issue here , I am not able to migrate because of UUID in the current database