janusgraph icon indicating copy to clipboard operation
janusgraph copied to clipboard

Support String-type custom vertex ID

Open czheo opened this issue 7 years ago • 16 comments

Though #147 is merged, we still need to call graph.getIDManager().toVertexId(custom_id) to explicitly convert a custom vertex ID to a valid Janusgraph vertex ID. I feel it's unfriendly for the end users. For example, when I use gremlinpython library with Janus, I cannot get the IDManager instance from the Python script. Looking at https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-differences.html, I wonder if it's possible to have a similar String-type custom vertex ID as Neptune, where users can provide any string as the vertex ID, and JanusGraph will automatically hash the string and convert it to a Long type internal ID. We don't even need to expose the Long type internal ID to the end users. Are there any blockers preventing us to do so?

czheo avatar Aug 13 '18 23:08 czheo

I would like to notice that this is what stops me from migrating to JanusGraph.

ihoro avatar Jan 21 '20 11:01 ihoro

Are there any updates on this issue? Has anybody looked into this any further? We would really like to be able to use UUID-s, for example.

carlspring avatar Jul 25 '20 00:07 carlspring

Same here. I've been using Janusgraph for local development with deployment to aws neptune. Neptune supports only String id values.

siddharthTyagi avatar Jul 10 '21 03:07 siddharthTyagi

This does not seem to be too complicated. I am planning to work on this recently.

li-boxuan avatar Jul 10 '21 16:07 li-boxuan

It is not technically impossible - in fact, a working implementation is easy. However, it might not work as you expected.

JanusGraph expects the long id to be within the range [1, 36028797018963968), if you are using the default setting for cluster.max-partitions. 36028797018963968 looks like a very large number, but if you consider the mapping from alphanumeric to numeric, this is not that large.

Let's say we want to use digits and lower-case alphabets as custom vertex id. That is 10 + 26 = 36 characters. Now let's consider converting this alphanumeric string to a number. This has to be one-to-one mapping to avoid duplicates. One handy way is to consider the dictionary of the 36 alphanumeric characters as a new base. Then our goal is to convert a "number" from base36 to base10.

For example, let's say, our dictionary is: "0123456789abcdefghijklmnopqrstuvwxyz". Then for a custom vertexId "vid", the converted long value would be:

id = 31 * 36 * 36 + 18 * 36 + 13 = 40837.

We can see how a short 3-character string gets converted to a long value with 5 digits.

If our custom vertexId is "myvid98134", the converted long value would be 597172750407467008 which is already close to the upper bound. If our custom vertexId is "myvid981340", then we will exceed the upper bound thus get rejected.

As you can see, this is very different from Amazon neptune which seemingly allows a long string id. JanusGraph, however, due to its design, cannot support a long string id.

Btw, If anyone is interested, it is just one-liner code:

long id = graph.getIDManager().toVertexId(LongEncoding.decode("myvid001"))

We can integrate this into JanusGraph if there is still interest, after understanding the technical limitation of the possible string id support.

li-boxuan avatar Jul 12 '21 08:07 li-boxuan

@li-boxuan why can't JanusGraph be redesigned to work like Amazon neptune with long string ids?

What were the initial design decisions to choose long ids instead of long string ids?

Can't it work with cassandra and support long string ids?

DennisBB avatar Jul 12 '21 09:07 DennisBB

I am not creator of JanusGraph (or Titan) so I don't know how and why the initial design decisions were made so.

It does not seem to be impossible to redesign it while maintaining backward compatibility (at least at a first glance) but it would need quite some efforts. For example, if you look at https://docs.janusgraph.org/advanced-topics/data-model/#individual-edge-layout, you can see the adjacent ids are stored using the difference between current vertex id and adjacent vertex id:

JanusGraph does not store the actual vertex id but the difference to the id of the vertex that owns this adjacency list

so, if you have string as vertexId, you will need to store the full adjacent vertex id here.

That is just my first impression though. Not sure if there will be any blocker when real implementation starts. If anyone is interested to do a POC, I am happy to review and help.

li-boxuan avatar Jul 12 '21 10:07 li-boxuan

I wonder if it's possible to have a similar String-type custom vertex ID as Neptune, where users can provide any string as the vertex ID, and JanusGraph will automatically hash the string and convert it to a Long type internal ID.

Normally hashing have collisions. So, I don't think that's the right approach.

Then our goal is to convert a "number" from base36 to base10.

Yap, as @li-boxuan showed, that's not a good approach because we will exceed the limits with this approach very quickly.

Notice that ID must be unique. Also, usually ID determines where a vertex is stored (i.e. on which node in the cluster).

I think we could simply store a bi-directional map in a storage with string vertex id -> long vertex id. Internally JanusGraph will always work with long vertex id but on the API level we can simply convert all string vertex id to long vertex id before starting to work with the vertex. Thus, we will be able to have the same amount of string vertex ids as we have long vertex ids. Potentially, if the user wants to return string vertex id for vertices instead of long vertex id for vertices, we could have some configuration flag (probably transaction wise configuration flag) which could be enabled if we want to return string vertex id for vertices. That said, I'm mostly concerned about consistency of such approach. When two different JanusGraph nodes assign different long vertex ids to the same string vertex id before storing a new vertex in a graph, they could overwrite each other and thus, we would have kind of a zombie vertex after that (it will be actually accessible but only using long vertex id and not string vertex id). Thus, we either need to use locking for insertions with string vertex id enabled or just let user be aware of this problem so that they never try to create the same vertex with the same string id.

porunov avatar Aug 13 '21 00:08 porunov

What are the difference between implementations that use long ids and long string ids (without any converting)? That it can't be used here? Or were long ids used initially because it seemed just 'easier' to implement it like that?

What long ids do in JanusGraph that long string ids can't?

For example Cassandra uses string ids and is doing just fine. Can the logic be borrowed from there for example?

DennisBB avatar Aug 13 '21 01:08 DennisBB

I personally didn't research that and don't know the initial thoughts behind using Long for ids. What I can do is just assuming. Many databases have different limits on the keys (including Cassandra with max Key length of 65535). Earlier, some databases would even support only integer keys (like MySQL for example). Mostly, that was enforced for performance considerations and not to simply limit users on their use cases. JanusGraph by it's nature works with a variety of databases and saying using naturally Strings as vertex ids would definitely impact some of the databases. Probably, initial authors didn't want to use String data type for vertex id because it's length can be up to 2^32 which wouldn't be supported by many underlying databases. So, probably, back that time when TitanDB was created, authors decided to proceed with Long because it was a de-facto standard for keys that time (Long was excessively used) and Long datatype has quite a big capacity to store a big range of ids which would most likely be sufficient for most use cases. As you may know, initially there was (and there is) a default IDs distribution mechanism which can theoretically store up to 2^59 vertices. 2^59 Long IDs without any overhead (if just counting 8 bytes per id) would weight more than 4.611 Exabytes of data (that's just IDs without any data at all).

Back to string ids. I don't think blindly replacing long ids with string ids would be a good approach but adding mapping between long id and string id could actually be sufficient for many users as it will allow to use string ids with the length of 2^32 (i.e. instead of limiting the length of such ids we will just limit the amount of such ids to 2^59).

That said, the above are just my thoughts. I didn't look into the implementation details and I didn't thought about all the possible problems. Nevertheless, I'm happy to discuss the ideas and I will be happy to review someone's work if someone decides to implement this feature.

porunov avatar Aug 13 '21 02:08 porunov

Are there any updates on this issue? I am also very interested in using uuid's instead of long.

ryou90 avatar Sep 10 '21 13:09 ryou90

I too am blocked by this issue and will also be using string ids in production.

darrenhull avatar Oct 07 '21 09:10 darrenhull

@ryou90 , @darrenhull this feature isn't in my list of priorities but someone can potentially implement this feature. If you describe your usecases and why you need UUID or String ids as vertex Ids instead of Long ids it could drive some interest in the community. Is it to simplify migration process or something else? I wouldn't generally recommend using vertex ids directly for CQL folks as they are responsible for placement of vertices in the cluster. If you blindly assign vertex ids to vertices without any specific strategy you may end up having disbanded cluster as some partitions would be more loaded than other. That said, if you have specific strategy then it definitely makes sense but on the other hand it's not than clear why UUID or Strings are used and what's problem with using Longs. That said, it's probably off-topic.

porunov avatar Oct 07 '21 10:10 porunov

@porunov I'll try to start from the very beginning so I can highlight the issues we faced with JG due to the lack of UUID / String support.

  1. JanusGraph is advertising itself as a "highly available" database - what this means is you should be able to do all tasks with zero downtime. This, however, is not true with JG, because changing some of the actually important configuration properties require a FULL cluster shutdown for settings to be applied. (I am mentioning this because of 2. )

  2. JanusGraph will allocate a range of ids (i.e. [0-1001], [1002, 2003] etc), depending on the ids.block-size configuration. Each instance in the cluster will allocate a range with the same amount.

  3. Once you start dealing with real data and production scenarios, you will start having performance issues. This will lead you to the wonderful and detailed blog post of Expero - JanusGraph nuts and bolts which will guide you to increase the ids.block-size. That's great you will think - just flip a switch and all will be fine. But the more you increase the block-size the more end up with wasted IDs. That's so because fine-tuning some of the options require a FULL CLUSTER SHUTDOWN. So if you have a cluster of 32 instances and ids.block-size = 1m for best performance you will potentially have a wast of 32M ids when you need to restart cluster to apply a change ( point 1.). Of course the exact amount of wasted IDs would be unknown since you would have used a portion of but the rest will be unrecoverable forever. In addition to this - fine-tunning properties sometimes requires multiple restarts in a row to find the right balance which adds more wasted ids. (TL;DR - a node gets a range of IDs and this range can be used only by that node, but if you restart the node -- it will get a new range block, making the old one "wasted", because you cannot reuse nor re-cycle the range.)

  4. Also, depending on the configuration - the number of available IDs per node decreases exponentially:

    1. JG will flip the first bit to 1L so that the Long becomes unsigned and only positive numbers are used. (-1 bit reserved).
    2. You should search for USERVERTEX_PADDING_BITWIDTH in the code - that also reserves another -3 bits reserved
    3. cluster.max-partitions is part of the "reserved" JG bits, so if cluster.max-partitions=32 ==> -5 bits reserved
         18,446,744,073,709,552,000 - 64bit 
          1,152,921,504,606,847,000 - 60bit - assuming 4 bits are reserved via org.janusgraph.graphdb.idmanagement.IDManager, check 4.1 and 4.2 above
             36,028,797,018,963,970 - 55bit - `60 - 5` bits  => allows for    16 instances
             18,014,398,509,481,984 - 54bit - `60 - 6` bits  => allows for    32 instances
              9,007,199,254,740,992 - 53bit - `60 - 7` bits  => allows for    64 instances
              4,503,599,627,370,495 - 52bit - `60 - 8` bits  => allows for   128 instances
    

    These numbers look big and they really are. But depending on the "id waste" generated, they could exponentially decrease.

  5. From what I've heard on the community meetings hosted by Expero and the JG team regarding the question "what if you need to increase the cluster above max-partitions -- you would need to start with a fresh cluster and re-import your data (cluster.max-partitions). This does not in any way work with the term "High Availability" and is one more reason why the current implementation isn't working for us.

The above points should illustrate the issue(s) and why this feature is actually important. If there was String or UUID support instead of Longs then entire id allocation can be completely avoided. The performance would be significantly better since the JG instance will not be doing ID collision checks when the pool is exhausted. You also wouldn't need to do cluster shutdowns which will improve high availability.

Have a look at the RFC-4122 - A Universally Unique IDentifier (UUID) URN Namespace which was used in the Apollo Network Computing System. In essence it's a 128bit string that has the following parts:

UUID                   = time-low "-" time-mid "-" time-high-and-version "-" clock-seq-and-reserved clock-seq-low "-" node
time-low               = 4hexOctet (collections of 8 bits) = 4 * 8 = 32 bits
time-mid               = 2hexOctet (collections of 8 bits) = 2 * 8 = 16 bits
time-high-and-version  = 2hexOctet (collections of 8 bits) = 2 * 8 = 16 bits
clock-seq-and-reserved = hexOctet  (collections of 8 bits) = 1 * 8 =  8 bits
clock-seq-low          = hexOctet  (collections of 8 bits) = 1 * 8 =  8 bits
node                   = 6hexOctet (collections of 8 bits) = 6 * 8 = 48 bits
hexOctet               = hexDigit hexDigit
hexDigit = "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9" /
           "a" / "b" / "c" / "d" / "e" / "f" /
           "A" / "B" / "C" / "D" / "E" / "F"

What I'm proposing as an idea:

  1. JG should migrate to using UUIDs instead of Longs internally. I completely understand this is going to be an Epic task that needs to be evaluated, discussed and decided upon. It will probably bring backwards incompatible changes. But the benefits would be much more significant.

The benefits of having UUID/String instead of Long:

  1. If there was UUID support in JG -- it would be possible to completely avoid ID allocation as long as the node id is unique for each node. This sounds stupid, but every time an instance needs to allocate an id range -- it can take between 100ms and 2seconds for this to actually "happen" (sigh... ids.authority.wait-time, ). During this time the JG instance can be receiving many insert requests which will eventually end up in a deadlock. (been there, done that!)
  2. It will also increase the high availability of JG - you wouldn't need to increase the ids.block-size so no need to shutdown the entire cluster to "fix performance issues". :)
  3. You won't waste IDs even if you need to restart the entire cluster.
  4. You could use the node segment to set additional bits similarly to how JG does it right now with Longs so the placement of vertexes / edges could be fine tuned?
  5. UUIDs should allow you to have as many nodes as you'd like in a cluster without having deal with cluster.max-partitions
  6. The above will make JG more attractive to companies that want to use JG as their main db (not just doing analytics)

The drawbacks:

  1. Some storage backends might not be able to support it.
  2. Potentially backwards incompatible change.
  3. Will need consensus among the JG core devs and community.

steve-todorov avatar Oct 07 '21 18:10 steve-todorov

Thanks @steve-todorov for your feedback and suggestions.

To add some thoughts into discussion. There is a similar ID generation technique to UUID which takes 64 bits instead of 128 bits called Twitter Snowflakes. There are quite a lot of implementations of this algorithm in open source. Quickly googling Java examples brings me to: https://github.com/callicoder/java-snowflake https://github.com/phxql/snowflake-id

I also developed similar id generator tool in 2017 which was slightly more configurable and can generate ids from 1 bit to 192 bits: https://github.com/JBlur/id-generator

This is something which can be used already with JanusGraph (using custom ids assignment) but it will typically allow you to generate ids in distributed fashion for ~69 years only. After that it won't guarantee protection from collisions. Also, this algorithm doesn't allow you to include custom data into it other than instance number.

Back to UUID / String ids discussion I think it might makes sense to allow JanusGraph to accept arbitrary ids as TinkerGraph does. We probably could think of supporting several vertex ID modes (like the default with Long ids and allocation process or custom IDs with arbitrary data). If arbitrary IDs are supported, I think it might help the community to use JanusGraph easier.

porunov avatar Oct 07 '21 19:10 porunov

I have made a workable implementation at: https://github.com/li-boxuan/janusgraph/commits/feature/string-vertex-id

A few things to notice:

  1. It allows users to set a custom string-type vertex ID. The vertex ID must only use ASCII characters. Moreover, the usage of dash character '-' is prohibited (because it is a reserved marker in JanusGraph to maintain backward compatibility). This means you cannot directly use UUID until you replace the dash character with something else. This is not finalized; in another POC (https://github.com/li-boxuan/janusgraph/commits/feature/string-vertex-id-allow-any-char), dash characters are allowed, but the overhead will be higher (because we use base64 encoding).
  2. There is no limitation on the length of the ID (as long as the backend storage allows).
  3. Property and edge ids are still auto-generated, but it shouldn't be hard to make them customizable as well (if there is enough interest). Then the usage of the ID allocation service will be greatly reduced - it will only be needed for generating IDs for meta-data like schema elements (maybe in the future we can also work on this and make ID allocation service completely optional).
  4. This functionality works for the in-memory store, Cassandra, and HBase. It does not work for BerkeleyDB.

Any suggestions, feedback, and code review are welcome. If anyone wants to experiment with the feature, clone my branch and build JanusGraph by yourself. At the same time, I will start splitting the work into several smaller commits and creating PRs.

li-boxuan avatar Jan 08 '22 08:01 li-boxuan

I would like to notice that this is what stops me from migrating to JanusGraph.

Same issue here , I am not able to migrate because of UUID in the current database

thirumalx avatar Dec 08 '22 14:12 thirumalx