GraphEngine icon indicating copy to clipboard operation
GraphEngine copied to clipboard

Question: Graph Engine's CellID generation methods

Open bingzhangdai opened this issue 8 years ago • 4 comments

I have noticed that if the CellID is not given when we new a cell, the code will call CellID = CellIDFactory.NewCellID() to assign the id automatically. NewCellID() method will eventually call Random.NextBytes(byte[] buffer). Could we ignore the id collisions in some rare cases here?

bingzhangdai avatar Aug 13 '17 16:08 bingzhangdai

@bingzhangdai nope we should never ignore collisions. You can override the cellid allocation routine to make it collision-aware. this may require distributed consensus and would then be another (big) topic. also related is the string/anyT->cellid hashing algorithm that works in a cluster.

yatli avatar Aug 14 '17 04:08 yatli

@yatli Thanks! NextSequentialCellID() can generate a sequentially incremented CellId, which is feasible. Maybe we could bring some ideas from other databases, for example, MongoDB's _id generation method? I still think this is better to be done by the database, not the user.

btw, just as what you mentioned, I raise this concern from #111 actually. I saw a dramatically high CPU utilization by using hash function and bucket (list) to handle the collisions (like distributed hashtable example) with frequent insert/update queries, especially when the graph has more than hundreds of millions vertices. Collision handling process is really unpleasant. :(

bingzhangdai avatar Aug 14 '17 06:08 bingzhangdai

@bingzhangdai agreed.. I can feel the pain, as it's also happening on our side. wrt MongoDB _Id, I believe that it has derived from the BSON objectid:

A BSON ObjectID is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter

So, 4 extra bytes than our 64-bit cellids, but with this format I'm afraid that there would be even more collisions (only 3 byte counter, limited machineId/processId representation etc.). Looks like MongoDB ignores this issue (lots of 3rd party drivers so it does not know if it's collision or upsertion, anyway).

But overall I think the collision rate would not be high given a strong hash function. I'll be working on an improved version of the distributed hash table when our dynamic cluster model design is finalized.

yatli avatar Aug 14 '17 08:08 yatli

@yatli absolutely agree with you. When I have to think of collisions, the coding procedure would be less fluent and schema design would become a little bit complex though.

btw, HashHelper.HashString2Int64(string key) does quite well and collision rate is really low. :)

bingzhangdai avatar Aug 14 '17 11:08 bingzhangdai