GraphEngine
                                
                                
                                
                                    GraphEngine copied to clipboard
                            
                            
                            
                        Question: Graph Engine's CellID generation methods
I have noticed that if the CellID is not given when we new a cell, the code will call CellID = CellIDFactory.NewCellID() to assign the id automatically. NewCellID() method will eventually call Random.NextBytes(byte[] buffer). Could we ignore the id collisions in some rare cases here?
@bingzhangdai nope we should never ignore collisions. You can override the cellid allocation routine to make it collision-aware. this may require distributed consensus and would then be another (big) topic. also related is the string/anyT->cellid hashing algorithm that works in a cluster.
@yatli Thanks! NextSequentialCellID() can generate  a sequentially incremented CellId, which is feasible. Maybe we could bring some ideas from other databases, for example, MongoDB's _id generation method?
I still think this is better to be done by the database, not the user.
btw, just as what you mentioned, I raise this concern from #111 actually. I saw a dramatically high CPU utilization by using hash function and bucket (list) to handle the collisions (like distributed hashtable example) with frequent insert/update queries, especially when the graph has more than hundreds of millions vertices. Collision handling process is really unpleasant. :(
@bingzhangdai agreed.. I can feel the pain, as it's also happening on our side. wrt MongoDB _Id, I believe that it has derived from the BSON objectid:
A BSON ObjectID is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter
So, 4 extra bytes than our 64-bit cellids, but with this format I'm afraid that there would be even more collisions (only 3 byte counter, limited machineId/processId representation etc.). Looks like MongoDB ignores this issue (lots of 3rd party drivers so it does not know if it's collision or upsertion, anyway).
But overall I think the collision rate would not be high given a strong hash function. I'll be working on an improved version of the distributed hash table when our dynamic cluster model design is finalized.
@yatli absolutely agree with you. When I have to think of collisions, the coding procedure would be less fluent and schema design would become a little bit complex though.
btw, HashHelper.HashString2Int64(string key) does quite well and collision rate is really low. :)