cleora
Calculating embeddings for new nodes after training
I am trying to run Cleora on a simple dataset. My TSV file follows the "lead<TAB>attribute" format:
```
l1	a1
l2	a1
l1	a2
l3	a2
```
Each lead is connected to some attributes.
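To make the input format concrete, here is a minimal sketch of parsing such a TSV into a lead-to-attributes adjacency map. The function name `load_edges` is my own illustration, not part of Cleora's API:

```python
from collections import defaultdict

def load_edges(path):
    """Parse a 'lead<TAB>attribute' TSV into a dict: lead -> set of attributes."""
    leads = defaultdict(set)
    with open(path) as f:
        for line in f:
            # each line holds exactly one edge: lead, tab, attribute
            lead, attr = line.rstrip("\n").split("\t")
            leads[lead].add(attr)
    return leads
```

Cleora itself consumes the TSV directly; a map like this is only useful later, when reconstructing embeddings for new leads from their attributes.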
I have Set A, which is used to train embeddings for all nodes (leads and attributes) in the set.
For new nodes with the same "lead<TAB>attribute" format in Set B, I calculate embeddings using the following two methods. I then train an XGBoost model on the embeddings of all "lead" nodes of Set A and predict on the "lead" nodes of Set B to compute the AUC.
Method 1
I jointly train embeddings on the combination of Set A and Set B and obtain embeddings for all "lead" nodes. On Set B, the AUC of the XGBoost model (trained on the "lead" embeddings of Set A) is ~0.8.
Method 2
I use the method suggested in the closed issue https://github.com/Synerise/cleora/issues/21: I train the embeddings on Set A only. Then, for each "lead" node of Set B, I extract the embeddings of all attributes that lead is connected to, average them, and apply L2 normalization. With the XGBoost model trained on the Set A "lead" embeddings, I predict on the Set B "lead" embeddings. The AUC drops to ~0.65.
Is there any reason for the drop in AUC with Method 2, which was suggested for calculating embeddings for incoming nodes on the fly? The alternative is Method 1, where I have to retrain on the whole graph, including the new nodes, every time.
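For reference, Method 2 (average the attribute embeddings of a lead, then L2-normalize) can be sketched as below. The helper name and dict layout are my own assumptions, not Cleora API:

```python
import numpy as np

def reconstruct_lead_embedding(attr_ids, attr_embeddings):
    """Approximate a new lead's embedding from its attributes' embeddings.

    attr_ids: attribute node ids connected to one lead
    attr_embeddings: dict mapping attribute id -> np.ndarray (trained on Set A)
    """
    # keep only attributes that were seen during training on Set A
    vecs = [attr_embeddings[a] for a in attr_ids if a in attr_embeddings]
    if not vecs:
        raise ValueError("no known attributes for this lead")
    mean = np.mean(vecs, axis=0)          # average the attribute embeddings
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean  # L2-normalize
```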
Thanks
Dear @judas123

Some drop in embedding quality is expected in the 'averaging' scenario, but your drop is large, so there are a few things to check:

- Is your Set B large? Perhaps some attributes appear only there and are not represented in Set A, so no meaningful embeddings have been computed for them. This could happen if your sets are split on a temporal basis and some information drift appears over time: many new attributes are created and old ones are discarded.
- Maybe your Set A is markedly different from Set B in the underlying logic of the data, e.g. in Set A the 'leads' are children's toys while Set B contains clothing items. Our "node reconstruction" scenario assumes that the graph chunks share a common logic, which can be carried over from the base graph to the "new" graph connections.
- Note that by averaging the embeddings you are effectively conducting an extra iteration of Cleora, so you may be going a step too far. You could therefore try embedding Set A with best_iteration - 1 and averaging those embeddings instead. Your optimal iteration number may also be altogether different when training on Set A only, due to pronounced differences in the graph.
- I would check the performance when training both the embeddings AND the model on Set A, to verify that the Set A embeddings are well trained.

Generally speaking, if your Method 1 works much better than Method 2, it makes sense to periodically recompute the whole graph embeddings. Cleora is designed to be efficient enough that this full recompute scenario can usually be done very often. In fact, this is what we do at our company: we simply retrain on the graph regularly to ensure the best possible performance.

Hope this helps! Barbara
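The first suggested check (whether Set B's attributes are actually covered by Set A) can be done with a one-liner. This is an illustrative diagnostic, not part of Cleora:

```python
def attribute_coverage(set_a_attrs, set_b_attrs):
    """Fraction of Set B attributes that also appear in Set A.

    A value below 1.0 means some Set B attributes have no trained
    embedding, so averaged lead embeddings lose information.
    """
    b = set(set_b_attrs)
    if not b:
        return 1.0  # nothing to cover
    return len(b & set(set_a_attrs)) / len(b)
```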
@barbara3430 Thanks for the detailed answer.
- Set A, on which I build and train the embeddings, has around 330K lead nodes and, together with the attributes, around 2.6 million edges. Set B has 80K lead nodes connected to the attributes.
- There are no new attributes in Set B; its attributes are a subset of those in Set A.
- I will try the suggested methodology of training until best_iteration - 1 and then averaging the embeddings.
- If that does not work, Method 1 will probably be the approach I go ahead with, as you suggested.
Thanks