Incremental update of trust levels in a dynamic blockchain graph
Within Tribler we aim to calculate and show the trust levels of our neighbors.
Trust levels evolve over time as more information comes in from our blockchain. Restarting the calculation from scratch every time new information arrives is prohibitively expensive.
We require an incremental algorithm that can update the results of our sybil-resilient random walk algorithm. With the dataset from our deployed network, a full calculation currently takes 6-9 seconds. Once we hold a local dataset of 50,000 surrounding peers, an incremental algorithm becomes a necessity.
Key related work by Stanford and Twitter engineers "Fast incremental and personalized PageRank":
"For global PageRank, we assume that the social network has n nodes, and m adversarially chosen edges arrive in a random order. We show that with a reset probability of ε, the expected total work needed to maintain an accurate estimate (using the Monte Carlo method) of the PageRank of every node at all times is O(n ln(m) / ε²). This is significantly better than all known bounds for incremental PageRank. For instance, if we naively recompute the PageRanks as each edge arrives, the simple power iteration method needs Ω(m² / ln(1/(1−ε))) total time and the Monte Carlo method needs O(mn/ε) total time; both are prohibitively expensive. We also show that we can handle deletions equally efficiently."
- IRWR: incremental random walk with restart
- An Incremental Algorithm for a Generalization of the Shortest-Path Problem
No decent Python implementation of incremental PageRank seems to exist.
More relevant literature: http://repository.tudelft.nl/islandora/object/uuid%3A17adc7bd-5c82-4ad5-b1c8-a8b85b23db1f?collection=education
http://www-sop.inria.fr/members/Konstantin.Avratchenkov/pubs/mc.pdf This paper introduces Monte Carlo methods for computing PageRank values.
http://crad.ict.ac.cn/CN/abstract/abstract3238.shtml
Unlike "Fast incremental and personalized PageRank", the algorithm in this paper does not need to store the random walk paths when updating the PageRank values.
The following picture indicates how the edges are added or removed


This is the architecture of the algorithm: every time the graph changes, we only need to update the visit-times value VT(x) for each node x in the graph, based on the variation of the graph and the PageRank values from the previous step.
General research description: This project will address the problem of trust. Using an experiment-driven approach with real-world datasets and Internet deployment, we aim to significantly advance our methods for building trust. The first aspect to investigate is taking existing algorithms and determining their viability in a self-organizing systems environment, without central control or oversight. Next steps are obtaining real-world datasets and applying techniques from the emerging Big Data field: graph sampling, random walks, and graph processing in general. The desired outcome is a computationally efficient algorithm to calculate trust in dynamic graphs, with real-world validation.
Excellent work by Harvard University + Zurich: Personalized Hitting Time for Informative Trust Mechanisms Despite Sybils (thesis version).
- Please also read the foundations: https://github.com/dmvaldman/library/tree/master/special%20topics/p2p
- Graph sampling: http://dl.acm.org/citation.cfm?id=1150479
- Study this Java code implementation of the 2010 Stanford Twitter algorithm: https://launchpad.net/jfppr
2005 work: Incremental page rank computation on evolving graphs Python graph code
Why trust and incremental pagerank:
First code: https://github.com/Peiteng/Incremental-Pagerank
ToDo next meeting:
- generate 1 random graph
- change the network 5%, 10%, 25%, 50% etc.
- calculate your incremental pagerank
- how accurate is it? (re-calculate)
- repeat above for numerous random graphs
Key outcome plot:
- X-axis: percent change in graph
- Y-axis: accuracy of incremental pagerank
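The experiment above can be sketched in plain Python. This is a minimal harness, not the incremental implementation itself: it uses power iteration as ground truth and the stale (pre-change) ranks as a stand-in for an estimate that has not yet been updated, so the error shows how much a given percentage of change perturbs the ranking. All names and parameters are illustrative.

```python
import random

def pagerank_power(graph, reset_prob=0.15, iters=100):
    """Plain power-iteration PageRank on a {node: set(successors)} graph."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: reset_prob / n for v in nodes}
        for v in nodes:
            out = graph[v]
            if out:
                share = (1 - reset_prob) * rank[v] / len(out)
                for w in out:
                    nxt[w] += share
            else:  # dangling node: spread its mass uniformly
                for w in nodes:
                    nxt[w] += (1 - reset_prob) * rank[v] / n
        rank = nxt
    return rank

def random_graph(n, m, rng):
    """Directed random graph with n nodes and roughly m edges."""
    g = {v: set() for v in range(n)}
    while sum(len(s) for s in g.values()) < m:
        a, b = rng.randrange(n), rng.randrange(n)
        if a != b:
            g[a].add(b)
    return g

def perturb(graph, fraction, rng):
    """Rewire `fraction` of the edges at random (the '5%, 10%, ...' step)."""
    g = {v: set(s) for v, s in graph.items()}
    edges = [(a, b) for a in g for b in g[a]]
    for a, b in rng.sample(edges, int(fraction * len(edges))):
        g[a].discard(b)
        c, d = rng.randrange(len(g)), rng.randrange(len(g))
        if c != d:
            g[c].add(d)
    return g

rng = random.Random(42)
g0 = random_graph(50, 400, rng)
base = pagerank_power(g0)
for frac in (0.05, 0.10, 0.25, 0.50):
    exact = pagerank_power(perturb(g0, frac, rng))
    # stale ranks stand in for an un-updated estimate
    err = sum(abs(base[v] - exact[v]) for v in g0)
    print("%.0f%% change -> L1 error %.4f" % (frac * 100, err))
```

Repeating this over many random graphs and plotting error against change percentage yields the key outcome plot described above.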
Possible solo "blockchain Engineering" course. ToDo: take the trustchain dataset, copy algorithms from the state of the art, possibly enhance them, implement Python code, and write out beautiful equations on probability & stochastics. @alexander-stannat
We establish a surprising connection between the personalized PageRank algorithm and the stochastic block model for random graphs, showing that personalized PageRank, in fact, provides the optimal geometric discriminant function for separating the communities in stochastic block models over a wide class of functions. Building on this result, we develop stronger classifiers that, although scalable, are competitive with computationally much more demanding methods such as belief propagation.
First sprint:
- [ ] get dataset @jangerritharms
- [ ] understand Martijn's initial PageRank code, possibly called every 10 min.
- [ ] build the most simple version of full PageRank
- stand-alone code
- simple text datafile reading
- Unit testing like example code by Martijn
- Then iterate and improve
- measure calculation time in Python 2.7
- incremental
- heuristic usage
First sprint finished.
I implemented a standard Monte Carlo random walk algorithm (R random walks starting from each node, without resets, with average length reset_prob^-1). The page rank is determined by the number of visits to each node divided by the sum of visits over all nodes (see algorithm 4).
For nodes that are added to the graph we run R random walks starting at the corresponding nodes. For edges that are added to the graph we use the visit times of the node out of which the new edge goes and determine the number of random walks which should pass through the newly added edge. Then we run the given number of random walks starting at the node the new edge connects to and add the visit time of each node to the existing visit times vector. For edges that are removed from the graph we do the same but subtract the visit times of each node from the visit times vector.
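A minimal sketch of this scheme, assuming an unweighted directed graph stored as `{node: set(successors)}`. The `add_edge` update below is a rough approximation of the bookkeeping described above, not the actual deployed code:

```python
import random

def run_walk(graph, start, reset_prob, rng):
    """One random walk: at each step stop with probability reset_prob,
    giving an average length of about 1/reset_prob."""
    path = [start]
    node = start
    while graph[node] and rng.random() > reset_prob:
        node = rng.choice(sorted(graph[node]))
        path.append(node)
    return path

def mc_pagerank(graph, R=25, reset_prob=0.15, seed=0):
    """R walks from every node; page rank = visit count / total visits."""
    rng = random.Random(seed)
    visits = {v: 0 for v in graph}
    for v in graph:
        for _ in range(R):
            for u in run_walk(graph, v, reset_prob, rng):
                visits[u] += 1
    total = float(sum(visits.values()))
    return {v: visits[v] / total for v in graph}, visits

def add_edge(graph, visits, a, b, reset_prob=0.15, seed=1):
    """Incremental update for a new edge (a, b): estimate how many of the
    walks that visited `a` would now continue over the new edge, run that
    many fresh walks from `b`, and add their visit counts (for a removed
    edge one would subtract instead)."""
    rng = random.Random(seed)
    graph[a].add(b)
    n_redirected = int(visits[a] * (1 - reset_prob) / len(graph[a]))
    for _ in range(n_redirected):
        for u in run_walk(graph, b, reset_prob, rng):
            visits[u] += 1
    total = float(sum(visits.values()))
    return {v: visits[v] / total for v in graph}

g = {0: {1}, 1: {2}, 2: {0}, 3: set()}
ranks, visits = mc_pagerank(g)
updated = add_edge(g, visits, 3, 0)
```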
I tested the algorithm on a number of different example graphs, including the graph of the multichain data set and determined the execution times.
For the graphs generated by blocks 0-100 and 50-150, it took 24.5 seconds to run the algorithm.
For the graphs generated by the blocks 0-1000 and blocks 500-1500 respectively, it took 72 seconds to compute the page ranks of both the new and the old graph.
Unfortunately, I couldn't run the algorithm on the entire data base, because Spyder froze.
The next step would now be to optimize the algorithm. These authors divide the graph into a partition Q containing all changed nodes/edges and no outgoing links, and a partition P. This seems like the right direction, seeing as the multichain network only grows in 'one direction', making such a partition easy to obtain.
I couldn't find an English version of the Chinese paper, so I don't quite know what to do with it.
For inspiration: https://github.com/vandenheuvel/incremental-pagerank
Feedback in terms of coding:
- Write comments
- Use a unit testing framework
- Separate code into multiple modules, by function (e.g. move out the database code)
- Follow PEP8
Function-wise changes: **Compute only with totals.** Don't use the block graph directly; aggregate blocks for each public key. Either use contribution totals directly as edges, or scale them first (it is unclear how this should happen: capped at a limit, logarithmically, etc.). Possibly, one could also use net contributions, and have no edge at all if the net total is negative.
**All walks from a single node.** When using this algorithm in practice, there is only one node known to be secure: yourself. Take a random node and compute all values from the perspective of that node.
**Continuous updating.** A mechanism needs to be in place which updates computed values once a new block arrives: either by saving all walks and replacing them, by updating parts of walks as discussed, or by using a vector of computed values.
**Think about churn (low priority).** Maybe many public keys will never come online again; these may be best removed from the graph.
**Think about speed of convergence.** This matters a lot, especially if the computed value will determine its updated value after a new edge arrives.
**Think about how to set up sybil resistance experiments.** The end goal is sybil resistance. Could you create test graphs with sybil regions? Can we measure their scores, and in this way test resistance?
I have written a class to incrementally compute the personalized page ranks of a directed graph from the perspective of a predetermined node. Personalized page rank is a variant of page rank where the ranks of nodes are determined by their distance from a given seed node. In this class the page ranks are computed using the Monte Carlo method, whereby a number of random walks of predetermined length originate from the seed node and walk along the edges through the graph. They jump back to the seed node with a given probability at every step, and a walk is terminated once it reaches a certain length. If a random walk reaches a "dangling node", i.e. a node with no outgoing edges, it is reset as well. A vector of visit times is computed, containing the number of times the random walks pass through the individual nodes, and the page rank of a node is given by its visit times divided by the accumulated visit times of all nodes in the graph.
The personalized page rank given below is incremental, meaning that it can be recomputed every time the underlying graph structure is modified, i.e. edges and nodes are added or removed. In order to recompute the page ranks, one doesn't have to recompute the entire set of random walks through the graph. Instead, the given set of random walks is modified. The idea behind this is that a given random walk does not pass through every single edge. Only random walks that reach a node whose outgoing edges have been modified, i.e. an edge is added or removed or the weight of an edge is changed, need to be recomputed, starting from the first node at which such changes occur. I compare my results to the regular power iteration method, for which I used a maximum of 500 iterations. The power iteration method is more accurate than the Monte Carlo method. However, the goal was to implement an incremental pagerank. Seeing as the power iteration has to be recomputed in full every time the graph changes, the Monte Carlo algorithm turns out to be the more efficient option, and as the graph size increases it will become even more efficient relative to power iteration.
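As a rough illustration of the walk process described above (not the actual class), a Monte Carlo personalized page rank could look like this; the graph format and parameter names are assumptions:

```python
import random

def personalized_pagerank_mc(graph, seed, n_walks=1000, walk_len=80,
                             reset_prob=0.15, rng=None):
    """Monte Carlo personalized page rank: every walk starts at `seed`,
    jumps back to `seed` with probability `reset_prob` at each step (and
    at dangling nodes), and is cut off after `walk_len` steps."""
    rng = rng or random.Random(0)
    visits = {v: 0 for v in graph}
    for _ in range(n_walks):
        node = seed
        visits[seed] += 1
        for _ in range(walk_len):
            if rng.random() < reset_prob or not graph[node]:
                node = seed  # reset, also from dangling nodes
            else:
                node = rng.choice(sorted(graph[node]))
            visits[node] += 1
    total = float(sum(visits.values()))
    return {v: visits[v] / total for v in graph}

# toy graph: node 3 has no incoming edges, so the seed never reaches it
g = {0: {1, 2}, 1: {2}, 2: {0}, 3: {0}}
ppr = personalized_pagerank_mc(g, seed=0)
```

Note how, unlike global page rank, nodes unreachable from the seed receive a rank of exactly zero.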
I created some unit testing examples where graphs are randomly generated and the page ranks are computed (both Monte Carlo and power iteration), the graphs are then randomly modified, i.e. edges and nodes removed and added, and finally the page ranks are computed again: once incrementally (Monte Carlo) and once by power iteration. I assert that both vectors of page rank values are approximately equal.
I found out that the accuracy of the page rank vectors depends largely on the length of the random walks and not on the number of random walks generated. The graphs below show the euclidean distance between the two page rank vectors by length and number of random walks:

We can see that the distance converges approximately to zero as the length of the walks increases; a walk length of ca. 80 is very accurate. We also see that the number of random walks plays only a minor role. Finally, I have written a class that opens the database of trustchain transactions and generates a graph, where nodes correspond to public keys (peers) and the weights of edges correspond to the net data flow between two peers. I ran the Monte Carlo algorithm on this data set and am now able to efficiently compute updates made to the trustchain database, incrementally.
Finally Martijn gave me access to the new data set which I will incorporate into the algorithm.
**Make the implementation scale.** Available memory seems to be a bottleneck. Re-examine portions of your code on which a lot of time is spent; the goal is to handle the entire dataset. Use generators, delete statements and, if necessary, move some of the preprocessing tasks to sqlite.
**Normalized error value.** Make sure error values over different database sizes are comparable. I recommend using an average relative error (relative to the power iteration value).
**To cap or not to cap the walk length.** Two ways to do the walking:
- Stopping and resetting walks probabilistically with a fixed probability
- Stopping walks after a fixed number of steps and resetting probabilistically with a fixed probability
We suspect the first one approximates typical personalized PageRank as implemented by NetworkX, while the second one results in slightly different values. This may explain the asymptotic behaviour of the error as measured in the above post: as the cap on the walk length increases, the second method approaches the first.
In the next sprint, also pick up the first method and compare.
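The difference between the two stopping rules can be illustrated with a quick simulation. For a reset probability of 0.2 the uncapped (geometric) walk length averages (1 − 0.2)/0.2 = 4 steps, and the capped variant should approach this as the cap grows; the parameters below are illustrative:

```python
import random

def walk_len_probabilistic(reset_prob, rng):
    """Method 1: stop each step with probability reset_prob (geometric)."""
    steps = 0
    while rng.random() > reset_prob:
        steps += 1
    return steps

def walk_len_capped(reset_prob, cap, rng):
    """Method 2: same, but hard-stop after `cap` steps."""
    steps = 0
    while steps < cap and rng.random() > reset_prob:
        steps += 1
    return steps

rng = random.Random(7)
n = 20000
mean_uncapped = sum(walk_len_probabilistic(0.2, rng)
                    for _ in range(n)) / float(n)
means = {}
for cap in (5, 20, 80):
    means[cap] = sum(walk_len_capped(0.2, cap, rng)
                     for _ in range(n)) / float(n)
    print("cap=%3d  mean length %.2f (uncapped %.2f)"
          % (cap, means[cap], mean_uncapped))
```

With a tight cap the walks are noticeably shorter than the geometric ideal, which matches the asymptotic error behaviour observed above.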
**Sybil resistance.** Make artificial test cases for sybil resistance. Create one graph without a sybil region, and then compare with graphs which do have one. Vary how connected the trusted node is to the region (as measured by shortest path, or average path lengths of the walks), as well as the region's size. Also test whether extreme transaction sizes influence the rankings significantly.
For completeness, include a quick comparison of the NetworkX PageRank and NetworkX personalized PageRank.
**Completing the picture.** Reflect on whether these results make sense when compared with the relevant literature. Then consider the following questions:
- Approximately which database sizes is your algorithm (leaving aside practical I/O issues) able to handle? How long does processing take?
- Can you tell if the algorithm is sybil resistant? And is there a theoretical explanation for that?
- Which next steps should be taken, working towards closing this ticket?
We proposed the EscapeLimit algorithm in the past; there are multiple random walk variants:
- simple random walk
- weighted random walk
- Metropolis-Hastings random walk
- Maximal Entropy random walk
Sybil Resistance:
I simulated a random graph of 500 nodes with a Sybil region of 50 nodes and a range of attack edges from 0 to 50. For each number of attack edges I computed the ROC curve and determined the area under the curve as a measure of the sybil resistance of the algorithm.
I also computed the proportion of false positives and false negatives for each number of attack edges and obtained the following results:
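For reference, the AUC can be computed directly from the trust scores of honest and sybil nodes via the rank-sum formulation. The scores below are synthetic stand-ins, not the actual output of the experiment:

```python
import random

def roc_auc(honest_scores, sybil_scores):
    """Area under the ROC curve via the rank-sum formulation: the
    probability that a randomly chosen honest node outscores a randomly
    chosen sybil (ties count half)."""
    wins = 0.0
    for h in honest_scores:
        for s in sybil_scores:
            if h > s:
                wins += 1.0
            elif h == s:
                wins += 0.5
    return wins / (len(honest_scores) * len(sybil_scores))

# hypothetical trust scores: honest nodes tend to score higher
rng = random.Random(3)
honest = [rng.gauss(0.7, 0.1) for _ in range(50)]
sybils = [rng.gauss(0.4, 0.1) for _ in range(50)]
area = roc_auc(honest, sybils)
print("AUC = %.3f" % area)
```

An AUC of 1.0 means the ranking separates honest nodes from sybils perfectly; 0.5 means it does no better than chance.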

Please no random walks of 100 hops! Sane values for the reset probability are in [15%, 30%], not 1%. See the empirical grounding in the old paper on this, and https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon

Wrapping up with a short report in ipynb format (or an old-skool 2-column IEEE format paper).
- focus on short random walks
- 500 honest nodes, 1000 densely connected attackers, 0-500 attack edges which are randomly attached.
- Reset probability = .5, .3, .2, .1 ?
- ER model to connect honest nodes, 10k edges.
- runtime cost! one iteration
- say 5 new nodes and 5 x 20 new edges are added
- this is a single 'update batch'
- measure how many milliseconds it takes
- performance analysis, CPU usage versus accuracy
- for the giant future work section: the "no CPU hogging" requirement. An algorithm should use the CPU continuously and lightly, not be idle for 15 min and then crunch full-time for 60 seconds. We need a continuously running, light background process.
- Key Performance Indicator: "Therefore, we evaluate the fraction of walks escaping into the sybil region when starting at any node in the honest region, which is called the escape probability."
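A sketch of how this KPI could be measured, assuming the same `{node: set(successors)}` graph representation used earlier; the toy graph and parameters are illustrative:

```python
import random

def escape_probability(graph, honest, sybil, n_walks=200, reset_prob=0.2,
                       rng=None):
    """Fraction of walks started at honest nodes that enter the sybil
    region before resetting (the KPI quoted above)."""
    rng = rng or random.Random(0)
    escaped = 0
    total = 0
    for start in honest:
        for _ in range(n_walks):
            node = start
            total += 1
            while graph[node] and rng.random() > reset_prob:
                node = rng.choice(sorted(graph[node]))
                if node in sybil:
                    escaped += 1
                    break
    return escaped / float(total)

# toy graph: honest triangle {0, 1, 2}, sybil node 3 behind one attack edge
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {3}}
p = escape_probability(g, honest={0, 1, 2}, sybil={3})
print("escape probability: %.3f" % p)
```

Fewer attack edges or a higher reset probability both push this value down, which is exactly the trade-off the experiments above explore.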
Reviewed final project report:
- URLs broken
- dropedge: http://www.eecs.harvard.edu/cs286r/courses/fall09/papers/tang09.pdf
- $M=\left<P,I,a,\omega\right>$
- $$ w((v_p,v_q)):=\sum\limits_{i\in{}I:a(i)=(p,q)}w(i,p) $$
- "this agent msut be completely"
- by $\tilde{E} is a matrix of only zeros, with a column of ones at the i-th place, with $i$
- Performance analysis
- simpler and easier to understand: probability of escape into the Sybil region
- ROC curves are a powerful and complicated measure
In our discussion on September 6, the following argument was made in favor of small hops:
If you walk with a small reset probability and large average length, you extend your reach to the entire network, which we want to avoid. You'll always reach the sybils. "Six degrees of Kevin Bacon".
While it is true that it becomes possible to reach the entire network, the probability of reaching one of the nodes which you can reach in n steps but not in n − 1 steps is extremely small. Indeed: the number of reachable nodes becomes enormous, but there is still only one walk. The probability that a far-away node is visited many times is extremely small, both because the walk is rarely very long and because there are so many nodes within reach. Instead, the walks will more often visit the nodes which are closer.
The above doesn't necessarily argue for long walks, but it shows that they can't be dismissed with that argument.
When we're approximating personalized pagerank, only the accuracy and computational cost matter. The incremental method using long walks is accurate when approximating personalized pagerank with a low reset probability; see the above experiments by @alexander-stannat. Whether the computational costs are acceptable still needs some discussion in my opinion (see bottom). If one is interested in nodes which are close by, I think it might be worth considering sparse matrix-vector multiplication as an approximation:
- First phase: A quick connectedness check with a maximum length, to limit the number of nodes for the second phase
- Second phase: Build a sparse transition graph for the nodes found in phase one, and do a few sparse matrix-vector multiplications in numpy to find your probabilities
The result will be similar to personalized pagerank with high alpha. Example: Limit the nodes to third degree connections. Choose alpha = 0.5, and do three multiplications. Had we approximated with random walks, only 12.5% of walks could have reached out of our selection, when assuming equal probabilities for every hop. In practice, people will prefer helping those they know, so this number would be significantly smaller in practice (especially after implementing a peer selection algorithm which gives lowest priority to those with a connection degree that is too large).
As a bonus, the Tribler users could be given a simple choice they can understand: "do you only trust the friends of your friends, or also their friends?".
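A rough sketch of the two-phase idea in plain Python (dict-based sparse multiplication rather than numpy, and illustrative names throughout):

```python
def k_hop_nodes(graph, seed, k):
    """Phase 1: BFS up to k hops to bound the node set."""
    seen = {seed}
    frontier = [seed]
    for _ in range(k):
        nxt = []
        for v in frontier:
            for w in graph.get(v, ()):
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        frontier = nxt
    return seen

def local_ppr(graph, seed, k=3, alpha=0.5):
    """Phase 2: a few sparse transition steps restricted to the k-hop
    neighbourhood; alpha is the restart weight per multiplication."""
    nodes = k_hop_nodes(graph, seed, k)
    p = {seed: 1.0}
    for _ in range(k):
        nxt = {seed: alpha}  # restart mass goes back to the seed
        for v, mass in p.items():
            out = [w for w in graph.get(v, ()) if w in nodes]
            if not out:  # dangling within the selection: return to seed
                nxt[seed] = nxt.get(seed, 0.0) + (1 - alpha) * mass
                continue
            share = (1 - alpha) * mass / len(out)
            for w in out:
                nxt[w] = nxt.get(w, 0.0) + share
        p = nxt
    return p

# toy graph: node 5 points at the seed but is not reachable from it
g = {0: {1, 2}, 1: {3}, 2: {3}, 3: {4}, 4: set(), 5: {0}}
p = local_ppr(g, seed=0, k=3, alpha=0.5)
```

With alpha = 0.5 and three multiplications, at most 12.5% of the probability mass can sit at the third-hop frontier, matching the estimate above.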
@devos50 could you weigh in on what, from a tribler design perspective, you think is an acceptable computational cost for such an algorithm? The current implementation would use a single core 100% for less than a minute for each update (updates can be accumulated and processed at once). @synctext mentioned that he prefers an algorithm which instead uses the cpu continuously but never longer than a few ms. @synctext what was the reason for this preference? Do you want to keep these computations on the twisted event loop?
This week, we brainstormed a bit about modularization of Tribler (splitting Tribler up in separate components, like IPv8, VoD, market etc). I think a dedicated 'trust module' would be an excellent first candidate to adhere to a modularized design. As a result, I would like to see this module being executed as a separate subprocess and communicate with the Tribler core using an IPC protocol (like JSON-RPC). Depending on the scheduler in the OS and the number of available cores, this subprocess should have minimal impact on the performance of the main Tribler process and overall user experience.
@devos50 could you weigh in on what, from a Tribler design perspective, you think is an acceptable computational cost for such an algorithm?
If we follow the design I elaborated on in this comment, we have more flexibility regarding trust computations. Yet, it is very hard to come up with reasonable bounds for this. I would suggest playing around with the implementation and get more insight into the scalability and computational requirements of the algorithm first. To answer your question, we would need to know the tradeoff between required computation time and accuracy. I'm not sure whether the performed experiments are sufficient to answer this question yet.
Note that @ichorid can also help you with performance optimizations.
@devos50 while I understand that coming up with reasonable bounds is hard, I do think it is good to at least have a discussion about them. They're a requirement for the solution, and this requirement influences the search for a good algorithm. The one implemented by @alexander-stannat was rejected by @synctext because of its system load profile (alternating a short period of calculations and long periods of idle). This load profile is inherent to the algorithm, and as such the algorithm could have been rejected much sooner than it was. It is my opinion that @alexander-stannat needs, or at least would benefit greatly from knowing, such requirements up front while working towards a solution.
And lastly: isn't it too soon to spend time making performance optimizations? I would argue that other algorithms should be explored first, especially since the previous algorithm was deemed unsuitable.
All code in the Report is now running bug free. The LaTeX compatibility issues have been fixed and all outgoing links are now actually outgoing :P
Feel free to check it out.
Next Steps:
Create a plugin to generate visualisations and calculations of trust in the network, using real-time Tribler data. Begin with a 100-node graph using the networkx library and generate a stand-alone GUI, where users can see the trustworthiness of their peers. A stand-alone skeleton; keep it as simple as possible. Then talk to @mitchellolsthoorn about implementing this in Tribler itself. Then use real-time data from the Tribler client and update incrementally in real time.
Future Work:
- Update Strategies against CPU hogging
- Accumulation of errors over time and when
- What to do about offline and relay
- What to do about relay nodes?
- Possibly improve accuracy
- Change graph structure to bidirectional to take into account absolute up- and download values
- Look into the role of relay nodes?
Related work assuming obedient users.
Related work from the ancient times of file sharing and public goods. All ignoring real world security issues, like the Sybil attack. Thus showing the deadly gap between theory and Internet reality.
- Incentives for Sharing in Peer-to-Peer Networks; classical "Agent Utility" model, micropayments, and internal currency "points". 2001, very early work.
- Incentives for large peer-to-peer systems
- Incentives to Promote Availability in Peer-to-Peer Anonymity Systems 2005
- Fairness, incentives and performance in peer-to-peer networks 2003
- Free riding: A new challenge to peer-to-peer file sharing systems 2003
- Service capacity of peer-to-peer networks 2004, very early model of Bittorrent tit-for-tat.
- A simple reputation model for BitTorrent-like incentives 2009 work
Netlogo models:
- iterated version of the prisoner's dilemma 2002
- multiplayer version of the iterated prisoner's dilemma, 2002
- A Cooperative Species, also on GitHub
Reality suffers from strategic manipulation, fraud, Sybil attacks, oversupply, and credit hoarding:
- "Sybilproof Reputation Mechanisms", 2005 pioneering work
- Overcoming free-riding behavior in peer-to-peer systems, no answers, but asks the essential question: "How does the availability of cheap identities affect user behavior in P2P systems, and what are its implications on the desired generosity toward strangers?" 2005
- The problem of upload competition in Peer-to-Peer systems with incentive mechanisms our own work from 2012
Dynamic trust metrics for peer-to-peer systems, early work, 2005 "One of the fundamental challenges for peer-to-peer (P2P) systems is the ability to manage risks involved in interacting and collaborating with priorly unknown and potentially malicious parties. Reputation-based trust management can mitigate this risk by deriving the trustworthiness of a certain peer from that peer's behavior history."
Filtering trust opinions through reinforcement learning, AI-solves-everything 2014 work. Solution: AI magic + making the problem simpler by assuming a witness can always be inserted into any transaction. "we propose the Actor–Critic Trust (ACT) model, which is an adaptive trust evidence aggregation model based on the principles of reinforcement learning. The proposed method dynamically adjusts the selection of credible witnesses as well as the key parameters associated with the direct and indirect trust evidence sources based on the observed benefits received by the trusting entity."
Update:
Very basic GUI is now implemented. The idea was to implement an interactive graph visualising the Tribler network, including the trust rankings of all agents, determined by our current Monte Carlo pagerank implementation. The current implementation is still a skeleton of the final product and looks as follows:
The code can be found here. Ideally, it will resemble the current Tribler network explorer in design and layout.
So far, a few features are available, such as finding the highest ranking (most trustworthy) node in the network. This is the node that has the largest number of visit times out of all nodes in the network. However, it may not be the node everyone can or wants to interact with, due to the decline in transitive trust over multiple hops.
Another feature is the "Show Neighbourhood" button, with which a node can get a close-up look at all its adjacent nodes. The aim for the future is to create a GUI in which the user can click on an arbitrary node in the network and then zoom into that node's respective neighbourhood. That way one could explore the entire Tribler network along interaction paths.
Finally, we're working on a "Most Likely Path" feature which will highlight the path in the network along which the majority of random walks in the Monte Carlo algorithm have travelled.
For our next steps, we will enhance the interface's interactivity, i.e. clicking on nodes and edges will reveal information about them, along with a more aesthetically pleasing design and layout and some additional features, such as viewing the overall net contribution of nodes to the network, colour-coding trustworthiness, etc. This can be discussed in our next meeting.
Solid progress. Please focus on feature completeness and leave "GUI beauty" to others...
To reduce rendering CPU consumption you can use "DrawCircular". Hard-code that there is 1 node at the center, 1 inner circle and at most 1 outer circle. Fill the outer circle with all 2-hop neighbors of the node we currently view. For simplicity, draw the inner circle with even spacing for now. The inner-circle spacing is actually very simple: position each node in the middle of its connected outer-circle peers. However, our dataset probably has outer-circle nodes connected by multiple 2-hop paths :-)
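The suggested layout is cheap to compute; a hypothetical helper (evenly spaced rings, names illustrative) might look like this:

```python
import math

def circular_layout(center, inner, outer):
    """Positions for the suggested DrawCircular layout: the focused node
    at the origin, 1-hop neighbours evenly spaced on an inner circle,
    2-hop neighbours evenly spaced on an outer circle."""
    pos = {center: (0.0, 0.0)}
    for radius, ring in ((1.0, inner), (2.0, outer)):
        for i, v in enumerate(ring):
            angle = 2 * math.pi * i / max(len(ring), 1)
            pos[v] = (radius * math.cos(angle), radius * math.sin(angle))
    return pos

pos = circular_layout("me", ["a", "b"], ["c", "d", "e"])
```

Averaging each inner node's angle over its connected outer-circle peers, as suggested above, would be a small extension of this helper.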
The report in IEEE format is now ready. Now it's back to the GUI.
Nice!
Side note, are you sure that's IEEE style? Looks a lot like the ACM template to me ;)