# Dealing with Small BDDs using Depth-First Algorithms
With Adiar v1.2 we have been able to decrease the threshold as to when Adiar is viable in comparison to conventional depth-first BDD packages. Yet, this does not remove the major relative performance decrease for tiny BDDs. The upshot is that Adiar still has very undesirable characteristics in contexts where one manipulates a lot of tiny decision diagrams (e.g. our Picotrav benchmarks).
Interestingly, the threshold we derived in our experiments is somewhat below a threshold in the CAL BDD package [Sanghavi96]: below this size, CAL switches from its breadth-first algorithms to conventional depth-first ones. Hence, it is likely that we have hit a lower bound on when time-forward processing is a viable option.
For an end-user, it is of interest to have Adiar be efficient across the board. That is, we also need to be competitive on the tiniest of instances. To this end, we will (similar to CAL) have to switch to a unique node table and use depth-first algorithms. To achieve this, we have to go through the following four steps. From a research perspective, the most interesting parts are the first two; Jaco would like to investigate those by themselves.
## Part 1: Implement a Unique Node Table
From our experiments, the BDD package BuDDy is the fastest implementation. My hypothesis is that this is due to its extremely simple design in terms of code, but also due to the way it lays out its unique node table.
### BuDDy's Hash Table
The hash table stores values of type `T` (where `T` is some type of a node). It is kept as a single array of elements `entry<T>`, which adds two indices to `T`: `hash` and `next`. The hash value of an element of type `T` is used to look up the `entry<T>` at said index; its `hash` field holds the index of the first entry in that bucket. The bucket is then traversed via the `next` pointers until the desired node is found.

To this end, all entries are initialised with the default value `T{}` and the `next` pointer of element i is set to i+1. This creates one long chain of free nodes. A variable `free` is initially set to 0. When a new node is added to the table, the next free index i = `free` is used; `free` is updated to `table[i].next` before that entry's `next` is set to null.
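A minimal self-contained sketch of this layout (the field and function names here are mine, not BuDDy's, and the hash function is a placeholder):

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t NIL = UINT32_MAX;

// Hypothetical node type; BuDDy packs this much more tightly.
struct node {
  uint32_t label = 0, low = 0, high = 0;
  bool operator==(const node &o) const {
    return label == o.label && low == o.low && high == o.high;
  }
};

template <typename T>
struct entry {
  T value{};           // default-initialised, i.e. a 'free' entry
  uint32_t hash = NIL; // index of the first entry in this bucket
  uint32_t next = NIL; // next entry in the bucket / next free entry
};

template <typename T>
class node_table {
  std::vector<entry<T>> table;
  uint32_t free_head;

  // Placeholder hash over a node's content.
  uint32_t hash_of(const T &v) const {
    return (v.label * 12582917u ^ v.low * 4256249u ^ v.high * 741457u)
           % table.size();
  }

public:
  node_table(uint32_t size) : table(size), free_head(0) {
    // Chain all entries into one long list of free nodes.
    for (uint32_t i = 0; i < size; ++i) { table[i].next = i + 1; }
    table[size - 1].next = NIL;
  }

  uint32_t find_or_insert(const T &v) {
    const uint32_t h = hash_of(v);

    // Traverse the bucket via the 'next' pointers.
    for (uint32_t i = table[h].hash; i != NIL; i = table[i].next) {
      if (table[i].value == v) { return i; }
    }

    // Not found: pop the next free entry (out-of-memory handling and
    // garbage collection elided) ...
    const uint32_t i = free_head;
    free_head = table[i].next;

    // ... and prepend it to its bucket (as BuDDy does).
    table[i].value = v;
    table[i].next = table[h].hash;
    table[h].hash = i;
    return i;
  }
};
```

Here, `find_or_insert` prepends new nodes to their bucket, which is exactly the design choice discussed next.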
### Bucket Ordering
When adding a node to a bucket described by the `next` pointers, one could place it in many different positions. Which one is best should be determined by experiments.
- At the start? This would be beneficial assuming said node is likely to be needed again soon. Furthermore, it might make the code slightly cleaner.
- Maintain an order based on table index? This would improve cache locality, since one only jumps forwards with respect to addresses.
- At the end? This would be beneficial for cache locality, since we only touch the most recent cache lines. It might also be a simple way to approximate an order based on table index.
- Maintain an order based on Adiar's node uid? This would allow for hash-independent early termination. Yet, it might be worse for cache performance.
BuDDy adds the new nodes at the front of the bucket.
### Garbage Collection
For garbage collection, each `entry<T>` also includes a `ref_count` variable. When allocating a node, the reference counts of its children are incremented by one. Furthermore, the `dd` class updates the reference count of the root in its constructors, assignments, and destructor.

The basic idea behind collection is to free all nodes that have a reference count of zero (recursively decrementing their children). The freed entry's `next` pointer is updated to be `free` before it is placed at the front of the `free` list (its content does not need to be reset to `T{}`, as that will be fixed on the next allocation).
BuDDy's garbage collection is essentially a mark-and-sweep in two simple linear scans (which is good for CPU cache performance):
- In a forward iteration, all non-dead nodes are marked and the `hash` index is reset back to null.
- In a backwards iteration, a to-be-kept node is re-hashed and appended/prepended to its bucket. A to-be-freed node is instead added to the `free` chain as described above. Since the iteration is backwards, this ensures that all allocated nodes stay clumped together towards the beginning of the array.
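Sketched on top of the structures above (the explicit mark bit is my addition; BuDDy instead steals a bit from an existing field, and the names remain hypothetical):

```cpp
// Extends the earlier sketch: each entry now also carries a 'ref_count'
// and a mark bit.
struct gc_entry {
  node value{};
  uint32_t ref_count = 0;
  bool marked = false;
  uint32_t hash = NIL;
  uint32_t next = NIL;
};

// Same placeholder hash as in the earlier sketch.
uint32_t hash_of(const node &v, uint32_t size);

void mark_recursively(std::vector<gc_entry> &table, uint32_t i) {
  if (table[i].marked) { return; }
  table[i].marked = true;
  // Recurse to the children (terminal cases elided in this sketch).
  mark_recursively(table, table[i].value.low);
  mark_recursively(table, table[i].value.high);
}

void gc(std::vector<gc_entry> &table, uint32_t &free_head) {
  const uint32_t size = table.size();

  // Scan 1 (forwards): mark all non-dead nodes, reset the bucket heads.
  for (uint32_t i = 0; i < size; ++i) {
    if (table[i].ref_count > 0) { mark_recursively(table, i); }
    table[i].hash = NIL;
  }

  // Scan 2 (backwards): re-insert marked nodes, free all others.
  free_head = NIL;
  for (uint32_t i = size; i-- > 0;) {
    if (table[i].marked) {
      table[i].marked = false;
      const uint32_t h = hash_of(table[i].value, size);
      table[i].next = table[h].hash; // prepend to its bucket
      table[h].hash = i;
    } else {
      table[i].next = free_head;     // prepend to the free list; iterating
      free_head = i;                 // backwards keeps live nodes clumped
    }                                // at the low indices
  }
  // The computation cache(s) would also be emptied here.
}
```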
A garbage collection run also empties out the computation cache (or at least removes all entries that reference a freed node). This is another linear iteration through each computation cache.
## Part 2: Sweeping through the Unique Node Table
How do we best switch from the internal memory (depth-first) to the external memory (time-forward processing) paradigm? How do we go back to the (depth-first) unique node table when the result of some time-forward processing algorithm turns out to be small enough to fit into it?
We could do so by converting the data from one paradigm to the other. But this has a computational overhead that is, in fact, not needed: we can run the time-forward processing algorithms directly on inputs from the depth-first algorithms and also feed their results back into the unique node table.
### Unique Node Table in Top-down Sweeps
Adiar's top-down algorithms exploit a levelized ordering of each node's unique identifier (uid). Currently, they use a `node_stream` to `seek(...)` (#401) the uid within a file on disk.

For the algorithm to be correct, this node does not have to be read from disk: in the depth-first setting, the uid of a node is merely its index in the unique node table. Hence, we can replace the `node_stream` within the algorithm with another class that merely does an O(1) lookup in the unique node table on each `seek(...)`.
This can be done merely by templating the algorithm, as was done for v1.2. In fact, the work on v2.0.0-beta.3 already implements an algorithm that changes the recursion order of Apply to abuse random access.
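That replacement class could be as simple as the following sketch (the class name is hypothetical; `entry` and `node` are from the sketch in Part 1):

```cpp
// Hypothetical drop-in replacement for node_stream: in the depth-first
// world, a node's uid is just its index in the unique node table, so
// seek(...) becomes an O(1) array lookup instead of file I/O.
class table_node_stream {
  const std::vector<entry<node>> &table;

public:
  table_node_stream(const std::vector<entry<node>> &t) : table(t) {}

  // Mirrors the seek(...) of node_stream, but never touches the disk.
  const node &seek(uint32_t uid) const { return table[uid].value; }
};
```

The top-down sweep itself stays untouched; it is simply instantiated with this class instead of `node_stream`.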
### Unique Node Table in Reduce Sweeps
The size of the unreduced BDD is an upper bound on the output size of Reduce. Hence, we can check whether the unreduced input would fit into the (depth-first) unique node table, assuming none of its current nodes can be reused. If so, we can run the Reduce algorithm with the `node_writer` replaced by the unique node table.
Since we are outputting nodes into the unique node table, we may as well use its much faster hashing to identify duplicate nodes. This entirely skips two sorting steps of the current algorithm, which will (roughly) double its performance.
Both of these things can be done with templates.
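For instance, the replacement writer might be little more than a forwarding wrapper (names hypothetical; `node_table` refers to the sketch in Part 1):

```cpp
// Hypothetical replacement for node_writer in Reduce: nodes are pushed
// straight into the unique node table, whose own hashing identifies the
// duplicates. This is what lets the two sorting steps be skipped.
class table_node_writer {
  node_table<node> &table;

public:
  table_node_writer(node_table<node> &t) : table(t) {}

  // Returns the uid (table index) that parents should point to instead.
  uint32_t push(const node &n) { return table.find_or_insert(n); }
};
```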
## Step 3: Implement Depth-first Alternatives
Adiar already has each algorithm split in two: (1) a policy (a compile-time strategy) dictates how recursion flows, while (2) a templated sweep-algorithm shuffles the recursions around to make the execution I/O-efficient. Hence, we can add templated depth-first algorithms that use the same policies on top of the unique node table and a computation cache.
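As a rough sketch of what such a depth-first alternative could look like (all names are hypothetical and the policy interface is simplified; `comp_cache` is sketched in the next subsection):

```cpp
// Hypothetical depth-first product construction reusing the same kind
// of compile-time Policy that dictates Adiar's recursion flow. Only the
// surrounding plumbing differs: a node table and a computation cache
// instead of priority queues and sorters.
template <typename Policy>
uint32_t df_product(node_table<node> &table, comp_cache &cache,
                    uint32_t f, uint32_t g) {
  // (1) The policy resolves terminal cases, e.g. 'false AND g'.
  if (Policy::is_terminal(f, g)) { return Policy::resolve_terminal(f, g); }

  // (2) Reuse previously computed results.
  if (auto hit = cache.find(f, g)) { return *hit; }

  // (3) The policy dictates which cofactors to recurse on.
  const auto rec = Policy::cofactors(table, f, g);
  const uint32_t low  = df_product<Policy>(table, cache, rec.f_low,  rec.g_low);
  const uint32_t high = df_product<Policy>(table, cache, rec.f_high, rec.g_high);

  // (4) Apply the reduction rule and insert into the unique node table.
  if (low == high) { return low; }
  const uint32_t res =
    table.find_or_insert({Policy::label(table, f, g), low, high});

  cache.insert(f, g, res);
  return res;
}
```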
### The Computation Cache
The computation cache is a leaky hash table of previous computation results. That is, on a collision it overwrites the previous entry. If possible, this table should persist across operations; yet, Adiar's use of generators and predicates probably makes some functions only use a local table.
As noted in [Pastva2023], we can abuse the `ref_count` value in the node table to not overpopulate the computation cache with irrelevant entries. If a node has only a single parent, then we can be sure that we will not recurse to it again. Hence, we do not need to add it to the computation cache (and possibly throw out a useful entry). The same also applies to product construction algorithms: if all input nodes have only a single parent, then there is no need to store the result of the computation.
Yet, we may still want to store these results for cross-operation caching. To this end, we can lower them to a "secondary priority": such a result is only added to the computation cache if the entry is free.
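A minimal sketch of such a leaky cache with this `ref_count`-based secondary priority (all names are hypothetical):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// A leaky hash table of prior results: a colliding insert simply
// overwrites the previous entry.
class comp_cache {
  struct slot {
    uint64_t key = 0;
    uint32_t result = 0;
    bool used = false;
  };
  std::vector<slot> slots;

  static uint64_t key_of(uint32_t f, uint32_t g) {
    return (uint64_t(f) << 32) | g;
  }
  size_t index_of(uint64_t key) const {
    return (key * 0x9E3779B97F4A7C15ull) % slots.size();
  }

public:
  comp_cache(size_t size = 1 << 16) : slots(size) {}

  std::optional<uint32_t> find(uint32_t f, uint32_t g) const {
    const uint64_t key = key_of(f, g);
    const slot &s = slots[index_of(key)];
    if (s.used && s.key == key) { return s.result; }
    return std::nullopt;
  }

  // 'force == false' is the "secondary priority": the result is stored
  // only if the slot is free, e.g. when all input nodes have a single
  // parent and the entry is kept merely for cross-operation caching.
  void insert(uint32_t f, uint32_t g, uint32_t result, bool force = true) {
    const uint64_t key = key_of(f, g);
    slot &s = slots[index_of(key)];
    if (!force && s.used) { return; }
    s = {key, result, true};
  }
};
```

A caller would then pass, e.g., `force = table.ref_count(f) > 1 || table.ref_count(g) > 1` (with `ref_count(...)` being a hypothetical accessor on the node table).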
## Step 4: Identify the Depth-first Threshold(s)
We want to run the depth-first version of an algorithm if the following is satisfied:
- All inputs reside in the (depth-first) unique node table. That is, none of the inputs are stored on disk.
Furthermore, we may consider also requiring one or more of the following to be the case.
- The size of all inputs is below the 2^19 threshold of CAL (or something equivalent).
- The worst-case output size fits within the unique node table, assuming no nodes can be reused. This may be complicated in practice, as garbage collection may provide the space needed. Is it possible to keep track of the number of "potentially dead nodes"? Maybe the "time since the latest garbage collection", the number of to-be garbage collected roots, and the BDDs' sizes provide a good enough heuristic?
Notice that these heuristics depend on knowing the BDD size, which in itself is an O(N) operation if it is not stored with the BDD. Yet, we can store it only with the `bdd` class, so as to not make the node table bigger.
Alternatively, we may just run the depth-first version whenever possible and only switch to time-forward processing if the depth-first variant runs out of memory (i.e. we run garbage collection once and see whether we can finish processing).
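Putting the above together, the dispatch could look roughly as follows (all names, accessors, and constants are placeholders to be tuned by experiments):

```cpp
// Hypothetical dispatch between the two paradigms. The first check is
// the hard requirement; the rest are the heuristics discussed above.
// 'in_table', 'size', and 'free_capacity' are assumed accessors.
constexpr size_t DF_THRESHOLD = size_t(1) << 19; // cf. CAL [Sanghavi96]

bool use_depth_first(const bdd &f, const bdd &g,
                     const node_table<node> &table) {
  // Required: all inputs must already reside in the node table.
  if (!f.in_table() || !g.in_table()) { return false; }

  // Heuristic: inputs are below the CAL-like threshold.
  if (f.size() + g.size() > DF_THRESHOLD) { return false; }

  // Heuristic: the worst-case product size fits in the table, assuming
  // no node can be reused (and ignoring potentially dead nodes).
  return f.size() * g.size() <= table.free_capacity();
}
```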
## References
- [Sanghavi96] Jagesh V. Sanghavi, Rajeev K. Ranjan, Robert K. Brayton, and Alberto Sangiovanni-Vincentelli. “High performance BDD package by exploiting memory hierarchy”. In: *Proceedings of the 33rd Annual Design Automation Conference* (1996)
- [Long1998] David E. Long. “The Design of a Cache-Friendly BDD Library”. In: *Proceedings of the 1998 IEEE/ACM International Conference on Computer-Aided Design* (1998)
- [Pastva2023] Samuel Pastva and Thomas Henzinger. “Binary Decision Diagrams on Modern Hardware”. In: *Proceedings of the 23rd Conference on Formal Methods in Computer-Aided Design* (2023)
## Part 1: Parent Bucket Optimisations
Pastva and Henzinger [Pastva2023] have some extremely interesting insights: the number of BDD nodes with many parents is guaranteed to be very small. In fact, 70% of all nodes have a single parent, 20% have two parents, 3% have three, and so on.
Hence, we can do as follows:

- Replace the `hash` pointer in `entry<T>` with a `parents` pointer. Here, `parents` points to the start of the bucket of all nodes with this node as a child. Specifically, those where this node is the minimum child (using the minimum avoids terminals for all but the bottom-most nodes).
- The `next` pointer stores a singly-linked list of all other parents with this node as their minimum child. Again, whether a new node should be added at the beginning or the end, or kept sorted with respect to the nodes and/or the table, is up for experiments.
In practice, we should expect the `parents` pointer to immediately provide the desired node. This also has the added benefit that one does not need to spend a long time initialising the entire node table up front; BuDDy's design requires one does so (which takes a significant amount of time), as all `hash` indices have to be available everywhere. With this design, one can instead create the chain of free nodes lazily, as needed (only initialising the `parents` pointers to null).
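A minimal sketch of the lookup under this scheme, simplified to the linear bucket above (reusing `node` and `NIL` from the earlier sketches; `allocate` is the hypothetical free-list pop from Part 1):

```cpp
#include <algorithm>
#include <vector>

// A variant of entry<T> where 'hash' is replaced by 'parents': the head
// of the list of all nodes having this node as their *minimum* child.
struct parent_entry {
  node value{};
  uint32_t parents = NIL; // first parent with this node as minimum child
  uint32_t next    = NIL; // next such parent / next free entry
};

// Hypothetical free-list pop (cf. the sketch in Part 1).
uint32_t allocate(std::vector<parent_entry> &table);

uint32_t find_or_insert(std::vector<parent_entry> &table, const node &v) {
  const uint32_t min_child = std::min(v.low, v.high);

  // Traverse the (expectedly very short) list of parents of 'min_child'.
  for (uint32_t i = table[min_child].parents; i != NIL; i = table[i].next) {
    if (table[i].value == v) { return i; }
  }

  // Not found: allocate from the (lazily built) free list and prepend.
  const uint32_t i = allocate(table);
  table[i].value = v;
  table[i].next = table[min_child].parents;
  table[min_child].parents = i;
  return i;
}
```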
This leaves the bottom-most nodes with only terminal children. Since terminals are not explicit nodes in the table (?), we need to store two additional parent pointers next to the table. Yet, these terminals can have up to 3n parents (for n variables) when we use the table for both BDDs and ZDDs. Here, my simplified linear bucket above would have terrible performance.
Pastva and Henzinger propose a more sophisticated version of this idea that also solves the above issue with terminals:
- Also replace the `next` pointer in `entry<T>` with `next_0` and `next_1`. These two values correspond to edges in something quite similar to a prefix trie. This trie is traversed bit-by-bit based on the hash value of the variable label and the other child, i.e. the maximum one. While evaluating this requires a few more CPU cycles (and an extra if-statement), it completely collapses the depth of each bucket.

Note: Unlike in a prefix trie, one does not traverse down to a unique prefix before being given the result. Instead, nodes are placed as early as possible on a path that matches their prefix. This insight is vital for being able to fit the `next_0` and `next_1` pointers within `entry<T>`.

Note: I am unsure how this deals with hash collisions. Most likely it doesn't, as it just relies on not having more than 64 nodes share the same path.
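My (possibly incorrect) reading of this scheme as code, reusing the names from the previous sketches (`hash64` and `allocate` are hypothetical helpers):

```cpp
// A variant where 'next' is replaced by two trie edges. Lookups walk
// the bits of hash(label, max child), and a node is stored at the first
// vacant slot along its path rather than at a full unique prefix.
struct trie_entry {
  node value{};
  uint32_t parents = NIL;
  uint32_t next_0  = NIL; // edge for the next hash bit being 0
  uint32_t next_1  = NIL; // edge for the next hash bit being 1
};

uint64_t hash64(uint32_t label, uint32_t max_child);    // hypothetical
uint32_t allocate(std::vector<trie_entry> &table);      // hypothetical

uint32_t find_or_insert(std::vector<trie_entry> &table, const node &v) {
  const uint32_t min_child = std::min(v.low, v.high);
  uint64_t bits = hash64(v.label, std::max(v.low, v.high));

  // The table is fixed-size, so this pointer stays valid throughout.
  uint32_t *slot = &table[min_child].parents;
  while (*slot != NIL) {
    if (table[*slot].value == v) { return *slot; }
    slot = (bits & 1) ? &table[*slot].next_1 : &table[*slot].next_0;
    bits >>= 1; // at most 64 steps; hence the 64-node limit per path
  }

  const uint32_t i = allocate(table);
  table[i].value = v;
  *slot = i;
  return i;
}
```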