insights-core

Rework toposort algorithm as an ancestor traversal

PaulWay opened this issue 2 years ago • 2 comments

The 'toposort' function was identified as using significant amounts of processor time in real-world archive processing. Looking at the code, it does a number of things that make it slow:

  • it copies the entire data dictionary first
  • it adds empty set values to that dictionary for items without a dependency
  • it traverses the entire data dictionary looking for empty set values
  • it rebuilds the data dictionary on each iteration to remove the dependencies that have just been satisfied

While these operations provide a logical way of processing the data, they are wasteful of both space and processing time.
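For reference, that method can be sketched roughly as follows. This is illustrative Python, not the actual insights-core source; the names are mine, but each commented step corresponds to one of the costs listed above:

```python
def toposort_original(data):
    """Sketch: data maps each item to the set of items it depends on."""
    # Copy the entire input dictionary (the first cost listed above).
    data = {item: set(deps) for item, deps in data.items()}
    # Add empty-set values for dependencies that are not keys themselves.
    extra = set().union(*data.values()) - set(data)
    data.update({item: set() for item in extra})
    while data:
        # Traverse the whole dictionary looking for empty-set values.
        ordered = {item for item, deps in data.items() if not deps}
        if not ordered:
            raise ValueError("Circular dependency detected")
        yield ordered
        # Rebuild the dictionary, removing the satisfied dependencies.
        data = {item: deps - ordered
                for item, deps in data.items() if item not in ordered}
```

Each pass yields one 'level' of the sort, e.g. `list(toposort_original({'a': {'b'}, 'b': {'c'}}))` gives `[{'c'}, {'b'}, {'a'}]`.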

The new algorithm is a kind of inverse of that method - rather than working on items with a set of dependencies, it works on a dependency with a set of 'ancestors' that depend on that item. It can be seen as a 'data flow' model rather than a 'process flow' model: the original algorithm modelled objects (combiners, parsers, specs) calling other objects, and the new algorithm models the data flowing from the specs to the parsers to the combiners and so on.

The new algorithm starts by creating a dictionary of 'ancestors', whose keys are the objects in the value sets of the original data, and whose values are sets of the keys of the original data. It also stores the count of dependencies on each original data key (object). Before iterating, it also stores the set of 'terminators' - keys in the original data that are not ancestors of any object; these are the objects that nothing depends on. Rather than adding these to the dictionary of ancestors (and thus bulking out the ancestor dictionary), these are stored as a separate set for later.

We then iterate in much the same way as the original algorithm. We find the set of ancestors that (now) have no outstanding dependencies (which is now a set difference rather than a dictionary traversal), yield that set, then decrement the dependency counts for each of the dependents of those 'free' ancestors and delete them from the dictionary of ancestors. If we found no 'free' ancestors, then we've either traversed our way to the 'terminators' set of items that nothing else depends on, or we've found circular dependencies (as in the original algorithm).

Thus we've reduced the set differences to integer subtractions, the dictionary traversal to a key set difference, and eliminated the original data copy and the set union and bulking.
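A minimal sketch of the ancestor-based approach, following the setup and iteration described above (illustrative names, not the actual patch):

```python
def toposort_ancestors(data):
    """Sketch: data maps each item to the set of items it depends on."""
    # ancestors: dependency -> set of the items that depend on it.
    ancestors = {}
    # dep_count: item -> number of dependencies it is still waiting on.
    dep_count = {item: len(deps) for item, deps in data.items()}
    for item, deps in data.items():
        for dep in deps:
            ancestors.setdefault(dep, set()).add(item)
    # Terminators: keys nothing depends on, held in a separate set so
    # they don't bulk out the ancestors dictionary.
    terminators = set(data) - set(ancestors)
    while ancestors:
        # A key-set difference replaces the full dictionary traversal:
        # 'free' ancestors have no outstanding dependencies of their own.
        blocked = {item for item, count in dep_count.items() if count}
        free = set(ancestors) - blocked
        if not free:
            raise ValueError("Circular dependency detected")
        yield free
        for item in free:
            for dependent in ancestors[item]:
                dep_count[dependent] -= 1  # integer decrement, not a set op
            del ancestors[item]
    if terminators:
        yield terminators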

In testing, this is between 1.4 and 1.5 times faster than the original algorithm over small to medium sized dependency graphs (over 100,000 iterations) - in other words, it reduces the time taken by roughly 30%.

We also update the 'toposort_flatten' function to act as a generator. This doesn't improve the test speed noticeably, but it allows for a slightly faster start to processing the first item and may allow better concurrency in multi-threaded or asynchronous implementations.
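The generator form of 'toposort_flatten' might look like this sketch (the `toposort` stand-in here exists only to keep the example self-contained; the real insights-core function differs):

```python
def toposort(data):
    # Minimal level-yielding stand-in, for demonstration only.
    data = {item: set(deps) for item, deps in data.items()}
    data.update({d: set() for d in set().union(*data.values()) - set(data)})
    while data:
        free = {item for item, deps in data.items() if not deps}
        if not free:
            raise ValueError("Circular dependency detected")
        yield free
        data = {item: deps - free
                for item, deps in data.items() if item not in free}

def toposort_flatten(data, sort=True):
    # Generator form: yields items one at a time instead of building a
    # full list, so a consumer can start on the first item immediately.
    for level in toposort(data):
        for item in (sorted(level) if sort else level):
            yield item
```

For example, `list(toposort_flatten({'a': {'b', 'c'}, 'b': {'c'}}))` gives `['c', 'b', 'a']`; a consumer could instead pull items lazily with `next()`.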

Signed-off-by: Paul Wayper [email protected]

All Pull Requests:

Check all that apply:

  • [x] Have you followed the guidelines in our Contributing document, including the instructions about commit messages?
  • [x] Is this PR to correct an issue?
  • [x] Is this PR an enhancement?

PaulWay • Jan 04 '23 23:01