stable sort has major performance degradation
The new ToplogicalSort(), based on its StableTopologicalSort(), has really large performance degradation.
A program that used to take 42s, now takes 220-300s (sometimes more), with the vast majority of the time in StableToplogicalSort(). The current usage is the OSS wolfictl, but I will try to create a reproduction that is focused just on this.
The flame graphs below are from running that with the exact same inputs. The only difference is if I use v0.21.0 or v0.22.0
The first graph is calling StableToplogicalSort() with v0.22.0:
The second is calling TopologicalSort() with v0.22.0 (unchanged, which isn't surprising, as it just wraps StableTopologicalSort():
The third is calling TopologocialSort() with v0.21.0:
In the last, the sort doesn't even take enough time to be bothered into the list.
It might not be the sort itself, as other things might have changed between v0.21.0 and v0.22.0 🤷♂️
In the meantime, it should be easy to replicate with a graph that is [string, string], create a graph with maybe 600 nodes and a few edges among them, make sure it isn't cyclical, and then call TopologicalSort(). But I will try to recreate it.
Thank you for filing this issue. I did some research and decided to proceed as follows:
I will create a patch release that reverts TopologicalSort to the old implementation so that it remains unaffected by the performance issues of StableTopolocialSort.
Then I'm going to take a look at the implementation of StableTopologicalSort and its performance penalties. I'm pretty sure the frequent calls to sort.Slice are the problem, and I will have to dig into its performance characteristics and time complexity first. Maybe there is a more time-efficient solution.
I've created a program that will create a disconnected directed graph consisting of simple vertex chains and run StableTopologicalSort on it.
package main
import (
"github.com/dominikbraun/graph"
)
const (
numberOfVertices = 12
lengthOfChains = 4
)
func main() {
if numberOfVertices%lengthOfChains != 0 {
panic("numberOfVertices must be divisible by lengthOfChains")
}
numberOfRows := numberOfVertices / lengthOfChains
g := graph.New(graph.IntHash, graph.Directed())
for i := 1; i <= numberOfVertices; i++ {
_ = g.AddVertex(i)
}
vertex := 1
for i := 1; i <= numberOfRows; i++ {
for j := 1; j <= lengthOfChains; j++ {
if j < lengthOfChains {
if err := g.AddEdge(vertex, vertex+1); err != nil {
panic(err)
}
}
vertex++
}
}
_, _ = graph.StableTopologicalSort(g, func(a, b int) bool {
return a < b
})
}
The number of vertices and the chain length can be set using numberOfVertices and lengthOfChains. Note that numberOfVertices has to be a multiple of lengthOfChains. Maybe this is enough for testing the performance.
Once this is fixed, I'll release another patch release containing the optimized implementation.
I created a slight variant that has a much larger graph, and includes generating a profile file:
package main
import (
"fmt"
"log"
"os"
"runtime/pprof"
"github.com/dominikbraun/graph"
)
const (
numberOfVertices = 500
lengthOfChains = 125
profFile = "/tmp/cpu.prof"
)
func main() {
if numberOfVertices%lengthOfChains != 0 {
panic("numberOfVertices must be divisible by lengthOfChains")
}
numberOfRows := numberOfVertices / lengthOfChains
g := graph.New(graph.IntHash, graph.Directed())
for i := 1; i <= numberOfVertices; i++ {
_ = g.AddVertex(i)
}
vertex := 1
for i := 1; i <= numberOfRows; i++ {
for j := 1; j <= lengthOfChains; j++ {
if j < lengthOfChains {
if err := g.AddEdge(vertex, vertex+1); err != nil {
panic(err)
}
}
vertex++
}
}
f, err := os.Create(profFile)
if err != nil {
log.Fatal(err)
}
defer f.Close()
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
_, _ = graph.StableTopologicalSort(g, func(a, b int) bool {
return a < b
})
fmt.Printf("saved output profile to %s\n", profFile)
}
Even with a 500-node graph, it still is giving only 250ms to sort the graph. How strange that mine give orders of magnitude more. And I was careful here to profile only starting before the sort, not before the creation.
I wonder if it could have to do with the node type? I will try that next.
FYI, I tried adding graph.Acyclic() and graph.PreventCycles(), didn't have a material impact
I modified it to give different string sizes, and different chain sizes. Here are some interesting outputs.
number of vertices: 500
length of chains: 125
string 10 sort time was 1.972775917s
string 20 sort time was 1.175261542s
int sort time was 278.591834ms
number of vertices: 500
length of chains: 25
string 10 sort time was 1.518682292s
string 20 sort time was 1.608444458s
int sort time was 1.466619s
number of vertices: 500
length of chains: 25
string 10 sort time was 1.618567583s
string 20 sort time was 2.099522875s
int sort time was 1.544102917s
number of vertices: 500
length of chains: 10
string 10 sort time was 1.839373541s
string 20 sort time was 1.667635875s
int sort time was 3.910998084s
We can try and get that into a table, but basically:
- long chains provide better performance that shorter chains in int
- long vs short seems a bit similar for string
No idea what is going on. I can try and convert this into a general testing regimen program.
here is the basic one I am using:
package main
import (
"fmt"
"log"
"math/rand"
"os"
"runtime/pprof"
"time"
"github.com/dominikbraun/graph"
)
const (
numberOfVertices = 500
lengthOfChains = 10
profFile = "/tmp/cpu.prof"
)
var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
func randSeq(n int) string {
b := make([]rune, n)
for i := range b {
b[i] = letters[rand.Intn(len(letters))]
}
return string(b)
}
func main() {
if numberOfVertices%lengthOfChains != 0 {
panic("numberOfVertices must be divisible by lengthOfChains")
}
stringg10, stringComp10 := stringGraph(10)
stringg20, stringComp20 := stringGraph(20)
intg, intComp := intGraph()
f, err := os.Create(profFile)
if err != nil {
log.Fatal(err)
}
defer f.Close()
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
string10Start := time.Now()
_, _ = graph.StableTopologicalSort(stringg10, stringComp10)
string10Duration := time.Since(string10Start)
string20Start := time.Now()
_, _ = graph.StableTopologicalSort(stringg20, stringComp20)
string20Duration := time.Since(string20Start)
intStart := time.Now()
_, _ = graph.StableTopologicalSort(intg, intComp)
intDuration := time.Since(intStart)
fmt.Printf("number of vertices: %d\n", numberOfVertices)
fmt.Printf("length of chains: %d\n", lengthOfChains)
fmt.Printf("string 10 sort time was %s\n", string10Duration)
fmt.Printf("string 20 sort time was %s\n", string20Duration)
fmt.Printf("int sort time was %s\n", intDuration)
fmt.Printf("saved output profile to %s\n", profFile)
}
func stringGraph(size int) (graph.Graph[string, string], func(a, b string) bool) {
rand.Seed(time.Now().UnixNano())
numberOfRows := numberOfVertices / lengthOfChains
g := graph.New(graph.StringHash, graph.Directed(), graph.Acyclic(), graph.PreventCycles())
var vertices []string
for i := 0; i < numberOfVertices; i++ {
vertex := randSeq(size)
vertices = append(vertices, vertex)
_ = g.AddVertex(vertex)
}
vertex := 0
for i := 0; i < numberOfRows; i++ {
for j := 0; j < lengthOfChains; j++ {
if vertex < len(vertices) - 1 {
if err := g.AddEdge(vertices[vertex], vertices[vertex+1]); err != nil {
panic(err)
}
}
vertex++
}
}
return g, func(a, b string) bool {
return a < b
}
}
func intGraph() (graph.Graph[int, int], func(a, b int) bool) {
numberOfRows := numberOfVertices / lengthOfChains
g := graph.New(graph.IntHash, graph.Directed(), graph.Acyclic(), graph.PreventCycles())
for i := 1; i <= numberOfVertices; i++ {
_ = g.AddVertex(i)
}
vertex := 1
for i := 1; i <= numberOfRows; i++ {
for j := 1; j <= lengthOfChains; j++ {
if j < lengthOfChains {
if err := g.AddEdge(vertex, vertex+1); err != nil {
panic(err)
}
}
vertex++
}
}
return g, func(a, b int) bool {
return a < b
}
}
FWIW, I tried replacing the less argument with a sort.Interface arg so I could call sort.Sort() instead of sort.Slice(). It seems to shave off about 15%, not insignificant, but not a ton either.
Here is part of the updated dag.go, in case it is helpful.
func TopologicalSort[K comparable, T any](g Graph[K, T]) ([]K, error) {
return StableTopologicalSort(g, nil)
}
func StableTopologicalSort[K comparable, T any](g Graph[K, T], sorter func([]K) sort.Interface) ([]K, error) {
if !g.Traits().IsDirected {
return nil, fmt.Errorf("topological sort cannot be computed on undirected graph")
}
predecessorMap, err := g.PredecessorMap()
if err != nil {
return nil, fmt.Errorf("failed to get predecessor map: %w", err)
}
queue := make([]K, 0)
for vertex, predecessors := range predecessorMap {
if len(predecessors) == 0 {
queue = append(queue, vertex)
}
}
order := make([]K, 0, len(predecessorMap))
visited := make(map[K]struct{})
if sorter != nil {
sort.Stable(sorter(queue))
}
for len(queue) > 0 {
currentVertex := queue[0]
queue = queue[1:]
if _, ok := visited[currentVertex]; ok {
continue
}
order = append(order, currentVertex)
visited[currentVertex] = struct{}{}
for vertex, predecessors := range predecessorMap {
delete(predecessors, currentVertex)
if len(predecessors) == 0 {
queue = append(queue, vertex)
if sorter != nil {
sort.Stable(sorter(queue))
}
}
}
}
// skipping lots of lines
}
You can replace sort.Stable() with sort.Sort() for the other implementation.
In case the updated graphs are helpful, here they are:
sort.Stable:
sort.Sort:
sort.Slice (current implementation):
Finally, I decided to try the experimental new generic sorting package at https://pkg.go.dev/golang.org/x/exp/slices#SortFunc
The signatures for the last one aren't changed at all, just the content of StableToplogicalSort():
import (
"golang.org/x/exp/slices"
)
// ...
if less != nil {
slices.SortFunc(queue, less)
}
instead of sort.Slice() in the 2 places is it called.
These all seem to help. It isn't too bad getting it from 208s to 173-179s, and then down to 158s. For a true performance test, I would need to run each one hundreds of times. I hope this helps.
graph v0.22.1 reverts the implementation of TopologicalSort.
Do we want to pursue any of those Sort alternatives?
@deitch Sorry for the late feedback, I've been busy in my new job the last couple of days. I've re-opened the issue because I'd like to keep track of the performance and maybe go for a ideal or close-to-ideal implementation using a binary heap for the internal queue, as stated in #131.
Congratulations. That's a good reason. Share your LinkedIn profile here for a minute (can delete later), and I'll connect.
@deitch https://www.linkedin.com/in/dominikbraun-io/
Connection sent.