List.sort() very slow if list is already sorted or nearly so.
I noticed recently that List.sort(), which is based on the Quicksort algorithm, is very slow indeed when applied to lists which are already sorted or nearly so.
For example, this code takes around 5.7 seconds to run on my Core i7 machine:
var list = (1..10000).toList
list.sort()
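For reference, the timings can be reproduced with System.clock; a minimal sketch, not necessarily the exact harness used:

var list = (1..10000).toList
var start = System.clock // seconds since program start, with fractional part
list.sort()
System.print("sort took %(System.clock - start) seconds")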
I tracked down the problem to the choice of pivot in the private partition_ method in wren_core.wren:
partition_(low, high, comparer) {
var p = this[high]
// blah
}
If instead of always using the high value we use the mid value:
partition_(low, high, comparer) {
var mid = ((low + high)/2).floor
this.swap(mid, high)
var p = this[high]
// blah
}
then the code now runs in around 0.014 seconds. Moreover, adding these two lines has a negligible effect on the time taken to sort unsorted lists.
The implementation also has a problem when the list contains repeated elements. For example, this code takes around 2.8 seconds to run regardless of the pivot choice:
var list = [1] * 10000
list.sort()
However, I can't think of a simple way to solve this particular problem which probably occurs much less often than the 'already sorted' one in practice. If anyone has an idea on this, please post it.
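For the record, the textbook remedy is a three-way or 'Dutch national flag' partition, which groups elements equal to the pivot in the middle so they drop out of both recursive calls. A minimal sketch, using the same boolean comparer convention as List.sort:

// Three-way ("Dutch national flag") partition sketch. Elements equal
// to the pivot end up grouped in a[lt..gt] and are excluded from both
// recursive calls, so an all-equal list needs only a single pass.
var partition3 = Fn.new { |a, low, high, comparer|
  var p = a[((low + high) / 2).floor]
  var lt = low   // everything before lt is less than the pivot
  var gt = high  // everything after gt is greater than the pivot
  var i = low
  while (i <= gt) {
    if (comparer.call(a[i], p)) {
      a.swap(lt, i)
      lt = lt + 1
      i = i + 1
    } else if (comparer.call(p, a[i])) {
      a.swap(i, gt)
      gt = gt - 1
    } else {
      i = i + 1
    }
  }
  return [lt, gt]
}

The recursive calls would then cover low..(lt - 1) and (gt + 1)..high, so the all-equal case finishes in a single pass.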
These, of course, are known issues with Quicksort generally, which is also very slow when sorting small lists. The latter could be solved by using an insertion sort for lists under, say, 10 elements but, frankly, I doubt whether it's worth the effort in a simple scripting language such as Wren.
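For completeness, a minimal insertion sort in Wren, using the same boolean comparer convention as List.sort, might look like this:

// Insertion sort sketch: stable, and fast for small or nearly-sorted lists.
// 'comparer' returns true if its first argument should sort before the second.
var insertionSort = Fn.new { |a, comparer|
  if (a.count < 2) return
  for (i in 1...a.count) {
    var v = a[i]
    var j = i - 1
    while (j >= 0 && comparer.call(v, a[j])) {
      a[j + 1] = a[j] // shift larger elements up
      j = j - 1
    }
    a[j + 1] = v
  }
}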
For the last snippet, I was wondering: did you measure only the sort, or both lines? I'm assuming just the sort but, if not, what does this version measure at?
var list = List.filled(10000, 1)
list.sort()
Thanks for digging btw, we can probably add some tests around this kinda thing to give us a baseline to compare with and validate against.
I just measured the sort.
The time to create the list is virtually the same at 0.0183 seconds whether one uses List.filled or the List.* operator, so possibly they're using the same code under the hood.
Incidentally, the Wikipedia article on Quicksort suggests that using a 'median of three' pivot may be better than a simple midpoint. It was actually a shade slower in my own tests, possibly because the cost of interpreting the extra code outweighed any efficiency gained.
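For reference, the idea behind 'median of three' (a sketch, not the exact code I tested) is to order the low, mid and high elements in place so that the median lands in the pivot position:

// Median-of-three pivot sketch: order a[low], a[mid], a[high] so the
// median ends up at 'high', ready for the Lomuto partition above.
var medianOfThree = Fn.new { |a, low, high, comparer|
  var mid = ((low + high) / 2).floor
  if (comparer.call(a[mid], a[low])) a.swap(mid, low)
  if (comparer.call(a[high], a[low])) a.swap(high, low)
  if (comparer.call(a[high], a[mid])) a.swap(high, mid)
  a.swap(mid, high) // the median is now the pivot at 'high'
}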
Oops, sorry, I messed up the timings for the list creation. List.* takes 0.00092 seconds but List.filled only takes around 1 millionth of a second. However, either way, it's still insignificant compared to the sort time of 2.8 seconds.
Did you try a simpler algorithm, like bubble sort? That way it would use less stack memory and might do a little less computation.
here's one I had lying around in my libs
static bubble_sort(list, compare) {
  if (list.count == 0) return
  var done = false
  while (!done) {
    done = true
    for (i in 0...list.count - 1) {
      var comp = compare.call(list[i], list[i + 1])
      if (comp > 0) {
        var tmp = list[i]
        list[i] = list[i + 1]
        list[i + 1] = tmp
        done = false
      }
    }
  }
} // bubble_sort
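Usage would be something like this (the enclosing class name Sort is hypothetical; note the comparer returns a number rather than a bool):

Sort.bubble_sort(list, Fn.new { |a, b| a - b }) // ascending sort of numbers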
Using the following optimized implementation of an ascending bubble sort:
var bubbleSort = Fn.new { |a|
  var n = a.count
  while (true) {
    var n2 = 0
    for (i in 1...n) {
      if (a[i - 1] > a[i]) {
        a.swap(i, i - 1)
        n2 = i
      }
    }
    n = n2
    if (n == 0) break
  }
}
the results are a bit variable but it's typically taking around 0.0008 seconds for both the 'already sorted' and 'all the same' cases.
@Ruby0x1's version is slightly slower than this.
As expected, this is much quicker than Quicksort even with the modified pivot but, of course, it would be a completely different story if the list contained random values.
one has a predicate so it's not a direct comparison (not that it matters in the grand scheme; interesting nonetheless!)
After reading through the salient parts of the Wikipedia article, I wonder whether the best solution here would be to change the partition scheme from Lomuto (which we seem to have at present) to Hoare?
On the face of it, this would give reasonable (albeit sub-optimal) performance for the 'already sorted' and 'all the same' cases and should be no slower for unsorted data. The interface of List.sort would be the same and so this change would be invisible to users.
The alternative would be to stick with what we have, changing the pivot choice to midpoint (or median of three) to give reasonable performance for already sorted lists and just accept the hit on cases where a lot of the data is the same on the grounds that this shouldn't occur very often in practice. We could perhaps put a warning in the docs about this so that knowledgeable users could use an alternative algorithm for this kind of scenario.
If the Hoare version works better in all cases, @PureFox48, then it makes sense to consider it. A PR would be the ideal path then, if you're up for it.
OK, when I have a clearer head, I'll try and do a Hoare version and see how it compares.
Do you have numbers for random values with bubble sort? I mean, the stress on the stack is lighter with it, and in our case it could be enough unless we have a very large amount of data.
Your optimized version can be optimized a little bit more by using while (n != 0) and removing if (n == 0) break.
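i.e. a sketch of the loop restructured as suggested:

var bubbleSort = Fn.new { |a|
  var n = a.count
  while (n != 0) { // loop condition replaces the explicit break
    var n2 = 0
    for (i in 1...n) {
      if (a[i - 1] > a[i]) {
        a.swap(i, i - 1)
        n2 = i // remember the position of the last swap
      }
    }
    n = n2
  }
}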
Both bubble sort and insertion sort are hopelessly slow when sorting random data even for lists as small as 100 elements.
Check out the Wren results for this Rosetta Code task, which I did some time ago and which compares the performance of various sorting methods.
To keep things simple, I think we must stick with quicksort for the List.sort method, which is easily the fastest algorithm for random data of non-trivial size and should (with Hoare partitioning, which I'm working on just now) still give reasonable results even when the data is already sorted or all the same.
OK, I've done the Hoare version and it seems to be working OK.
To make it easier to play about with them, I've put it (and also the modified Lomuto code) in separate classes for now.
import "random" for Random
// Lomuto partitioning (as per List.sort()) but with midpoint pivot.
class List2 {
static sort(a) { sort(a) {|low, high| low < high } }
static sort(a, comparer) {
if (!(comparer is Fn)) {
Fiber.abort("Comparer must be a function.")
}
quicksort_(a, 0, a.count - 1, comparer)
return a
}
static quicksort_(a, low, high, comparer) {
if (low < high) {
var p = partition_(a, low, high, comparer)
quicksort_(a, low, p - 1, comparer)
quicksort_(a, p + 1, high, comparer)
}
}
static partition_(a, low, high, comparer) {
var mid = ((low + high)/2).floor // added this line
a.swap(mid, high) // ditto
var p = a[high]
var i = low - 1
for (j in low..(high-1)) {
if (comparer.call(a[j], p)) {
i = i + 1
var t = a[i]
a[i] = a[j]
a[j] = t
}
}
var t = a[i+1]
a[i+1] = a[high]
a[high] = t
return i+1
}
}
// Hoare partitioning as per Wikipedia article.
class List3 {
  static sort(a) { sort(a) {|low, high| low < high } }

  static sort(a, comparer) {
    if (!(comparer is Fn)) {
      Fiber.abort("Comparer must be a function.")
    }
    quicksort_(a, 0, a.count - 1, comparer)
    return a
  }

  static quicksort_(a, low, high, comparer) {
    if (low >= 0 && high >= 0 && low < high) {
      var p = partition_(a, low, high, comparer)
      quicksort_(a, low, p, comparer)
      quicksort_(a, p + 1, high, comparer)
    }
  }

  static partition_(a, low, high, comparer) {
    var mid = ((low + high)/2).floor
    var p = a[mid]
    var i = low - 1
    var j = high + 1
    while (true) {
      while (true) {
        i = i + 1
        if (!comparer.call(a[i], p)) break
      }
      while (true) {
        j = j - 1
        if (!comparer.call(p, a[j])) break
      }
      if (i >= j) return j
      a.swap(i, j)
    }
  }
}
var n = 20
var sorted = (1..n).toList
var same = List.filled(n, 1)
var rand = Random.new()
var unsorted = sorted.toList
rand.shuffle(unsorted)
System.print("Tests for List3.sort (Hoare):\n")
System.print(sorted)
List3.sort(sorted)
System.print(sorted)
System.print()
System.print(same)
List3.sort(same)
System.print(same)
System.print()
var unsorted2 = unsorted.toList // make a copy
System.print(unsorted)
List3.sort(unsorted) { |a, b| a < b }
System.print(unsorted)
System.print()
System.print(unsorted2)
List3.sort(unsorted2) { |a, b| a > b } // descending sort
System.print(unsorted2)
/* output:
Tests for List3.sort (Hoare):

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

[8, 12, 14, 19, 15, 1, 20, 11, 3, 6, 2, 16, 13, 18, 5, 10, 9, 4, 17, 7]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

[8, 12, 14, 19, 15, 1, 20, 11, 3, 6, 2, 16, 13, 18, 5, 10, 9, 4, 17, 7]
[20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
*/
I'm doing some comparative benchmarks and will post those shortly.
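For anyone wanting to reproduce them, a harness along these lines (a sketch with assumed details, not the exact code used) would produce figures in the format below:

// Minimal benchmark harness sketch (an assumed helper, not the actual code):
// times 'sorter' over fresh copies of 'data' for 'iters' iterations and
// reports best/mean/worst in milliseconds.
var bench = Fn.new { |name, data, iters, sorter|
  var times = []
  for (i in 0...iters) {
    var copy = data.toList // sort a fresh copy each iteration
    var start = System.clock
    sorter.call(copy)
    times.add((System.clock - start) * 1000)
  }
  var best = times.reduce { |m, t| t < m ? t : m }
  var worst = times.reduce { |m, t| t > m ? t : m }
  var mean = times.reduce { |s, t| s + t } / iters
  System.print("Running '%(name)' over %(iters) iteration(s):")
  System.print("Best %(best) ms Mean %(mean) ms Worst %(worst) ms")
}

// e.g.
bench.call("Sorted data size 10000", (1..10000).toList, 10) { |a| List3.sort(a) }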
OK, here are the performance figures for List2.sort, i.e. the existing List.sort method (Lomuto) but modified to use the midpoint pivot:
Benchmarks for List2.sort (modified Lomuto):
Running 'Sorted data size 10' over 10 iteration(s):
Best 0.004 ms Mean 0.004 ms Worst 0.007 ms
Running 'Same data size 10' over 10 iteration(s):
Best 0.005 ms Mean 0.007 ms Worst 0.019 ms
Running 'Random data size 10' over 10 iteration(s):
Best 0.004 ms Mean 0.008 ms Worst 0.024 ms
Running 'Sorted data size 100' over 10 iteration(s):
Best 0.060 ms Mean 0.064 ms Worst 0.096 ms
Running 'Same data size 100' over 10 iteration(s):
Best 0.304 ms Mean 0.309 ms Worst 0.340 ms
Running 'Random data size 100' over 10 iteration(s):
Best 0.076 ms Mean 0.080 ms Worst 0.110 ms
Running 'Sorted data size 1000' over 10 iteration(s):
Best 0.840 ms Mean 0.850 ms Worst 0.908 ms
Running 'Same data size 1000' over 10 iteration(s):
Best 27.987 ms Mean 28.121 ms Worst 28.585 ms
Running 'Random data size 1000' over 10 iteration(s):
Best 1.144 ms Mean 1.159 ms Worst 1.208 ms
Running 'Sorted data size 10000' over 10 iteration(s):
Best 11.382 ms Mean 11.455 ms Worst 11.632 ms
Running 'Same data size 10000' over 10 iteration(s):
Best 2773.198 ms Mean 2777.167 ms Worst 2792.362 ms
Running 'Random data size 10000' over 10 iteration(s):
Best 16.246 ms Mean 16.354 ms Worst 16.724 ms
...and here are the performance figures for the Hoare partitioning version:
Benchmarks for List3.sort (Hoare):
Running 'Sorted data size 10' over 10 iteration(s):
Best 0.005 ms Mean 0.005 ms Worst 0.007 ms
Running 'Same data size 10' over 10 iteration(s):
Best 0.005 ms Mean 0.006 ms Worst 0.007 ms
Running 'Random data size 10' over 10 iteration(s):
Best 0.005 ms Mean 0.006 ms Worst 0.007 ms
Running 'Sorted data size 100' over 10 iteration(s):
Best 0.065 ms Mean 0.067 ms Worst 0.068 ms
Running 'Same data size 100' over 10 iteration(s):
Best 0.080 ms Mean 0.081 ms Worst 0.083 ms
Running 'Random data size 100' over 10 iteration(s):
Best 0.084 ms Mean 0.085 ms Worst 0.086 ms
Running 'Sorted data size 1000' over 10 iteration(s):
Best 0.839 ms Mean 0.846 ms Worst 0.901 ms
Running 'Same data size 1000' over 10 iteration(s):
Best 1.080 ms Mean 1.100 ms Worst 1.167 ms
Running 'Random data size 1000' over 10 iteration(s):
Best 1.183 ms Mean 1.217 ms Worst 1.303 ms
Running 'Sorted data size 10000' over 10 iteration(s):
Best 10.046 ms Mean 10.114 ms Worst 10.231 ms
Running 'Same data size 10000' over 10 iteration(s):
Best 13.034 ms Mean 13.129 ms Worst 13.248 ms
Running 'Random data size 10000' over 10 iteration(s):
Best 14.275 ms Mean 14.372 ms Worst 14.620 ms
For good measure, here are some figures for lists containing 100,000 and 1,000,000 random elements.
Benchmarks for existing list.sort (Lomuto):
Running 'Random data size 100000' over 10 iteration(s):
Best 175.197 ms Mean 180.517 ms Worst 184.312 ms
Running 'Random data size 1000000' over 10 iteration(s):
Best 2273.706 ms Mean 2315.149 ms Worst 2338.914 ms
Benchmarks for List2.sort (modified Lomuto):
Running 'Random data size 100000' over 10 iteration(s):
Best 188.390 ms Mean 190.232 ms Worst 192.531 ms
Running 'Random data size 1000000' over 10 iteration(s):
Best 2380.935 ms Mean 2383.646 ms Worst 2392.719 ms
Benchmarks for List3.sort (Hoare):
Running 'Random data size 100000' over 10 iteration(s):
Best 174.246 ms Mean 174.686 ms Worst 175.152 ms
Running 'Random data size 1000000' over 10 iteration(s):
Best 2028.855 ms Mean 2031.921 ms Worst 2036.394 ms
This demonstrates just how fast the quicksort algorithm is, even for large amounts of data and using an interpreted language such as Wren.
Whilst we're at it, I think we should also add ordering operators to the String class so that (amongst other things) we can easily sort strings lexicographically by codepoint. This code should do it:
class String is Sequence {
  /* existing stuff */

  // Compares this to another string lexicographically by codepoint.
  // Returns -1, 0 or +1 depending on whether
  // this < other, this == other or this > other respectively.
  compareTo(other) {
    if (!(other is String)) Fiber.abort("Other must be a string")
    if (this == other) return 0
    var cp1 = this.codePoints
    var cp2 = other.codePoints
    var len = (cp1.count <= cp2.count) ? cp1.count : cp2.count
    for (i in 0...len) {
      if (cp1[i] < cp2[i]) return -1
      if (cp1[i] > cp2[i]) return 1
    }
    return (cp1.count < cp2.count) ? -1 : 1
  }

  // Comparison operators.
  < (other) { compareTo(other) < 0 }
  <= (other) { compareTo(other) <= 0 }
  > (other) { compareTo(other) > 0 }
  >= (other) { compareTo(other) >= 0 }
}
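With those operators in place, sorting a list of strings would then just work with List.sort's default comparer, e.g.:

var words = ["pear", "apple", "Fig", "banana"]
words.sort() // default comparer uses <
System.print(words) // [Fig, apple, banana, pear] (codepoint order puts capitals first)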
Please open a separate discussion for this; it can be optimized more than that.
Done. See #1142.
Thanks for the detailed posts @PureFox48, it's always appreciated ✨
For reference, can you benchmark the algorithm provided by List, and the Hoare scheme applied to it? (It should optimize a little bit more, because a functional style is a little sub-optimal in Wren.)
The only one we haven't benchmarked so far is the existing list.sort() method, which is based on Lomuto partitioning but with the high value pivot. Consequently, it's very slow when the data is already sorted or all the same. The figures for that are given below:
Benchmarks for existing list.sort (Lomuto):
Running 'Sorted data size 10' over 10 iteration(s):
Best 0.007 ms Mean 0.008 ms Worst 0.012 ms
Running 'Same data size 10' over 10 iteration(s):
Best 0.005 ms Mean 0.005 ms Worst 0.006 ms
Running 'Random data size 10' over 10 iteration(s):
Best 0.003 ms Mean 0.004 ms Worst 0.004 ms
Running 'Sorted data size 100' over 10 iteration(s):
Best 0.578 ms Mean 0.583 ms Worst 0.595 ms
Running 'Same data size 100' over 10 iteration(s):
Best 0.296 ms Mean 0.300 ms Worst 0.304 ms
Running 'Random data size 100' over 10 iteration(s):
Best 0.069 ms Mean 0.070 ms Worst 0.070 ms
Running 'Sorted data size 1000' over 10 iteration(s):
Best 53.978 ms Mean 55.671 ms Worst 57.360 ms
Running 'Same data size 1000' over 10 iteration(s):
Best 27.065 ms Mean 27.468 ms Worst 28.218 ms
Running 'Random data size 1000' over 10 iteration(s):
Best 1.151 ms Mean 1.161 ms Worst 1.213 ms
Running 'Sorted data size 10000' over 10 iteration(s):
Best 5471.917 ms Mean 5604.434 ms Worst 5644.828 ms
Running 'Same data size 10000' over 10 iteration(s):
Best 2754.123 ms Mean 2757.615 ms Worst 2764.081 ms
Running 'Random data size 10000' over 10 iteration(s):
Best 14.600 ms Mean 14.767 ms Worst 15.056 ms
List2.sort differs from the above in that it uses a mid-value pivot to remedy the 'already sorted' problem. List3.sort uses Hoare partitioning to remedy both the 'already sorted' and 'all the same' problems.
To ensure all 3 methods use exactly the same data, I've updated the figures for the other two in my earlier posts.
My overall conclusion is that we should change to the Hoare method, which gives reasonable results across the board and seems a shade faster than Lomuto for large lists.
As it seems clear from these figures that a change to Hoare partitioning will be beneficial, I've now opened a PR for it.
Given the precondition of count > 1, aren't the conditions low >= 0 && high >= 0 always true, and couldn't they be removed?
The only time when high is less than 0 is when the list is empty, and it's then equal to -1. However, low is then 0, so the condition low < high fails and the method just returns without doing anything.
So, yes, we can get rid of the low >= 0 && high >= 0 part and I'll change the code in the PR when I have time.
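i.e. the recursive guard in the Hoare version would then simply become:

static quicksort_(a, low, high, comparer) {
  if (low < high) { // low >= 0 && high >= 0 removed as redundant
    var p = partition_(a, low, high, comparer)
    quicksort_(a, low, p, comparer)
    quicksort_(a, p + 1, high, comparer)
  }
}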