miniboxing-plugin
ConsList Performance Slowdown
Benchmark with miniboxing
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ConsListBench.appendConsList 10 avgt 30 0.689 ± 0.022 us/op
[info] ConsListBench.appendConsList 100 avgt 30 5.862 ± 0.101 us/op
[info] ConsListBench.appendConsList 1000 avgt 30 56.735 ± 1.560 us/op
[info] ConsListBench.appendList 10 avgt 30 0.157 ± 0.010 us/op
[info] ConsListBench.appendList 100 avgt 30 1.328 ± 0.025 us/op
[info] ConsListBench.appendList 1000 avgt 30 12.906 ± 0.218 us/op
without miniboxing
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ConsListBench.appendConsList 10 avgt 30 0.130 ± 0.004 us/op
[info] ConsListBench.appendConsList 100 avgt 30 1.294 ± 0.041 us/op
[info] ConsListBench.appendConsList 1000 avgt 30 13.498 ± 0.585 us/op
[info] ConsListBench.appendList 10 avgt 30 0.149 ± 0.006 us/op
[info] ConsListBench.appendList 100 avgt 30 1.117 ± 0.024 us/op
[info] ConsListBench.appendList 1000 avgt 30 12.934 ± 0.201 us/op
How to reproduce: clone brique and run the benchmark in branch master (without miniboxing) and in branch miniboxing (with miniboxing):
sbt
project bench
run -i 10 -wi 3 -f3 .*ConsListBench.append.*
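For readers following along, here is a hedged sketch of the kind of operation an append benchmark like this exercises; the class and method names below are illustrative stand-ins, not brique's actual ConsList API:

```scala
// Hypothetical mini cons list; names mirror the benchmark's spirit,
// not brique's real classes.
sealed trait CL[+A]
case object CNil extends CL[Nothing]
final case class CCons[A](head: A, tail: CL[A]) extends CL[A]

object AppendSketch {
  // Append rebuilds the left list on top of the right one: O(|xs|).
  def append[A](xs: CL[A], ys: CL[A]): CL[A] = xs match {
    case CNil        => ys
    case CCons(h, t) => CCons(h, append(t, ys))
  }

  def main(args: Array[String]): Unit = {
    val xs = CCons(1, CCons(2, CNil))
    val ys = CCons(3, CNil)
    println(append(xs, ys))  // CCons(1,CCons(2,CCons(3,CNil)))
  }
}
```

With miniboxing, the point of interest is whether the `head: A` field of each freshly allocated cell stays unboxed while the list is rebuilt.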
I tried in master to implement foldLeft like in the miniboxing branch (with an inner method), but I don't see an impact on performance.
I'll have to understand how the ConsList is transformed in detail, but in principle the expected speedups are around 30% compared to the generic version (at least that's what we got for the Scala linked list with this source code).
I refactored all the tailrec methods into while loops in order to be sure they were not the cause of the performance drop. I benchmarked sum to be closer to how you benchmarked miniboxing, and it seems to give a similar improvement:
// with miniboxing
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ConsListBench.sumConsList 10 avgt 30 0.055 ± 0.001 us/op
[info] ConsListBench.sumConsList 100 avgt 30 0.277 ± 0.003 us/op
[info] ConsListBench.sumConsList 1000 avgt 30 2.438 ± 0.041 us/op
[info] ConsListBench.sumList 10 avgt 30 0.045 ± 0.000 us/op
[info] ConsListBench.sumList 100 avgt 30 0.596 ± 0.012 us/op
[info] ConsListBench.sumList 1000 avgt 30 6.322 ± 0.079 us/op
// without miniboxing
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ConsListBench.sumConsList 10 avgt 30 0.046 ± 0.001 us/op
[info] ConsListBench.sumConsList 100 avgt 30 0.365 ± 0.004 us/op
[info] ConsListBench.sumConsList 1000 avgt 30 3.926 ± 0.062 us/op
[info] ConsListBench.sumList 10 avgt 30 0.046 ± 0.001 us/op
[info] ConsListBench.sumList 100 avgt 30 0.589 ± 0.005 us/op
[info] ConsListBench.sumList 1000 avgt 30 6.373 ± 0.068 us/op
However, I found a case where performance drops by a factor of 10; here is a link to the code: https://github.com/julien-truffaut/brique/blob/bbb28176cd8310a2ccf2c01a8939f210604898bb/core/src/main/scala/brique/ConsList.scala#L154-L168
// with miniboxing
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ConsListBench.reverse2ConsList 10 avgt 30 0.071 ± 0.001 us/op
[info] ConsListBench.reverse2ConsList 100 avgt 30 0.487 ± 0.007 us/op
[info] ConsListBench.reverse2ConsList 1000 avgt 30 4.919 ± 0.072 us/op
[info] ConsListBench.reverseConsList 10 avgt 30 0.546 ± 0.007 us/op
[info] ConsListBench.reverseConsList 100 avgt 30 5.086 ± 0.033 us/op
[info] ConsListBench.reverseConsList 1000 avgt 30 50.056 ± 1.153 us/op
[info] ConsListBench.reverseList 10 avgt 30 0.039 ± 0.001 us/op
[info] ConsListBench.reverseList 100 avgt 30 0.347 ± 0.003 us/op
[info] ConsListBench.reverseList 1000 avgt 30 3.918 ± 0.044 us/op
I ran the entire benchmark with and without miniboxing; you can find the results here: https://docs.google.com/spreadsheets/d/1tTPKvkjqKVxXm6BeO0yLm1cFbTtkLGKrLg6LWrVyDZc/edit?usp=sharing
Very strange, this is what I'm getting:
With miniboxing:
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ConsListBench.appendConsList 10 avgt 15 0.198 ± 0.013 us/op
[info] ConsListBench.appendConsList 100 avgt 15 1.350 ± 0.054 us/op
[info] ConsListBench.appendConsList 1000 avgt 15 13.756 ± 0.661 us/op
[info] ConsListBench.appendConsList 10000 avgt 15 111.328 ± 4.386 us/op
[info] ConsListBench.appendConsList 100000 avgt 15 1345.934 ± 34.768 us/op
[info] ConsListBench.appendConsList 1000000 avgt 15 20270.947 ± 1249.934 us/op
[info] ConsListBench.appendList 10 avgt 15 0.156 ± 0.009 us/op
[info] ConsListBench.appendList 100 avgt 15 1.251 ± 0.060 us/op
[info] ConsListBench.appendList 1000 avgt 15 11.972 ± 0.197 us/op
[info] ConsListBench.appendList 10000 avgt 15 244.344 ± 6.984 us/op
[info] ConsListBench.appendList 100000 avgt 15 2464.826 ± 171.291 us/op
[info] ConsListBench.appendList 1000000 avgt 15 11816.217 ± 1596.650 us/op
without miniboxing:
[info] # Run complete. Total time: 00:06:20
[info]
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ConsListBench.appendConsList 10 avgt 15 0.090 ± 0.002 us/op
[info] ConsListBench.appendConsList 100 avgt 15 0.832 ± 0.031 us/op
[info] ConsListBench.appendConsList 1000 avgt 15 8.634 ± 0.510 us/op
[info] ConsListBench.appendConsList 10000 avgt 15 81.850 ± 5.412 us/op
[info] ConsListBench.appendConsList 100000 avgt 15 1031.133 ± 33.813 us/op
[info] ConsListBench.appendConsList 1000000 avgt 15 15298.762 ± 1205.139 us/op
[info] ConsListBench.appendList 10 avgt 15 0.143 ± 0.006 us/op
[info] ConsListBench.appendList 100 avgt 15 1.246 ± 0.044 us/op
[info] ConsListBench.appendList 1000 avgt 15 11.854 ± 0.792 us/op
[info] ConsListBench.appendList 10000 avgt 15 236.025 ± 6.038 us/op
[info] ConsListBench.appendList 100000 avgt 15 2481.521 ± 165.411 us/op
[info] ConsListBench.appendList 1000000 avgt 15 11137.584 ± 1358.864 us/op
@julien-truffaut can you tell a bit more about your setup? Are you running the JVM in client mode?
@DarkDimius I am not sure; how can I check? I am running the benchmarks on my MacBook Pro.
I think I see the problem. Running with -P:minibox:warn-all to get all warnings, including for the Scala library, I get:
[warn] /mnt/data1/Work/Workspace/dev/brique/core/src/main/scala/brique/ConsList.scala:272: The class scala.Tuple2 would benefit from miniboxing type parameter T1, since it is instantiated by miniboxed type parameter A of method unapply.
[warn] final case class Cons[@miniboxed A](head: A, tail: ConsList[A]) extends ConsList[A]
[warn]                  ^
Translated to English: in every match/case such as case Cons(head, tail), the head value is boxed and then unboxed, since it is passed in a Tuple2, which isn't miniboxed. So the key to the problem is fixing #155 -- from there, we can debug the issue further.
I moved #155 to milestone 0.4, so I will fix it by the end of March. But at least we have a lead :)
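To make the warning concrete, here is a self-contained sketch (hypothetical Cons/CList stand-ins, not brique's real classes) of the extractor the Scala 2 compiler synthesizes for a two-field case class -- it returns an Option of a generic Tuple2, which is where, per the warning above, a primitive head gets boxed:

```scala
// Hypothetical stand-ins, just to show the shape of the synthesized
// extractor (Scala 2).
sealed trait CList[+A]
final case class Cons[A](head: A, tail: CList[A]) extends CList[A]
case object CNil extends CList[Nothing]

object ExtractorSketch {
  def main(args: Array[String]): Unit = {
    val c = Cons(42, CNil)
    // The synthesized extractor has this signature; the Int head passes
    // through a generic Tuple2 and is therefore boxed on the way:
    val viaUnapply: Option[(Int, CList[Int])] = Cons.unapply(c)
    println(viaUnapply.map(_._1))

    // A type-test match reads the field directly, with no tuple:
    val xs: CList[Int] = c
    xs match {
      case cons: Cons[_] => println(cons.head)
      case CNil          => ()
    }
  }
}
```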
An alternative approach, which we used in the linked lists, was to expose the head and tail explicitly. So the code would be transformed from:
final def foldLeft[@miniboxed B](b: B)(f: (B, A) => B): B = {
  var acc = b
  var l = this
  while (true) {
    l match {
      case Cons(h, t) =>
        acc = f(acc, h)
        l = t
      case CNil() => return acc
    }
  }
  acc
}
to:
final def foldLeft[@miniboxed B](b: B)(f: (B, A) => B): B = {
  var acc = b
  var l = this
  while (true) {
    l match {
      case cons: Cons[A] =>
        acc = f(acc, cons.head)
        l = cons.tail
      case CNil() => return acc
    }
  }
  acc
}
Notice the two cases:
case Cons(h, t) =>
  acc = f(acc, h)
  l = t
and
case cons: Cons[A] =>
  acc = f(acc, cons.head)
  l = cons.tail
Ah, that makes sense, thank you @VladUreche
It seems that even with this, the miniboxed version is still slower:
[info] Benchmark (size) Mode Cnt Score Error Units
[info] ConsListBench.appendConsList 10 avgt 15 0.179 ± 0.006 us/op
[info] ConsListBench.appendConsList 100 avgt 15 1.147 ± 0.021 us/op
[info] ConsListBench.appendConsList 1000 avgt 15 11.635 ± 0.171 us/op
[info] ConsListBench.appendConsList 10000 avgt 15 102.367 ± 1.427 us/op
[info] ConsListBench.appendConsList 100000 avgt 15 1238.197 ± 63.866 us/op
[info] ConsListBench.appendConsList 1000000 avgt 15 19298.500 ± 1295.293 us/op
[info] ConsListBench.appendList 10 avgt 15 0.154 ± 0.017 us/op
[info] ConsListBench.appendList 100 avgt 15 1.212 ± 0.065 us/op
[info] ConsListBench.appendList 1000 avgt 15 11.223 ± 0.074 us/op
[info] ConsListBench.appendList 10000 avgt 15 229.220 ± 1.908 us/op
[info] ConsListBench.appendList 100000 avgt 15 2269.490 ± 161.668 us/op
[info] ConsListBench.appendList 1000000 avgt 15 9624.083 ± 1374.385 us/op
The commits are: https://github.com/VladUreche/brique/commit/a13de7b05a53197d97e81311aab5cb232d27bb73 + https://github.com/VladUreche/brique/commit/1f7389c7fd40831c2d678329d18d4782eb1e70d0
I enabled the Apples test and it works fine :)
I guess the only solution is to take one of the benchmarks and see the bytecode, since I can't tell where the slowdown is coming from...
I am not familiar with bytecode; could you show me how to output the bytecode of a particular method?
If you do javap -c -private com.mydomain.MyClass you will get the full bytecode for a class (which will include the particular method).
@julien-truffaut, if you have some time, the best thing you could do would be to take one of the benchmarks (one that exhibits the slowdown) and minimize it to a single file, eliminating all external dependencies -- this way, it's going to be much easier to see the bytecode.
I created the following minimal example: https://github.com/julien-truffaut/brique/blob/miniboxing/core/src/main/scala/brique/TList.scala
here are the results of the benchmarks with miniboxing:
[info] Benchmark (size) Mode Cnt Score Error Units
[info] TListBench.reverseList 10 avgt 100 0.047 ± 0.002 us/op
[info] TListBench.reverseList 100 avgt 100 0.375 ± 0.014 us/op
[info] TListBench.reverseList 1000 avgt 100 4.449 ± 0.152 us/op
[info] TListBench.reverseTList 10 avgt 100 0.082 ± 0.001 us/op
[info] TListBench.reverseTList 100 avgt 100 0.616 ± 0.023 us/op
[info] TListBench.reverseTList 1000 avgt 100 5.933 ± 0.133 us/op
and here is the bytecode generated for TList_J:
public abstract class brique.TList_J<Asp> implements brique.TList<Asp> {
private final byte brique|TList_J|A_TypeTag;
public scala.collection.Iterator<java.lang.Object> productIterator();
Code:
0: aload_0
1: invokestatic #17 // Method scala/Product$class.productIterator:(Lscala/Product;)Lscala/collection/Iterator;
4: areturn
public java.lang.String productPrefix();
Code:
0: aload_0
1: invokestatic #24 // Method scala/Product$class.productPrefix:(Lscala/Product;)Ljava/lang/String;
4: areturn
public final brique.TList<Asp> reverse();
Code:
0: new #28 // class brique/TList$TNil_J
3: dup
4: aload_0
5: getfield #30 // Field "brique|TList_J|A_TypeTag":B
8: invokespecial #34 // Method brique/TList$TNil_J."<init>":(B)V
11: astore_1
12: aload_0
13: astore_2
14: aload_2
15: astore_3
16: aload_3
17: instanceof #36 // class brique/TList$TCons
20: ifeq 69
23: aload_3
24: checkcast #36 // class brique/TList$TCons
27: astore 4
29: new #38 // class brique/TList$TCons_J
32: dup
33: aload_0
34: getfield #30 // Field "brique|TList_J|A_TypeTag":B
37: aload 4
39: aload_0
40: getfield #30 // Field "brique|TList_J|A_TypeTag":B
43: invokeinterface #42, 2 // InterfaceMethod brique/TList$TCons.head_J:(B)J
48: aload_1
49: invokespecial #45 // Method brique/TList$TCons_J."<init>":(BJLbrique/TList;)V
52: astore_1
53: aload 4
55: invokeinterface #48, 1 // InterfaceMethod brique/TList$TCons.tail:()Lbrique/TList;
60: astore_2
61: getstatic #54 // Field scala/runtime/BoxedUnit.UNIT:Lscala/runtime/BoxedUnit;
64: astore 5
66: goto 14
69: aload_1
70: areturn
public brique.TList_J(byte);
Code:
0: aload_0
1: iload_1
2: putfield #30 // Field "brique|TList_J|A_TypeTag":B
5: aload_0
6: invokespecial #60 // Method java/lang/Object."<init>":()V
9: aload_0
10: invokestatic #64 // Method scala/Product$class.$init$:(Lscala/Product;)V
13: aload_0
14: invokestatic #69 // Method brique/TList$class.$init$:(Lbrique/TList;)V
17: return
}
I am not sure what this line corresponds to:
61: getstatic #54 // Field scala/runtime/BoxedUnit.UNIT:Lscala/runtime/BoxedUnit;
Thanks a lot @julien-truffaut! I will look into this by Saturday evening (SF time).
@VladUreche did you get some time to look into this?
@julien-truffaut no, sorry, I've been working full-time on the ScalaDays presentation. Let me have a quick look now.
@julien-truffaut, I was able to isolate the problem down to a single file:
$ cat gh-bug-188.scala
package brique

import scala.miniboxed

/**
 * shortest example to exhibit slowdown with miniboxing
 */
sealed abstract class TList[@miniboxed A] extends scala.Product with scala.Serializable {
  import brique.TList._

  final def reverse: TList[A] = {
    var acc: TList[A] = TNil[A]()
    var l = this
    while (true) {
      l match {
        case c: TCons[A] => acc = TCons(c.head, acc); l = c.tail
        case _ => return acc
      }
    }
    acc
  }
}

object TList {
  final case class TNil[@miniboxed A]() extends TList[A]
  final case class TCons[@miniboxed A](head: A, tail: TList[A]) extends TList[A]
}

object Test {
  import TList._

  def main(args: Array[String]): Unit = {
    val list = (1 to 1000000).foldRight[TList[Int]](TNil())(TCons(_, _))
    var i = 0
    var j = 0
    while (j < 30) {
      val start = System.currentTimeMillis
      while (i < 100) {
        list.reverse
        i += 1
      }
      val end = System.currentTimeMillis
      println((end - start) + " ms for " + i + " iterations")
      i = 0
      j += 1
    }
    println("done")
  }
}
$ mb-scalac gh-bug-188.scala
$ mb-scala brique.Test
1354 ms for 100 iterations
1119 ms for 100 iterations
1174 ms for 100 iterations
1023 ms for 100 iterations
1177 ms for 100 iterations
1166 ms for 100 iterations
1109 ms for 100 iterations
1123 ms for 100 iterations
1216 ms for 100 iterations
1143 ms for 100 iterations
1220 ms for 100 iterations
1239 ms for 100 iterations
1169 ms for 100 iterations
1193 ms for 100 iterations
1158 ms for 100 iterations
1200 ms for 100 iterations
1168 ms for 100 iterations
1198 ms for 100 iterations
1255 ms for 100 iterations
1156 ms for 100 iterations
1192 ms for 100 iterations
1167 ms for 100 iterations
1227 ms for 100 iterations
1147 ms for 100 iterations
1239 ms for 100 iterations
1142 ms for 100 iterations
1242 ms for 100 iterations
1217 ms for 100 iterations
1150 ms for 100 iterations
1184 ms for 100 iterations
done
$ mb-scalac gh-bug-188.scala -P:minibox:off
Miniboxing plugin warning: Turning off all minboxing specialization!
Miniboxing plugin warning: Optimizing `MbArray` is disabled, thus `MbArray`-s will be generic and will box.
$ mb-scala brique.Test
2123 ms for 100 iterations
1828 ms for 100 iterations
1126 ms for 100 iterations
1014 ms for 100 iterations
1021 ms for 100 iterations
1098 ms for 100 iterations
1134 ms for 100 iterations
1097 ms for 100 iterations
1053 ms for 100 iterations
1106 ms for 100 iterations
1033 ms for 100 iterations
1081 ms for 100 iterations
1049 ms for 100 iterations
1071 ms for 100 iterations
1029 ms for 100 iterations
1030 ms for 100 iterations
999 ms for 100 iterations
1049 ms for 100 iterations
991 ms for 100 iterations
1048 ms for 100 iterations
1037 ms for 100 iterations
1116 ms for 100 iterations
1134 ms for 100 iterations
1047 ms for 100 iterations
1102 ms for 100 iterations
1036 ms for 100 iterations
1035 ms for 100 iterations
1050 ms for 100 iterations
1079 ms for 100 iterations
979 ms for 100 iterations
done
And yes, I can see the 10-20% slowdown. Looking into it now :)
I looked at the bytecode and the JIT compilation process, and they're identical across the generic and miniboxed versions of the code. The problem seems to be related to GC cycles: the miniboxed version of the code triggers more GC cycles, which in turn cause the 10-20% performance drop. I tried reasoning about the object size on the heap for miniboxed and generic TCons cells, but (at least on paper) they seem identical, despite the slightly different layout.
Right now, I would leave this pending, since I don't see an obvious direction. You should get much better performance when you start mapping over the TList, since there you'll have a clear benefit.
@julien-truffaut, WDYT?
Since there was too much text, here's the link: https://gist.github.com/VladUreche/9faa054d78c00fd66f33
Also the -verbose:gc
results: https://gist.github.com/VladUreche/e8ce83f419534f378699
@VladUreche thanks for looking into that. A few points: I would not rely on benchmarking with loops and System.currentTimeMillis; it is very easy to get biased results. I will rerun the bench, but I think the difference I observed is closer to 40-50%. Even if it were only a 20% slowdown, I think it is a major problem, because it is not limited to reverse; most List methods are impacted.
@julien-truffaut, I have proof that the slowdown is indeed caused by the different layout of miniboxed classes and its interaction with the GC.
Here's the proof: we can easily create a list using the miniboxed object-based versions of TNil and TCons, which are identical to the generic versions of TNil and TCons:
$ diff gh-bug-188-any.scala gh-bug-188.scala
36c36
< var list = (1 to 1000000).foldRight[TList[Any]](TNil())(TCons(_, _))
---
> var list = (1 to 1000000).foldRight[TList[Int]](TNil())(TCons(_, _))
Then, running the benchmark on the generic version of the code (with -P:minibox:off) and the miniboxed object-based version produces exactly the same numbers. Hence, the slowdown can be attributed to the miniboxed class layout, as compared to the generic and the miniboxed-but-object-based TNil and TCons classes.
Now, I think you could use the same technique to debug the slowdowns in the other benchmarks: you can change the input to generate the object-based miniboxed versions of the class in https://github.com/julien-truffaut/brique/blob/miniboxing/bench/src/main/scala/brique/bench/input/TListInput.scala:
@Setup
def setup(): Unit =
  tList = (range(size).foldRight[TList[Any]](TNil())(TCons(_, _))).asInstanceOf[TList[Int]]
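The same Any-then-cast trick in a self-contained form (illustrative stand-in classes, not the real TList): per the explanation above, under the miniboxing plugin it is the static type at the allocation site that selects the object-based constructors, while the final cast is erased at runtime and does not re-specialize the cells:

```scala
// Hypothetical stand-ins for TList/TNil/TCons.
sealed trait TL[A]
final case class TN[A]() extends TL[A]
final case class TC[A](head: A, tail: TL[A]) extends TL[A]

object AnyCastSketch {
  // Building through TL[Any] makes the allocation sites generic
  // (object-based under miniboxing); asInstanceOf only changes the
  // static type seen by the rest of the benchmark.
  def build(): TL[Int] =
    (1 to 3).foldRight[TL[Any]](TN())(TC(_, _)).asInstanceOf[TL[Int]]

  def main(args: Array[String]): Unit =
    println(build())
}
```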
I will leave the bug open until I have some time to look into all the slowdowns -- some may be legitimate, so I really want to understand the problem.
A few points: I would not rely on benchmarking with loops and System.currentTimeMillis; it is very easy to get biased results.
Sure, that makes sense. I wanted a self-contained (literally zero-dependency) benchmark to run in the debug build of the JVM.
I will rerun the bench but I think the difference I observed is closer to 40-50%.
Can you please run with the Any version as well for the input (described in the previous message)? I would be curious to see how much of the slowdown can be attributed to the in-memory class layout. Maybe there are other slowdowns we can address.
Even if it were only a 20% slowdown, I think it is a major problem, because it is not limited to reverse; most List methods are impacted.
Yes, that's exactly why I left the bug open. Now I'm a bit pressed to write a paper (the deadline is on the 26th of March, 4 days from now), but as soon as I'm done with that I will look into the other slowdowns as well.
One more point: keep in mind that Scala collections do all kinds of tricks to improve performance; for example, the tail in List is a variable that can only be set by the builder. This improves performance when map-ping, filter-ing and flatMap-ping, since it removes the need to reconstruct the list: https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/immutable/List.scala#L439
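A hedged sketch of that mutable-tail builder trick with stand-in classes (not the actual scala.collection.immutable.List code): the cons cell's tail is a var that only the building code mutates, so map can produce the result in order in a single pass, with no reverse step:

```scala
// Illustrative cons list whose cells have a mutable tail.
sealed trait L0[+A]
case object Nil0 extends L0[Nothing]
final class Cons0[A](val head: A, var tail: L0[A]) extends L0[A] {
  override def toString = s"$head :: $tail"
}

object BuilderSketch {
  def map0[A, B](xs: L0[A])(f: A => B): L0[B] = xs match {
    case Nil0 => Nil0
    case c: Cons0[A] =>
      val first = new Cons0(f(c.head), Nil0: L0[B])
      var last = first
      var rest = c.tail
      while (rest.isInstanceOf[Cons0[_]]) {
        val cur = rest.asInstanceOf[Cons0[A]]
        val next = new Cons0(f(cur.head), Nil0: L0[B])
        last.tail = next  // mutate the previous cell's tail: no reverse pass
        last = next
        rest = cur.tail
      }
      first
  }

  def main(args: Array[String]): Unit = {
    val xs = new Cons0(1, new Cons0(2, new Cons0(3, Nil0)))
    println(map0(xs)(_ * 10))  // 10 :: 20 :: 30 :: Nil0
  }
}
```

The design point: the mutation is invisible from outside because only the builder (here, map0) ever writes to a tail, and only before the cell is published.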
@VladUreche no problem, take your time; I just wanted to be sure that we agreed on the importance of this issue.
Indeed, Scala's List uses some tricks, but we can outperform it almost everywhere (except map and take atm). I tried to use miniboxing in order to get even better performance, but for the moment it only improves sum; almost all other methods are negatively impacted :/
Based on the numbers you gathered, I prepared a version of the spreadsheet comparing generic and miniboxed times:
https://docs.google.com/spreadsheets/d/1tTPKvkjqKVxXm6BeO0yLm1cFbTtkLGKrLg6LWrVyDZc/edit?usp=sharing
- The last column shows the miniboxed time divided by the generic time
- I color-coded the cells: green if the performance of the miniboxed ops is better, yellow if the slowdown is at most 50%, and red if it's above, to help us prioritize
- I removed benchmarks with size = 10, because those tend to be too jittery to be useful (and are outliers most of the time)
Given the red areas, I think the main priority is to check drop, eq and flatMap.
And yes, I agree it's important to have a good idea of what is happening -- for some slowdowns we may not be able to do much, since they're not under our control, but at the very least I must provide a proper explanation of each slowdown and why it occurs.
Btw, I'm really looking forward to seeing the numbers gathered with TList[Any]. Those should lie in the middle between the generic and miniboxed times, and should give us an idea of where we stand on each benchmark.
One more thing that came to mind: could you also prepare macro-benchmarks of the TList operations? What I've seen in some cases is that for microbenchmarks the JVM inlines methods aggressively and removes all overheads thanks to "clean" type profiles. But once you use the collection with multiple functions, types of values, etc., the performance suddenly drops. So it would be really nice to have an idea of if/how the performance changes in macrobenchmarks.