Clarify benchmarks?
Hey folks:
First of all, I'm kind of dismayed you guys didn't talk to us about your findings. You guys are making some heavy claims. We are about to do a release this week. We were busy essentially writing our own TF including import.
From an initial look, you guys didn't do your benchmarks properly.
- You are missing workspaces in your microbenchmarks, which defeats the purpose of benchmarking ND4J.
- You don't show any of ND4J's native configuration or any of the memory configurations you guys tried. I know it's in your guys' interest to have your own framework. That's up to you.
We'll do a blog post to correct a lot of these misconceptions you guys are perpetuating here, but in the meantime, we'll do our best to clarify questions you guys have. We don't mind having competition. It's great to keep us on our toes, but we want to make sure anything representing us is at least somewhat fair (even just handling some of the lower-hanging fruit).
Unfortunately I'm not a Scala guy, so I'd like to ask a few more questions regarding the equality of the tests, if you don't mind.
https://github.com/ThoughtWorksInc/Compute.scala/blob/nvidia-gpu/benchmarks/src/jmh/scala/com/thoughtworks/compute/benchmarks.scala#L144-L181
Line 144: Was the number of threads set to 1 at runtime? Or was the number of CPU threads > 1 during both tests?
Lines 153 & 168: Does that mean you've included input generation time in the Tanh measurements?
Line 170: So, you do use "cache" for your library, but don't use workspaces for ND4J? Nice.
Line 174: foldLeft InlineTensor - does it mean that the operation in this case is executed in place? As in "the input array gets modified and returned back" once the .flatArray() method is called?
2018-04-02 13:21 GMT+08:00 raver119 [email protected]:
Unfortunately I'm not a Scala guy, so I'd like to ask a few more questions regarding the equality of the tests, if you don't mind.
https://github.com/ThoughtWorksInc/Compute.scala/blob/nvidia-gpu/benchmarks/src/jmh/scala/com/thoughtworks/compute/benchmarks.scala#L144-L181
Line 144: Was the number of threads set to 1 at runtime? Or was the number of CPU threads > 1 during both tests?
CPU threads were > 1 during both tests. I suggest you learn to use JMH, because it is a very good tool when you are implementing a performance-critical framework.
Lines 153 & 168: Does that mean you've included input generation time in the Tanh measurements?
The input generation will be completed in warm-up iterations. I recommend you read this: http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/ . I hope the article helps you better understand how Scala's lazy val works.
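(For illustration only: a minimal sketch of the pattern being described, with made-up names rather than the actual benchmark code. The point is that a lazy val is forced on first access, which happens during JMH's warm-up iterations, so the measured iterations only reuse the already-initialized value.)

```scala
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
class TanhState {
  // Initialized on first access; with JMH that first access happens in a
  // warm-up iteration, so input generation is not part of the measured time.
  lazy val input: Array[Float] =
    Array.fill(32 * 32 * 32)(scala.util.Random.nextFloat())
}

class TanhBenchmark {
  @Benchmark
  def tanh(state: TanhState): Array[Float] =
    state.input.map(x => math.tanh(x).toFloat)
}
```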
Line 170: So, you do use "cache" for your library, but don't use workspaces for ND4J? Nice.
The cache method is the equivalent of a lazy val in the ND4J initialization. It just allocates a buffer for the input, not for the rest of the computation.
Line 174: foldLeft InlineTensor - does it mean that the operation in this case is executed in place? As in "the input array gets modified and returned back" once the .flatArray() method is called?
Compute.scala does not support any in-place mutable operation. The foldLeft over InlineTensor means merging multiple tanh calls into one kernel program.
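(Roughly, the shape of the code under discussion looks like the sketch below. `makeInput` and `tanh` are placeholders, not necessarily the real Compute.scala method names; `flatArray` and `InlineTensor` are the identifiers mentioned in this thread. Each iteration only extends an expression tree, and one fused kernel is compiled and executed when the result is finally read back.)

```scala
// Sketch only; placeholder names, not a quote of the benchmark source.
val input: Tensor = makeInput()
val result = (0 until numberOfIterations).foldLeft(input) { (acc, _) =>
  tanh(acc) // yields an InlineTensor: no kernel is launched, no buffer is allocated here
}
// Reading the result forces a single fused kernel containing all the tanh calls
// and allocates one output buffer.
val output: Array[Float] = result.flatArray.blockingAwait
```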
-- 杨博 (Yang Bo)
On Apr 1, 2018, at 23:18, 杨博 (Yang Bo) [email protected] wrote:
2018-04-02 13:21 GMT+08:00 raver119 [email protected]:
Unfortunately I'm not a Scala guy, so I'd like to ask a few more questions regarding the equality of the tests, if you don't mind.
https://github.com/ThoughtWorksInc/Compute.scala/blob/nvidia-gpu/benchmarks/src/jmh/scala/com/thoughtworks/compute/benchmarks.scala#L144-L181
Line 144: Was the number of threads set to 1 at runtime? Or was the number of CPU threads > 1 during both tests?
CPU threads were > 1 during both tests. I suggest you learn to use JMH, because it is a very good tool when you are implementing a performance-critical framework.
I’m sorry, bu my question actually had a bit deeper sense. JMH allows runtime override for Thread annotation, and the only thing i’ve asked you - if there was override when you was running your code or no. I take your answer as no, though. So next question is: what’s the idea of testing CUDA in multiple parallel CPU threads? Was workload small enough?
p.s. Thanks for advice.
Lines 153 & 168: Does that mean you've included input generation time in the Tanh measurements?
The input generation will be completed in warm-up iterations. I recommend you read this: http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/ . I hope the article helps you better understand how Scala's lazy val works.
Oh, interesting.
Line 170: So, you do use "cache" for your library, but don't use workspaces for ND4J? Nice.
The cache method is the equivalent of a lazy val in the ND4J initialization. It just allocates a buffer for the input, not for the rest of the computation.
Perfect. See next question below please.
Line 174: foldLeft InlineTensor - does it mean that the operation in this case is executed in place? As in "the input array gets modified and returned back" once the .flatArray() method is called?
Compute.scala does not support any in-place mutable operation. The foldLeft over InlineTensor means merging multiple tanh calls into one kernel program.
OK. Let me ask in other words then. Lines 170-178: there's a loop with numberOfIterations iterations. How many independent memory buffers are allocated there? 0? 1? 2? numberOfIterations? Or numberOfIterations x 2?
EDIT: I mean off-heap buffers, available to the GPU.
2018-04-02 14:49 GMT+08:00 raver119 [email protected]:
OK. Let me ask in other words then. Lines 170-178: there's a loop with numberOfIterations iterations. How many independent memory buffers are allocated there? 0? 1? 2? numberOfIterations? Or numberOfIterations x 2?
1 buffer, which stores the result. I know it's not fair to ND4J, but I really don't know how to make ND4J merge multiple immutable operations into one.
-- 杨博 (Yang Bo)
You are very keen to realize that the performance issue in the benchmark is related to memory.
For memory settings, you can check the benchmark source code. JMH forks the JVM when running; the JVM flags are provided by annotations. No annotation means the default JVM configuration.
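(For example, if non-default memory settings had been used, they would show up as JMH annotations on the benchmark class, roughly like this. This is an illustrative sketch, not a quote of the benchmark source, and the flags are examples only.)

```scala
import org.openjdk.jmh.annotations._

// JMH forks a fresh JVM per benchmark and passes it whatever flags @Fork
// declares; with no such annotation, the forked JVM uses its defaults.
@Fork(value = 1, jvmArgsAppend = Array("-Xmx8g"))
@State(Scope.Benchmark)
class MemoryConfiguredBenchmark {
  @Benchmark
  def noop(): Unit = ()
}
```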
For workspaces, I have to confess that I am not familiar with workspaces. But the purpose of the Compute.scala library is to let people not care about memory even when they are using arbitrary immutable operations. I understand that a carefully optimized application written in ND4J is good. But it seems that many users of ND4J are not smart enough to avoid OutOfMemoryError: https://github.com/deeplearning4j/deeplearning4j/issues?utf8=%E2%9C%93&q=OutOfMemoryError
-- 杨博 (Yang Bo)
2018-04-02 14:49 GMT+08:00 raver119 [email protected]:
I’m sorry, bu my question actually had a bit deeper sense. JMH allows runtime override for Thread annotation, and the only thing i’ve asked you - if there was override when you was running your code or no. I take your answer as no, though. So next question is: what’s the idea of testing CUDA in multiple parallel CPU threads? Was workload small enough?
p.s. Thanks for advice.
Yes, the purpose is to avoid starving the GPU. I did not test it, but I guess Compute.scala would be even faster on larger arrays with fewer threads, because that reduces the overhead of the driver, considering that NVIDIA's OpenCL driver has a higher overhead than CUDA.
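(In JMH terms, the benchmark ran with whatever @Threads setting is declared in the source; a run-time override of the kind mentioned above would look roughly like this, using JMH's standard -t option:)
sbt 'benchmarks/Jmh/run Issue137 -t 1'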
-- 杨博 (Yang Bo)
I see.
This benchmark isn't comparing apples to apples. Thanks for your time.
You can see there is a notice for Deeplearning4j in the README.md.
We have never criticized the performance of ND4J used in a mutable style.
2018-04-02 22:27 GMT+08:00 raver119 [email protected]:
I see.
This benchmark isn't comparing apples to apples. Thanks for your time.
You are wrong. The benchmark is comparing apples to apples: immutable operations vs immutable operations
Unfortunately, ND4J does not have the feature of dynamic kernel generation, or JIT in PyTorch's terminology.
For example, consider comparing the performance of the two Lua runtimes, Lua and LuaJIT. You can say it's not fair because LuaJIT supports JIT, but it's still comparing the same semantics of operations.
I understand ND4J is designed for Deeplearning4j, which only needs mutable in-place operations, so you guys don't have to optimize performance for other use cases.
-- 杨博 (Yang Bo)
You are wrong. The benchmark is comparing apples to apples: immutable operations vs immutable operations
A few messages above you said that 1 array was allocated for the loop. Now you say it's an immutable vs. immutable comparison. So which answer is correct?
I.e. your code that uses ND4J does numberOfIterations x 2 allocations, because your Transform.tanh() call creates a new INDArray each time, and each INDArray has 2 buffers - 1 on the GPU side, 1 on the host side. With 5 total iterations your test basically benchmarks CUDA allocation performance, and not the actual tanh.
If you call that "apples to apples comparison" - okay, that's up to you :)
Re: ND4J for DL4J. ND4J basically just mimics NumPy.
2018-04-03 0:41 GMT+08:00 raver119 [email protected]:
You are wrong. The benchmark is comparing apples to apples: immutable operations vs immutable operations
A few messages above you said that 1 array was allocated for the loop. Now you say it's an immutable vs. immutable comparison. So which answer is correct?
I.e. your code that uses ND4J does numberOfIterations x 2 allocations, because your Transform.tanh() call creates a new INDArray each time, and each INDArray has 2 buffers - 1 on the GPU side, 1 on the host side. With 5 total iterations your test basically benchmarks CUDA allocation performance, and not the actual tanh.
You are talking about your implementation, not the behavior. You provided a very good explanation of why ND4J's implementation consumes more memory than Compute.scala for the same behavior, which is exactly what the benchmark demonstrated.
-- 杨博 (Yang Bo)
No, I've just explained that you hadn't understood how to implement things with ND4J efficiently, and made claims like this:
ND4J's implementation consumes more memory
It's not about the ND4J implementation. It's about what YOU've implemented. Because obviously the same code could be written without allocating a new INDArray on each iteration. Just a difference of 1 argument :)
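(For what it's worth, a hedged illustration of the kind of one-argument difference presumably meant here, assuming ND4J's Transforms.tanh overload with a boolean copy flag; this is not a quote of either side's benchmark code.)

```scala
import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.factory.Nd4j
import org.nd4j.linalg.ops.transforms.Transforms

val input: INDArray = Nd4j.rand(Array(32, 32, 32))

// Copying form: every iteration allocates a fresh INDArray (host + device buffers).
val copied = (0 until 5).foldLeft(input)((acc, _) => Transforms.tanh(acc))

// In-place form: the same buffer is overwritten on every iteration.
val inPlace = (0 until 5).foldLeft(input)((acc, _) => Transforms.tanh(acc, false))
```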
P.S. Don't get me wrong, please. Personally I don't care about your claims etc. You want to claim you're faster than ND4J? I'm OK with that. If you want to claim that you're faster than light, I'll be OK with that as well.
The only reason I was here is performance feedback. When I hear about ND4J performance problems, I always try to get to the bottom of the problem and improve whatever is possible to improve. In this particular case I see it's a waste of time for me, due to various reasons: different approaches, a bad benchmarking setup, different goals, etc.
Thanks for your time.
2018-04-03 0:49 GMT+08:00 raver119 [email protected]:
It's not about the ND4J implementation. It's about what YOU've implemented.
Suppose a data scientist Alice read a paper and wants to reproduce an algorithm in the paper, say, a * b + c. She found that the ND4J/ND4S version of a * b + c consumes four times more memory than the Compute.scala version of a * b + c. And the Skymind guys blame Alice because she did not refactor her ND4J version to a *= b; a += c. Interesting...
-- 杨博 (Yang Bo)
Imagine Alice does some reading of documentation, and instead of:
a.mul(b).addi(c)
does something like:
a.muli(b).addi(c)
That's when it becomes interesting... :)
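(Spelled out, under the usual ND4J semantics - an illustrative note, not benchmark code: mul allocates a new INDArray for the product, while muli writes the product into a's own buffer; addi then adds c in place into whichever array it is called on.)

```scala
val copying = a.mul(b).addi(c)  // one new INDArray allocated for a*b; c added into it in place
val inPlace = a.muli(b).addi(c) // no new allocation; a itself now holds a*b + c
```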
OK, today I learned that anyone who reads the ND4J documentation should never use immutable operators.
I am so curious how fast a.muli(b).addi(c) is. It must be way faster than Compute.scala's slow immutable operators.
I guess I should add a new benchmark for that.
-- 杨博 (Yang Bo)
@raver119 The in-place version of the ND4J operation is indeed super fast. It is 1.44 times faster than ND4J's immutable version when performing a * b + c 100 times on 32x32x32 arrays, despite the fact that Compute.scala's immutable version is 43 times faster than ND4J's in-place version.
All the tests are running on a Titan X GPU.
That's already something, thank you.
Please tell me, what OS was used, and what CUDA Toolkit version was used?
EDIT: And which Titan X generation was used? There were 2 different generations sharing the same X name. Which one did you use? M or P?
Ubuntu 16.04 and CUDA 8.0 from this docker image: https://github.com/ThoughtWorksInc/scala-cuda/tree/sbt-openjdk8-cuda8.0-opencl-ubuntu16.04
What's different on your local branch? I've tried to run your nvidia-gpu branch locally but I'm having some dependency issues. All blockingAwait methods appear to be unresolved, making me think you might have a local dependency somewhere. Can you update the branch so I can import it in an IDE?
A couple more things:
- how are you running the Docker image?
- you're also running ND4J 0.8, which is out of date, but because we're testing primitive operations here it should be a negligible difference
blockingAwait is marked red in IntelliJ, which is a bug in IntelliJ's typer. The bug does not affect actual compilation.
The reason why I was using 0.8 is that the CUDA backend of ND4J 0.9.x is broken in sbt, even when compiling from a clean Docker image.
https://github.com/deeplearning4j/nd4j/issues/2767
Can you quickly give me the command you are using to run the program? I cannot access the class from the sbt console.
sbt 'benchmarks/Jmh/run Issue137'
The first run of the command may fail due to a bug in sbt-jmh, but retrying should work.
Run sbt 'benchmarks/Jmh/run -help' for more flags.
Excuse my ignorance; all I'm getting is a packaged jar, and there's no main class when I try to run it.
/home/justin/Projects/Compute.scala/benchmarks/target/scala-2.12/benchmarks_2.12-0.3.2-SNAPSHOT-jmh.jar
On Mon, Apr 2, 2018 at 6:12 PM 杨博 (Yang Bo) [email protected] wrote:
sbt 'benchmarks/Jmh/bgRun Issue137'
The first run of the command may fail due to a bug in sbt-jmh, but retrying should work.
2018-04-03 5:13 GMT+08:00 Justin Long [email protected]:
What's different on your local branch?
There were different workarounds for different OpenCL bugs from different vendors, but we now detect the vendor at run time and dynamically switch those workarounds.
The only difference now in the nvidia-gpu branch is the library dependency on the ND4J backend, because I don't know how to switch the ND4J backend at runtime.
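(Presumably that just means the sbt dependency differs between branches, along these lines. The artifact coordinates below are illustrative, not copied from the build:)

```scala
// nvidia-gpu branch: pull in the CUDA backend of ND4J (illustrative coordinates).
libraryDependencies += "org.nd4j" % "nd4j-cuda-8.0-platform" % "0.8.0"

// default branch: the CPU backend instead.
// libraryDependencies += "org.nd4j" % "nd4j-native-platform" % "0.8.0"
```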
-- 杨博 (Yang Bo)
I've also tried with no success: java -cp /home/justin/Projects/Compute.scala/benchmarks/target/scala-2.12/benchmarks_2.12-0.3.2-SNAPSHOT-jmh.jar com.thoughtworks.compute.benchmarks
Imagine you do some reading of documentation, and instead of:
java -cp ...
do something like:
sbt benchmarks/Jmh/run ...
That's when it becomes interesting... :)
Imagine you do some reading of documentation
The burden of reproducibility falls on you. The command you gave me is sbt 'benchmarks/Jmh/bgRun Issue137'. I ran that command exactly. The output is a JAR file. Here's my output:
$ sbt 'benchmarks/Jmh/bgRun Issue137'
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /home/justin/Projects/Compute.scala/project
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt ...
[info] Loading settings from build.sbt,version.sbt ...
[info] Set current project to compute-scala (in build file:/home/justin/Projects/Compute.scala/)
[info] Packaging /home/justin/Projects/Compute.scala/NDimensionalAffineTransform/target/scala-2.12/ndimensionalaffinetransform_2.12-0.3.2-SNAPSHOT.jar ...
[info] Packaging /home/justin/Projects/Compute.scala/Memory/target/scala-2.12/memory_2.12-0.3.2-SNAPSHOT.jar ...
[info] Done packaging.
[info] Done packaging.
[info] Packaging /home/justin/Projects/Compute.scala/Expressions/target/scala-2.12/expressions_2.12-0.3.2-SNAPSHOT.jar ...
[info] Done packaging.
[info] Packaging /home/justin/Projects/Compute.scala/OpenCLKernelBuilder/target/scala-2.12/openclkernelbuilder_2.12-0.3.2-SNAPSHOT.jar ...
[info] Done packaging.
[info] Packaging /home/justin/Projects/Compute.scala/Trees/target/scala-2.12/trees_2.12-0.3.2-SNAPSHOT.jar ...
[info] Packaging /home/justin/Projects/Compute.scala/OpenCL/target/scala-2.12/opencl_2.12-0.3.2-SNAPSHOT.jar ...
[info] Done packaging.
[info] Done packaging.
[info] Packaging /home/justin/Projects/Compute.scala/Tensors/target/scala-2.12/tensors_2.12-0.3.2-SNAPSHOT.jar ...
[info] Done packaging.
[info] Packaging /home/justin/Projects/Compute.scala/benchmarks/target/scala-2.12/benchmarks_2.12-0.3.2-SNAPSHOT.jar ...
[info] Packaging /home/justin/Projects/Compute.scala/benchmarks/target/scala-2.12/benchmarks_2.12-0.3.2-SNAPSHOT-tests.jar ...
[info] Done packaging.
Processing 24 classes from /home/justin/Projects/Compute.scala/benchmarks/target/scala-2.12/classes with "reflection" generator
[info] Done packaging.
Writing out Java source to /home/justin/Projects/Compute.scala/benchmarks/target/scala-2.12/src_managed/jmh and resources to /home/justin/Projects/Compute.scala/benchmarks/target/scala-2.12/resource_managed/jmh
[info] Compiling 1 Scala source and 37 Java sources to /home/justin/Projects/Compute.scala/benchmarks/target/scala-2.12/classes ...
[warn] /home/justin/Projects/Compute.scala/benchmarks/src/jmh/scala/com/thoughtworks/compute/benchmarks.scala:453:24: The outer reference in this type test cannot be checked at run time.
[warn] final case class ConvolutionalLayer(weight: NonInlineTensor, bias: NonInlineTensor) {
[warn] ^
[warn] one warning found
[info] Done compiling.
[info] Packaging /home/justin/Projects/Compute.scala/benchmarks/target/scala-2.12/benchmarks_2.12-0.3.2-SNAPSHOT-jmh.jar ...
[info] Done packaging.
[success] Total time: 8 s, completed Apr 2, 2018 6:23:45 PM
If you can't give me something that's reproducible, that's very suspect. I see that you have since edited your answer to use run instead of bgRun. Next time, please give me a heads-up when you make a change.
To clarify:
- bgRun starts the benchmark in the background, which is suitable when you still want to use the sbt shell while it is running.
- run runs the benchmark and waits for it to finish. You should use run when it is the only command submitted to sbt batch mode.