Status of running node core benchmarks
Since the PR was closed once I'd merged in my changes, I figured I'd open this to track current progress on getting this working.
This is a continuation of #58 .
Current status is that I have the job set up; however, it currently fails because Rscript is missing. I have opened https://github.com/nodejs/build/issues/821 to get this installed.
In the meantime I am going to create a branch in benchmarking with a change to not use Rscript, which we can then use for continued testing.
This now seems to be working, and is ready for testing. @mcollina you mentioned you had some PRs to try it on.
At the moment it is set up to test a PR and its impact on master.
There are four parameters available:
BRANCH - defaults to master
PULL_ID - the PR number
CATEGORY - the folder of benchmarks you want to run from node/benchmark
FILTER - limits the selection of benchmarks in that folder
Please take a look and let me know how it goes!
Example of output : https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/8/
URL to launch from: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/build?delay=0sec
NOTE: Be careful to understand how long the run you are submitting will take. Here is a guide that @AndreasMadsen wrote a year or so ago; each number (time in seconds) for a benchmark you intend to run needs multiplying by 60 (as the script will run 30 iterations of the "old" build and 30 of the "new" build).
I'm currently on vacation, I'll get this going asap when I come back (1 week).
ProTip: To multiply by 60, just read the seconds as minutes.
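To make the arithmetic concrete, here is a hypothetical example in R (the benchmark count and per-run time below are made up, not taken from the guide):
# One benchmark file with 12 configurations, each taking ~5 s per run;
# the job does 30 runs of the old build and 30 of the new build (60 in total).
configs        = 12
secs_per_run   = 5
runs_per_build = 30
total_minutes  = configs * secs_per_run * runs_per_build * 2 / 60
total_minutes  # 60 minutes in total - each 5 s configuration contributes 5 minutes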
Thoughts: currently it's taking ages to clone node core. What are people's thoughts on cloning once onto the machine, then copying that over to the workspace and doing a pull origin master?
Perhaps we could update the central clone once a week?
Investigating why the run @mcollina submitted failed - will update with findings.
How to: Running core benchmarks on Node.js CI
Current function
The current Jenkins job allows the impact of a PR on a particular branch to be compared. For example: how does PR XXXX impact performance on master?
How to configure the job
To start with, visit https://ci.nodejs.org/job/benchmark-node-micro-benchmarks/build?delay=0sec. There you have four values to enter:
BRANCH -
the branch or tag to compare against; defaults to master
PULL_ID -
the pull request to test
These two variables provide the two copies of Node.js to compare: we build BRANCH, then take a copy, apply the diff of PULL_ID, and build again. We now have two builds of Node to compare.
CATEGORY -
a category of benchmarks, i.e. one of the folders under node/benchmark in core
FILTER -
narrows the scope further: only benchmarks whose filename contains FILTER are included. For example, CATEGORY=net with FILTER=net-pipe would run only benchmark/net/net-pipe.js.
On running, a minimum of 30 iterations of each benchmark/variation are executed per build; once they have all finished, Rscript summarises the results, suggesting improvement/regression along with confidence indicators.
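For context, the summary step is essentially a per-benchmark significance test on the raw runs. Below is a minimal R sketch of the idea - the real compare.R in node core is more elaborate, and the CSV column names here are assumptions rather than the job's actual format:
# Assumed layout of output.csv: one row per run with columns
# binary ("old"/"new"), filename, configuration, rate (ops/sec).
results = read.csv("output.csv", strip.white = TRUE)

compare.one = function(d) {
  old = d$rate[d$binary == "old"]
  new = d$rate[d$binary == "new"]
  t = t.test(new, old)  # Welch two-sample t-test on the run-level rates
  data.frame(
    improvement = (mean(new) - mean(old)) / mean(old) * 100,  # % change
    p.value = t$p.value
  )
}

groups = split(results, list(results$filename, results$configuration), drop = TRUE)
print(do.call(rbind, lapply(groups, compare.one)))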
@gareth-ellis It would probably make sense to PR this as a doc somewhere in the repo.
@AndreasMadsen @mcollina @mscdex , if you have some cycles do you want to have a go with the job, using instructions above and see how you get on?
I've also got a PR up for the above doc, with a view to getting something together that we can send out to a wider audience to get people using this job and core benchmarks. (PR in #144 )
I would love to see it in action but right now I don't have anything I would like to benchmark :(
I have started https://ci.nodejs.org/job/benchmark-node-micro-benchmarks/14/
I think the script should automatically rebase the branch, because a difference in the V8 engine could skew the results and provide wrong feedback. Maybe this should be a checkbox, so we can verify across different release lines.
I ran the tests on my laptop in June and got no regressions (https://github.com/nodejs/node/pull/12857#issuecomment-309666756) - this was at the time of V8 5.9.
However the branch is still using an extremely old V8 engine, and we got some very weird results running it now:
improvement confidence p.value
net/net-c2s-cork.js dur=5 type="buf" len=1024 2.56 % 0.330368195
net/net-c2s-cork.js dur=5 type="buf" len=128 -5.73 % * 0.010873902
net/net-c2s-cork.js dur=5 type="buf" len=16 -0.34 % 0.869100857
net/net-c2s-cork.js dur=5 type="buf" len=32 3.41 % 0.218088510
net/net-c2s-cork.js dur=5 type="buf" len=4 -1.04 % 0.491132059
net/net-c2s-cork.js dur=5 type="buf" len=512 -5.45 % * 0.032530436
net/net-c2s-cork.js dur=5 type="buf" len=64 -1.61 % 0.579751020
net/net-c2s-cork.js dur=5 type="buf" len=8 -2.84 % 0.242407708
net/net-c2s.js dur=5 type="asc" len=102400 -0.86 % 0.500537810
net/net-c2s.js dur=5 type="asc" len=16777216 -0.73 % 0.762753668
net/net-c2s.js dur=5 type="buf" len=102400 0.55 % 0.664537769
net/net-c2s.js dur=5 type="buf" len=16777216 1.11 % 0.557571579
net/net-c2s.js dur=5 type="utf" len=102400 -0.39 % 0.822950308
net/net-c2s.js dur=5 type="utf" len=16777216 1.38 % 0.453838521
net/net-pipe.js dur=5 type="asc" len=102400 0.41 % 0.712092837
net/net-pipe.js dur=5 type="asc" len=16777216 0.81 % 0.775137756
net/net-pipe.js dur=5 type="buf" len=102400 0.76 % 0.466008198
net/net-pipe.js dur=5 type="buf" len=16777216 0.90 % 0.558028886
net/net-pipe.js dur=5 type="utf" len=102400 0.54 % 0.671323747
net/net-pipe.js dur=5 type="utf" len=16777216 3.82 % * 0.032465465
net/net-s2c.js dur=5 type="asc" len=102400 0.84 % 0.509765096
net/net-s2c.js dur=5 type="asc" len=16777216 -2.78 % 0.217136214
net/net-s2c.js dur=5 type="buf" len=102400 0.11 % 0.925209854
net/net-s2c.js dur=5 type="buf" len=16777216 0.03 % 0.986008241
net/net-s2c.js dur=5 type="utf" len=102400 -2.64 % 0.129138848
net/net-s2c.js dur=5 type="utf" len=16777216 2.26 % 0.088823248
net/tcp-raw-c2s.js dur=5 type="asc" len=102400 2.51 % 0.292369145
net/tcp-raw-c2s.js dur=5 type="asc" len=16777216 -0.08 % 0.972487579
net/tcp-raw-c2s.js dur=5 type="buf" len=102400 6.07 % 0.052679575
net/tcp-raw-c2s.js dur=5 type="buf" len=16777216 1.04 % 0.562458368
net/tcp-raw-c2s.js dur=5 type="utf" len=102400 5.49 % 0.217685094
net/tcp-raw-c2s.js dur=5 type="utf" len=16777216 -0.06 % 0.884432672
net/tcp-raw-pipe.js dur=5 type="asc" len=102400 0.93 % 0.731389654
net/tcp-raw-pipe.js dur=5 type="asc" len=16777216 -18.48 % 0.304436728
net/tcp-raw-pipe.js dur=5 type="buf" len=102400 -0.78 % 0.781388921
net/tcp-raw-pipe.js dur=5 type="buf" len=16777216 75.96 % 0.165276028
net/tcp-raw-pipe.js dur=5 type="utf" len=102400 -5.66 % ** 0.006928205
net/tcp-raw-pipe.js dur=5 type="utf" len=16777216 -1.05 % 0.669799761
net/tcp-raw-s2c.js dur=5 type="asc" len=102400 0.48 % 0.848273086
net/tcp-raw-s2c.js dur=5 type="asc" len=16777216 -1.09 % 0.664140660
net/tcp-raw-s2c.js dur=5 type="buf" len=102400 5.85 % * 0.046938660
net/tcp-raw-s2c.js dur=5 type="buf" len=16777216 -0.38 % 0.884173926
net/tcp-raw-s2c.js dur=5 type="utf" len=102400 -4.80 % 0.174611084
net/tcp-raw-s2c.js dur=5 type="utf" len=16777216 -0.71 % 0.364776237
Notifying upstream projects of job completion
Finished: SUCCESS
I kicked off the job again: https://ci.nodejs.org/job/benchmark-node-micro-benchmarks/15/.
It would be good to have access to the raw data, so I can use R to plot graphs. Is that pushed somewhere?
I think the script should automatically rebase the branch, because a difference in the V8 engine could skew the results and provide wrong feedback. Maybe this should be a checkbox, so we can verify across different release lines.
The script checks out $BRANCH (master) and builds node, then cherry-picks $PULL_ID on top and builds again. So it should effectively be rebased.
However the branch is still using an extremely old V8 engine, and we got some very weird results running it now:
What are the weird results? You have to remember that there is a small risk of a false positive, and with many independent tests that risk becomes large. At only 5% confidence (one star), the probability of getting exactly 5 false positives in 44 tests is 4.590962%.
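For reference, that number is just the binomial probability of seeing exactly five 5%-level false positives in 44 independent tests, e.g. in R:
# probability of exactly 5 false positives among 44 tests at the 5% level
dbinom(5, size = 44, prob = 0.05)
# [1] 0.04590962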

The plot is only valid for 44 independent tests.
plot code
library(ggplot2)

# probability of exactly k false positives among 44 independent tests,
# at each confidence level (* = 5%, ** = 1%, *** = 0.1%)
c1 = data.frame(
  false.positive = 0:44,
  probability = dbinom(0:44, 44, 0.05),
  confidence = '*'
)
c2 = data.frame(
  false.positive = 0:44,
  probability = dbinom(0:44, 44, 0.01),
  confidence = '**'
)
c3 = data.frame(
  false.positive = 0:44,
  probability = dbinom(0:44, 44, 0.001),
  confidence = '***'
)
c = rbind(c1, c2, c3)

p = ggplot(c, aes(x = false.positive)) +
  geom_bar(aes(y = probability, fill = confidence), stat = "identity") +
  facet_grid(confidence ~ .) +
  scale_x_continuous(breaks = 0:44, limits = c(-0.5, 8))
print(p)
My bad. So, this is fine.
The script works beautifully then. I think we should document it in the main repo, it would be of great help to everyone.
Excellent - I will see what I can do about providing the raw data.
Yes, we plan to get the info out to collaborators so they can start using it, but thought it would be good to have some people test it first :)
I'll add a bit of a description similar to what @AndreasMadsen wrote to explain how it works, and include in the doc in #144 .
Are there any other issues or quirks you noticed that should be documented?
@gareth-ellis +1 to all that you just said.
Running all the http benchmarks without a filter takes forever (44 hours on my laptop and I got to 60%). We should not allow that, or strongly suggest that they need to add a filter. Running a job for 3-4 days does not sound right. IMHO The filter might well be mandatory for this script.
Posting the result to the PR would be a huge plus, but I do not know if that is easy to do.
Running all the http benchmarks without a filter takes forever (44 hours on my laptop and I got to 60%). We should not allow that, or strongly suggest that they need to add a filter. Running a job for 3-4 days does not sound right. IMHO The filter might well be mandatory for this script.
I agree. Statistically, due to the false positive risk, it is also not very useful to run that many tests. I think we should just check the expected benchmark time using a precompiled list as shown earlier https://github.com/nodejs/benchmarking/pull/58#issuecomment-251408048.
@mcollina , for getting to the raw data, if you click on your job, you should be able to open the workspace, and then browse to e.g. https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/ws/benchmarking/experimental/benchmarks/community-benchmark/node/output.csv.
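Until that's automated, once you have that output.csv locally, a quick ggplot2 sketch along these lines should give you distribution plots (the column names are my assumption about the CSV layout, and the benchmark file picked is just an example):
library(ggplot2)

# Assumed layout of output.csv: one row per run with columns
# binary ("old"/"new"), filename, configuration, rate (ops/sec).
d = read.csv("output.csv", strip.white = TRUE)

# Rate distributions for one benchmark file, old build vs new build
one = d[d$filename == "net/net-pipe.js", ]  # example selection
p = ggplot(one, aes(x = configuration, y = rate, fill = binary)) +
  geom_boxplot() +
  coord_flip() +
  labs(y = "rate (ops/sec)", title = "net/net-pipe.js: old vs new")
print(p)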
I'll modify the script to move it to the top level, so there is less moving around. I think the gotcha here is that the workspace may be cleared before each run, so I will rename the file with a date/timestamp to be sure which run it came from. Otherwise, another option may be copying it to a folder on the benchmarking machine and having it synced over to the website for download. We do something similar for the charts (at https://benchmarking.nodejs.org), so it should in theory be possible. Then we could post a link in the job output to where the file will be.
For posting to a PR, I know there are some bots in the org - are there any I can pass data to that will post to a PR? In the job I already have the PR number, so I have the information to send and know where it needs to go.
Both approaches make sense. Even uploading it to a gist on GitHub (together with a chart, maybe) would be fantastic - something that lasts for a while and carries some meta-info (the SHAs being tested).
OK, will look into that.
In the meantime, csvs now get saved in the root of the ws with date/timestamp:
https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/ws/
Are we ready to announce to the rest of the team that they can start using this, or is there any other testing feedback we should get first?
@mhdawson I don't think there is anything to lose. It is not like making this public can cause anything else to fail or break.
@gareth-ellis is there a reason that RUNS wasn't exposed as a parameter to the job?
The main reason was that we were using the default number of iterations set per benchmark. I can add the RUNS parameter if you like; just appreciate that if you set the number too low, the results may not be particularly useful, and if you set it too high with enough benchmarks in scope, it could take a long time to run.
I looked at the code and saw that the script will use it, so I added it, because we have a PR (https://github.com/nodejs/node/pull/16888#issuecomment-343005812) that is very noisy, and AFAICT it would give better results with more runs but shorter iterations.
AFAICT it would give better results with more runs but shorter iterations
In theory, it will only make a tiny difference. After 30-100 runs, the difference should be less than the numerical precision. In practice, decreasing the number of iterations can change the optimization behavior.
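To make the "tiny difference" point concrete, here is a toy R illustration (simulated numbers, not real benchmark data) of how the standard error of the mean shrinks only with the square root of the number of runs:
set.seed(1)
for (n in c(30, 100, 1000)) {
  x = rnorm(n, mean = 100, sd = 10)   # simulated benchmark rates
  cat(sprintf("runs = %4d   mean = %6.2f   standard error = %.2f\n",
              n, mean(x), sd(x) / sqrt(n)))
}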
Look at this distribution:
I think all the outliers are cases where the GC kicks in, and it's not what we're trying to look at.
@refack what are the outliers?
@refack I have made a long comment in the other issue (https://github.com/nodejs/node/pull/16888#issuecomment-343434492), hopefully that will clarify things.
Gareth opened https://github.com/nodejs/node/issues/17359 to let collaborators know about it.
Anything more to be discussed in this issue? The benchmark job is now available and publicized, and the initial issues have been resolved.