Status of running node core benchmarks
Since the PR was closed once I'd merged in my changes, I figured I'd open this to track current progress on getting this working.
This is a continuation of #58 .
Current status is that I have the job set up; however, it currently fails because Rscript is missing. I have opened https://github.com/nodejs/build/issues/821 to get this installed.
In the meantime I am going to create a branch in benchmarking with a change to not use Rscript, which we can then use for continued testing.
This now seems to be working, and is ready for testing. @mcollina you mentioned you had some PRs to try it on.
At the moment it is set up to test a PR and its impact on master.
There are four parameters available:
BRANCH - defaults to master
PULL_ID - the PR number
CATEGORY - the folder of benchmarks you want to run from node/benchmark
FILTER - limits the selection of benchmarks in that folder
Please take a look and let me know how it goes!
Example of output : https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/8/
URL to launch from: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/build?delay=0sec
NOTE: Be careful to understand how long the run you are submitting will take. Here is a guide that @AndreasMadsen wrote a year or so ago; each number (time in seconds) for a benchmark you intend to run needs multiplying by 60 (as the script will run 30 iterations of the "old" build and 30 of the "new" build).
I'm currently on vacation, I'll get this going asap when I come back (1 week).
ProTip: To multiply by 60, just read the seconds as minutes.
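To make the arithmetic concrete, here is a hypothetical example in R (the benchmark count and per-run time below are made up, not taken from the guide):
# One benchmark file with 12 configurations, each taking ~5 s per run;
# the job does 30 runs of the old build and 30 of the new build (60 in total).
configs        = 12
secs_per_run   = 5
runs_per_build = 30
total_minutes  = configs * secs_per_run * runs_per_build * 2 / 60
total_minutes  # 60 minutes in total - each 5 s configuration contributes 5 minutes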
Thoughts: currently it's taking ages to clone node core. What are people's thoughts on cloning once onto the machine, then copying that over to the workspace and doing a pull origin master?
Perhaps we could update the central clone once a week?
Investigating why the run @mcollina submitted failed - will update with findings.
How to: Running core benchmarks on Node.js CI
Current function
The current Jenkins job allows the impact of a PR on a particular branch to be compared. For example: how does PR XXXX impact performance on master?
How to configure the job
To start with, visit https://ci.nodejs.org/job/benchmark-node-micro-benchmarks/build?delay=0sec. There you have four values to enter:
BRANCH -
the branch or tag to compare against; defaults to master
PULL_ID -
the pull request to test
These two variables provide the two copies of Node.js to compare: we build BRANCH, then take a copy, apply the diff of PULL_ID, and build again. We now have two builds of Node to compare.
CATEGORY -
a category of benchmarks, i.e. one of the folders under node/benchmark in core
FILTER -
narrows the scope further: only benchmarks whose filename contains FILTER are included. For example, CATEGORY=net with FILTER=net-pipe would run only benchmark/net/net-pipe.js.
On running, a minimum of 30 iterations of each benchmark/variation are executed per build; once they have all finished, Rscript summarises the results, suggesting improvement/regression along with confidence indicators.
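For context, the summary step is essentially a per-benchmark significance test on the raw runs. Below is a minimal R sketch of the idea - the real compare.R in node core is more elaborate, and the CSV column names here are assumptions rather than the job's actual format:
# Assumed layout of output.csv: one row per run with columns
# binary ("old"/"new"), filename, configuration, rate (ops/sec).
results = read.csv("output.csv", strip.white = TRUE)

compare.one = function(d) {
  old = d$rate[d$binary == "old"]
  new = d$rate[d$binary == "new"]
  t = t.test(new, old)  # Welch two-sample t-test on the run-level rates
  data.frame(
    improvement = (mean(new) - mean(old)) / mean(old) * 100,  # % change
    p.value = t$p.value
  )
}

groups = split(results, list(results$filename, results$configuration), drop = TRUE)
print(do.call(rbind, lapply(groups, compare.one)))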
@gareth-ellis It would probably make sense to PR this as a doc somewhere in the repo.
@AndreasMadsen @mcollina @mscdex , if you have some cycles do you want to have a go with the job, using instructions above and see how you get on?
I've also got a PR up for the above doc, with a view to getting something together that we can send out to a wider audience to get people using this job and core benchmarks. (PR in #144 )
I would love to see it in action but right now I don't have anything I would like to benchmark :(
I have started https://ci.nodejs.org/job/benchmark-node-micro-benchmarks/14/
I think the script should automatically rebase the branch, because a difference in the V8 engine could skew the results and provide wrong feedback. Maybe this should be a checkbox, so we can verify across different release lines.
I ran the tests on my laptop in June and got no regressions (https://github.com/nodejs/node/pull/12857#issuecomment-309666756) - this was at the time of V8 5.9.
However the branch is still using an extremely old V8 engine, and we got some very weird results running it now:
improvement confidence p.value
net/net-c2s-cork.js dur=5 type="buf" len=1024 2.56 % 0.330368195
net/net-c2s-cork.js dur=5 type="buf" len=128 -5.73 % * 0.010873902
net/net-c2s-cork.js dur=5 type="buf" len=16 -0.34 % 0.869100857
net/net-c2s-cork.js dur=5 type="buf" len=32 3.41 % 0.218088510
net/net-c2s-cork.js dur=5 type="buf" len=4 -1.04 % 0.491132059
net/net-c2s-cork.js dur=5 type="buf" len=512 -5.45 % * 0.032530436
net/net-c2s-cork.js dur=5 type="buf" len=64 -1.61 % 0.579751020
net/net-c2s-cork.js dur=5 type="buf" len=8 -2.84 % 0.242407708
net/net-c2s.js dur=5 type="asc" len=102400 -0.86 % 0.500537810
net/net-c2s.js dur=5 type="asc" len=16777216 -0.73 % 0.762753668
net/net-c2s.js dur=5 type="buf" len=102400 0.55 % 0.664537769
net/net-c2s.js dur=5 type="buf" len=16777216 1.11 % 0.557571579
net/net-c2s.js dur=5 type="utf" len=102400 -0.39 % 0.822950308
net/net-c2s.js dur=5 type="utf" len=16777216 1.38 % 0.453838521
net/net-pipe.js dur=5 type="asc" len=102400 0.41 % 0.712092837
net/net-pipe.js dur=5 type="asc" len=16777216 0.81 % 0.775137756
net/net-pipe.js dur=5 type="buf" len=102400 0.76 % 0.466008198
net/net-pipe.js dur=5 type="buf" len=16777216 0.90 % 0.558028886
net/net-pipe.js dur=5 type="utf" len=102400 0.54 % 0.671323747
net/net-pipe.js dur=5 type="utf" len=16777216 3.82 % * 0.032465465
net/net-s2c.js dur=5 type="asc" len=102400 0.84 % 0.509765096
net/net-s2c.js dur=5 type="asc" len=16777216 -2.78 % 0.217136214
net/net-s2c.js dur=5 type="buf" len=102400 0.11 % 0.925209854
net/net-s2c.js dur=5 type="buf" len=16777216 0.03 % 0.986008241
net/net-s2c.js dur=5 type="utf" len=102400 -2.64 % 0.129138848
net/net-s2c.js dur=5 type="utf" len=16777216 2.26 % 0.088823248
net/tcp-raw-c2s.js dur=5 type="asc" len=102400 2.51 % 0.292369145
net/tcp-raw-c2s.js dur=5 type="asc" len=16777216 -0.08 % 0.972487579
net/tcp-raw-c2s.js dur=5 type="buf" len=102400 6.07 % 0.052679575
net/tcp-raw-c2s.js dur=5 type="buf" len=16777216 1.04 % 0.562458368
net/tcp-raw-c2s.js dur=5 type="utf" len=102400 5.49 % 0.217685094
net/tcp-raw-c2s.js dur=5 type="utf" len=16777216 -0.06 % 0.884432672
net/tcp-raw-pipe.js dur=5 type="asc" len=102400 0.93 % 0.731389654
net/tcp-raw-pipe.js dur=5 type="asc" len=16777216 -18.48 % 0.304436728
net/tcp-raw-pipe.js dur=5 type="buf" len=102400 -0.78 % 0.781388921
net/tcp-raw-pipe.js dur=5 type="buf" len=16777216 75.96 % 0.165276028
net/tcp-raw-pipe.js dur=5 type="utf" len=102400 -5.66 % ** 0.006928205
net/tcp-raw-pipe.js dur=5 type="utf" len=16777216 -1.05 % 0.669799761
net/tcp-raw-s2c.js dur=5 type="asc" len=102400 0.48 % 0.848273086
net/tcp-raw-s2c.js dur=5 type="asc" len=16777216 -1.09 % 0.664140660
net/tcp-raw-s2c.js dur=5 type="buf" len=102400 5.85 % * 0.046938660
net/tcp-raw-s2c.js dur=5 type="buf" len=16777216 -0.38 % 0.884173926
net/tcp-raw-s2c.js dur=5 type="utf" len=102400 -4.80 % 0.174611084
net/tcp-raw-s2c.js dur=5 type="utf" len=16777216 -0.71 % 0.364776237
Notifying upstream projects of job completion
Finished: SUCCESS
I kicked off the job again: https://ci.nodejs.org/job/benchmark-node-micro-benchmarks/15/.
It would be good to have access to the raw data, so I can use R to plot graphs. Is that pushed somewhere?
I think the script should automatically rebase the branch, because a difference in the V8 engine could skew the results and provide wrong feedback. Maybe this should be a checkbox, so we can verify across different release lines.
The script checks out $BRANCH (master) and builds node, then cherry-picks $PULL_ID on top and builds again. So it should effectively be rebased.
However the branch is still using an extremely old V8 engine, and we got some very weird results running it now:
What are the weird results? You have to remember that there is a small risk of a false positive, and with many independent tests that risk becomes large. At only 5% confidence (one star), the probability of getting exactly 5 false positives in 44 tests is 4.590962%.
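For reference, that number is just the binomial probability of seeing exactly five 5%-level false positives in 44 independent tests, e.g. in R:
# probability of exactly 5 false positives among 44 tests at the 5% level
dbinom(5, size = 44, prob = 0.05)
# [1] 0.04590962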

The plot is only valid for 44 independent tests.
plot code
library(ggplot2)

# probability of exactly k false positives among 44 independent tests,
# at each confidence level (* = 5%, ** = 1%, *** = 0.1%)
c1 = data.frame(
  false.positive = 0:44,
  probability = dbinom(0:44, 44, 0.05),
  confidence = '*'
)
c2 = data.frame(
  false.positive = 0:44,
  probability = dbinom(0:44, 44, 0.01),
  confidence = '**'
)
c3 = data.frame(
  false.positive = 0:44,
  probability = dbinom(0:44, 44, 0.001),
  confidence = '***'
)
c = rbind(c1, c2, c3)

p = ggplot(c, aes(x = false.positive)) +
  geom_bar(aes(y = probability, fill = confidence), stat = "identity") +
  facet_grid(confidence ~ .) +
  scale_x_continuous(breaks = 0:44, limits = c(-0.5, 8))
print(p)
My bad. So, this is fine.
The script works beautifully then. I think we should document it in the main repo, it would be of great help to everyone.
Excellent - I will see what I can do about providing the raw data.
Yes, we plan to get the info out to collaborators so they can start using it, but thought it would be good to have some people test it first :)
I'll add a bit of a description similar to what @AndreasMadsen wrote to explain how it works, and include in the doc in #144 .
Are there any other issues or quirks you noticed that should be documented?
@gareth-ellis +1 to all that you just said.
Running all the http benchmarks without a filter takes forever (44 hours on my laptop and I got to 60%). We should not allow that, or strongly suggest that they need to add a filter. Running a job for 3-4 days does not sound right. IMHO The filter might well be mandatory for this script.
Posting the result to the PR would be a huge plus, but I do not know if that is easy to do.
Running all the http benchmarks without a filter takes forever (44 hours on my laptop and I got to 60%). We should not allow that, or strongly suggest that they need to add a filter. Running a job for 3-4 days does not sound right. IMHO The filter might well be mandatory for this script.
I agree. Statistically, due to the false positive risk, it is also not very useful to run that many tests. I think we should just check the expected benchmark time using a precompiled list as shown earlier https://github.com/nodejs/benchmarking/pull/58#issuecomment-251408048.
@mcollina , for getting to the raw data, if you click on your job, you should be able to open the workspace, and then browse to e.g. https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/ws/benchmarking/experimental/benchmarks/community-benchmark/node/output.csv.
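Until that's automated, once you have that output.csv locally, a quick ggplot2 sketch along these lines should give you distribution plots (the column names are my assumption about the CSV layout, and the benchmark file picked is just an example):
library(ggplot2)

# Assumed layout of output.csv: one row per run with columns
# binary ("old"/"new"), filename, configuration, rate (ops/sec).
d = read.csv("output.csv", strip.white = TRUE)

# Rate distributions for one benchmark file, old build vs new build
one = d[d$filename == "net/net-pipe.js", ]  # example selection
p = ggplot(one, aes(x = configuration, y = rate, fill = binary)) +
  geom_boxplot() +
  coord_flip() +
  labs(y = "rate (ops/sec)", title = "net/net-pipe.js: old vs new")
print(p)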
I'll modify the script to move it to the top level, so there is less moving around. I think the gotcha here is that the workspace may be cleared before each run, so I will rename the file with a date/timestamp to be sure which run it came from. Otherwise, another option may be copying it to a folder on the benchmarking machine and having it synced over to the website for download. We do something similar for the charts (at https://benchmarking.nodejs.org), so it should in theory be possible. Then we could post a link in the job output to where the file will be.
For posting to a PR, I know there are some bots in the org - are there any I can pass data to that will post to a PR? In the job I already have the PR number, so I have the information to send and know where it needs to go.
Both approaches make sense. Even uploading it to a gist on GitHub (together with a chart, maybe) would be fantastic - something that lasts for a while and carries some meta-info (the SHAs being tested).
OK, will look into that.
In the meantime, csvs now get saved in the root of the ws with date/timestamp:
https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/ws/
Are we ready to announce to the rest of the team that they can start using this, or is there any other testing feedback we should get first?
@mhdawson I don't think there is anything to lose. It is not like making this public can cause anything else to fail or break.
@gareth-ellis is there a reason that RUNS wasn't exposed as a parameter to the job?
The main reason was that we were using the default number of iterations set per benchmark. I can add the RUNS parameter if you like; just appreciate that if you set the number too low, the results may not be particularly useful, and if you set it too high with enough benchmarks in scope, it could take a long time to run.
I looked at the code and saw that the script will use it, so I added it, because we have a PR (https://github.com/nodejs/node/pull/16888#issuecomment-343005812) that is very noisy, and AFAICT it would give better results with more runs but shorter iterations.
AFAICT it would give better results with more runs but shorter iterations
In theory, it will only make a tiny difference. After 30-100 runs, the difference should be less than the numerical precision. In practice, decreasing the number of iterations can change the optimization behavior.
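To make the "tiny difference" point concrete, here is a toy R illustration (simulated numbers, not real benchmark data) of how the standard error of the mean shrinks only with the square root of the number of runs:
set.seed(1)
for (n in c(30, 100, 1000)) {
  x = rnorm(n, mean = 100, sd = 10)   # simulated benchmark rates
  cat(sprintf("runs = %4d   mean = %6.2f   standard error = %.2f\n",
              n, mean(x), sd(x) / sqrt(n)))
}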
Look at this distribution:
I think all the outliers are cases where the GC kicks in, and it's not what we're trying to look at.
@refack what are the outliers?
@refack I have made a long comment in the other issue (https://github.com/nodejs/node/pull/16888#issuecomment-343434492), hopefully that will clarify things.
Gareth opened https://github.com/nodejs/node/issues/17359 to let collaborators know about it.
Anything more to be discussed in this issue? The benchmark job is now available and publicized, and the initial issues have been resolved.