Fix NPS measurement for TC scaling
This PR modifies the NPS measurement for TC scaling to more closely resemble actual testing conditions. In particular, it addresses the point raised in https://github.com/official-stockfish/fishtest/issues/2077
Currently we run one process with a bench and one process searching with n-1 threads. This doesn't account for the RAM bandwidth limitations discussed there, so the measured NPS is far higher than the real NPS.
Instead, this PR runs a bench process for each active core and takes the average NPS.
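A minimal sketch of that approach (hypothetical function names; the real worker code also handles signature checks, timeouts and exceptions): start one `bench` process per active core, parse the `Nodes/second` line each one prints, and average.

```python
import re
import statistics
import subprocess

def parse_bench_nps(output: str) -> int:
    """Extract the 'Nodes/second' figure printed at the end of a
    `stockfish bench` run (the summary goes to stderr)."""
    match = re.search(r"Nodes/second\s*:\s*(\d+)", output)
    if match is None:
        raise ValueError("no Nodes/second line in bench output")
    return int(match.group(1))

def average_parallel_nps(engine: str, cores: int, depth: int = 13) -> float:
    """Start one bench process per active core and average the NPS.
    Sketch only: the real worker also verifies the bench signature and
    manages per-process failures."""
    procs = [
        subprocess.Popen(
            [engine, "bench", "16", "1", str(depth)],  # hash=16 MB, 1 thread
            stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True,
        )
        for _ in range(cores)
    ]
    return statistics.mean(parse_bench_nps(p.communicate()[1]) for p in procs)
```

Because all benches run simultaneously, they compete for the same RAM bandwidth and caches, which is what makes the averaged figure closer to real testing conditions.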
Tested with https://tests.stockfishchess.org/tests/view/66729718602682471b065101?show_task=66; works as expected.
I think that's a reasonable direction. Things to consider (I don't know the right answers):
- this will penalise SMP tests, where the actual nps will be higher than the one measured in this way. Could be solved by doing some SMP measurement for SMP tests.
- I have observed that on very-large-core workers the 1 second test might actually not be such a good measurement, as the system is still spawning engines, and the measurement only becomes stable once everything is running.
- This effectively changes the TC for the progression test, so will have some effect there. Maybe that's something to consider merging shortly after release (i.e. when we usually update the reference nps?).
master vs PR
@Disservin the workers look good with the PR. As highlighted by @vondele, merging the PR will change the TC for systems with high concurrency, with a jump in the PT.
New commit with: simplification using concurrent.futures, proper exception management, and a bench for the SMP use case.
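The concurrent.futures pattern could look roughly like this (sketch, not the actual worker code; `bench_fn` stands in for whatever runs a single `stockfish bench` and parses its NPS):

```python
import concurrent.futures
import statistics

def run_benches(bench_fn, concurrency):
    """Run `concurrency` copies of `bench_fn` in parallel and aggregate
    the NPS values they return. Reading each future's result() re-raises
    any exception from a worker, so one failed bench aborts the whole
    measurement instead of silently skewing the average."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as ex:
        futures = [ex.submit(bench_fn) for _ in range(concurrency)]
        nps = [f.result() for f in futures]  # exception management
    mean = statistics.mean(nps)
    std = statistics.stdev(nps) if len(nps) > 1 else 0.0
    return mean, std
```

For the SMP use case, `bench_fn` would run the bench with the test's thread count instead of 1.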
MNps master vs PR for both a normal test (threads=1) and an SMP test (threads=8). As expected, with master there is no difference for threads > 1.
| code | test | Dual Xeon bmi2 | Zen 4 vnni-256 | core i7 popcnt |
|---|---|---|---|---|
| concurrency | virtual cores | 48 | 16 | 8 |
| master | 1 thread | 0.21 | 0.36 | 0.25 |
| master | SMP 8 threads | 0.21 | 0.37 | 0.26 |
| PR | 1 thread | 0.13 | 0.15 | 0.18 |
| PR | SMP 8 threads | 0.24 | 0.44 | 0.40 |
I experimented a bit with stockfish speedtest as a potential replacement for stockfish bench in the worker NPS benchmark, but I didn't find a real difference in precision (ratio stdev / average) when setting a comparable time for the two benchmarks.
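For clarity, the precision figure used here (stdev / average, reported as `ratio` in the tables) is just the coefficient of variation:

```python
import statistics

def precision(samples):
    """Relative precision of repeated NPS measurements: sample standard
    deviation divided by the mean (coefficient of variation)."""
    return statistics.stdev(samples) / statistics.mean(samples)
```

A smaller value means the repeated runs scatter less around their average, i.e. a more repeatable benchmark.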
I have these concerns about using speedtest (I didn't follow its development, so I may be missing something):
- only `bench` can verify the signature of the engine
- `bench` does the same deterministic computation for any worker (with 1 thread)
- `speedtest` should be started with a normalized time to do comparable computations on different workers. If the normalized time is computed with the first run of `bench` (signature validation), we still have a bias on `bench`
- on a powerful CPU `bench` at depth 13 takes less than 1 second. If I'm not wrong, `speedtest` seems to take only an integer parameter for the time, so normalizing the time between workers requires a longer computation
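To illustrate the normalization concern: a per-worker `speedtest` duration would have to be derived from the first `bench` run, e.g. something like the following hypothetical scheme (function name and constants are placeholders, not fishtest code):

```python
import math

def normalized_speedtest_seconds(measured_bench_nps, reference_nps,
                                 reference_seconds=3.0):
    """Hypothetical: scale the speedtest duration so every worker searches
    a comparable number of nodes. Since `speedtest` only accepts an integer
    number of seconds, round up, which biases slow workers toward slightly
    longer runs."""
    seconds = reference_seconds * reference_nps / measured_bench_nps
    return max(1, math.ceil(seconds))
```

Any bias in the first `bench` run propagates directly into this duration, which is the circularity noted above.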
Here are my tests:
- core i7 3770k - `bench` at depth 13 and `speedtest` at equivalent time
Click to view
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 13 20
run sf_base sf_test diff
1 642062 638495 -3567
2 637904 645367 +7463
3 645669 644462 -1207
4 635846 645669 +9823
5 635552 641465 +5913
6 627171 637020 +9849
7 642361 640868 -1493
8 636139 638791 +2652
9 639976 646577 +6601
10 636139 635260 -879
11 636432 640273 +3841
12 642361 643860 +1499
13 636139 645367 +9228
14 635260 637020 +1760
15 637315 648401 +11086
16 634091 641166 +7075
17 637904 645367 +7463
18 636726 645367 +8641
19 643560 643860 +300
20 641763 643860 +2097
sf_base = 638018 +/- 1820 (95%)
sf_test = 642425 +/- 1609 (95%)
diff = 4407 +/- 1951 (95%)
speedup = 0.691% +/- 0.306% (95%)
real 0m54.408s
user 1m45.462s
sys 0m1.335s
average 638018.5 642425.75 4407.25
stdev.s 4154.695562 3673.531713 4451.640122
ratio 0.006511873 0.00571822 0.006953274
$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 3 20
run sf_base sf_test diff
1 526415 528228 +1813
2 523249 529589 +6340
3 521034 525763 +4729
4 521139 524121 +2982
5 523099 522704 -395
6 517411 522574 +5163
7 518389 524723 +6334
8 521461 527732 +6271
9 527087 526548 -539
10 519574 528365 +8791
11 521661 529280 +7619
12 525403 525319 -84
13 520944 518950 -1994
14 517173 526080 +8907
15 516447 518368 +1921
16 519563 522172 +2609
17 519312 518803 -509
18 518095 523615 +5520
19 519267 524004 +4737
20 513206 522394 +9188
sf_base = 520496 +/- 1508 (95%)
sf_test = 524466 +/- 1477 (95%)
diff = 3970 +/- 1534 (95%)
speedup = 0.763% +/- 0.295% (95%)
real 1m12.916s
user 2m23.886s
sys 0m2.049s
average 520496.45 524466.6 3970.15
stdev.s 3441.513796 3370.420345 3500.21296
ratio 0.006611983 0.006426377 0.006699209
- ryzen 7 4880u - `bench` at depth 13 and `speedtest` at equivalent time
Click to view
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 13 20
run sf_base sf_test diff
1 1139793 1110426 -29367
2 1139793 1136974 -2819
3 1153135 1167781 +14646
4 1166793 1159924 -6869
5 1168771 1164822 -3949
6 1153135 1146425 -6710
7 1151210 1159924 +8714
8 1143573 1153135 +9562
9 1147379 1164822 +17443
10 1149291 1158949 +9658
11 1154100 1137912 -16188
12 1140736 1152172 +11436
13 1136037 1141680 +5643
14 1134169 1136974 +2805
15 1141680 1135102 -6578
16 1122172 1120349 -1823
17 1128600 1120349 -8251
18 1110426 1099800 -10626
19 1121260 1112217 -9043
20 1101557 1122172 +20615
sf_base = 1140180 +/- 7505 (95%)
sf_test = 1140095 +/- 8948 (95%)
diff = -85 +/- 5437 (95%)
speedup = -0.007% +/- 0.477% (95%)
real 0m27.261s
user 0m52.480s
sys 0m2.077s
average 1140180.5 1140095.45 -85.05
stdev.s 17125.97215 20418.04072 12406.09801
ratio 0.015020404 0.017909063 0.010881225
$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 1.4 20
run sf_base sf_test diff
1 1002991 977770 -25221
2 980706 988994 +8288
3 980046 976723 -3323
4 1004739 983160 -21579
5 982322 989259 +6937
6 986482 985358 -1124
7 984034 976997 -7037
8 1002855 988772 -14083
9 975933 962120 -13813
10 992360 965047 -27313
11 974111 983993 +9882
12 1004514 992087 -12427
13 970015 991746 +21731
14 956093 978705 +22612
15 965092 982835 +17743
16 954497 940059 -14438
17 968743 956396 -12347
18 956676 955077 -1599
19 947581 945381 -2200
20 942819 950398 +7579
sf_base = 976630 +/- 8384 (95%)
sf_test = 973543 +/- 7232 (95%)
diff = -3086 +/- 6517 (95%)
speedup = -0.316% +/- 0.667% (95%)
real 0m23.242s
user 0m40.157s
sys 0m2.577s
average 976630.45 973543.85 -3086.6
stdev.s 19129.792 16503.11905 14869.88728
ratio 0.019587544 0.016951593 0.015249803
- dual xeon e5-2680v3 - `bench` at depth 13 and `speedtest` at equivalent time
Click to view
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 13 20
run sf_base sf_test diff
1 582411 611054 +28643
2 600936 608627 +7691
3 556108 567551 +11443
4 573451 595487 +22036
5 586872 575605 -11267
6 603566 558586 -44980
7 587372 579718 -7654
8 596518 582411 -14107
9 612411 602775 -9636
10 583396 590642 +7246
11 598849 590389 -8460
12 582165 566851 -15314
13 589884 586373 -3511
14 553208 583890 +30682
15 592927 601986 +9059
16 578745 564300 -14445
17 578259 585377 +7118
18 551219 565921 +14702
19 578988 596002 +17014
20 599630 596002 -3628
sf_base = 584345 +/- 7273 (95%)
sf_test = 585477 +/- 6744 (95%)
diff = 1131 +/- 7880 (95%)
speedup = 0.194% +/- 1.349% (95%)
real 0m58.222s
user 1m51.579s
sys 0m3.298s
average 584345.75 585477.35 1131.6
stdev.s 16597.09802 15389.8147 17981.80839
ratio 0.028402873 0.026285927 0.030742782
$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 3 20
run sf_base sf_test diff
1 493811 487185 -6626
2 461326 498548 +37222
3 483797 483261 -536
4 477423 444543 -32880
5 492298 488488 -3810
6 491763 500013 +8250
7 480248 471136 -9112
8 491465 499068 +7603
9 489841 490662 +821
10 477191 496783 +19592
11 486842 496983 +10141
12 481331 482163 +832
13 476472 491636 +15164
14 469090 478852 +9762
15 483396 491132 +7736
16 493891 486311 -7580
17 497079 491734 -5345
18 493707 493684 -23
19 484103 493925 +9822
20 461060 477934 +16874
sf_base = 483306 +/- 4615 (95%)
sf_test = 487202 +/- 5555 (95%)
diff = 3895 +/- 6174 (95%)
speedup = 0.806% +/- 1.278% (95%)
real 1m12.835s
user 2m19.874s
sys 0m4.604s
average 483306.7 487202.05 3895.35
stdev.s 10531.58584 12675.80741 14088.20323
ratio 0.021790689 0.026017558 0.029032615
- core i7 3770k - `bench` at depth 20 and `speedtest` at equivalent time
Click to view
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 20 10
run sf_base sf_test diff
1 814720 820250 +5530
2 821560 824853 +3293
3 815993 820013 +4020
4 812556 820033 +7477
5 815719 825736 +10017
6 825214 834174 +8960
7 802381 805980 +3599
8 811954 819854 +7900
9 823653 825154 +1501
10 825094 829163 +4069
sf_base = 816884 +/- 4453 (95%)
sf_test = 822521 +/- 4624 (95%)
diff = 5636 +/- 1735 (95%)
speedup = 0.690% +/- 0.212% (95%)
real 7m59.444s
user 15m55.181s
sys 0m0.667s
average 816884.4 822521 5636.6
stdev.s 7185.516531 7460.7933 2800.248687
ratio 0.008796246 0.009070642 0.003416176
$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 48 10
run sf_base sf_test diff
1 500816 512525 +11709
2 505838 509329 +3491
3 498995 504984 +5989
4 502396 506823 +4427
5 496036 503498 +7462
6 496789 504979 +8190
7 491413 491360 -53
8 500142 508236 +8094
9 492447 494211 +1764
10 497343 496064 -1279
sf_base = 498221 +/- 2723 (95%)
sf_test = 503200 +/- 4340 (95%)
diff = 4979 +/- 2528 (95%)
speedup = 0.999% +/- 0.508% (95%)
real 9m25.012s
user 18m48.269s
sys 0m1.121s
average 498221.5 503200.9 4979.4
stdev.s 4393.888767 7002.810554 4080.178297
ratio 0.008819147 0.01391653 0.008148766
- ryzen 7 4880u - `bench` at depth 20 and `speedtest` at equivalent time
Click to view
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 20 10
run sf_base sf_test diff
1 1420699 1420462 -237
2 1416137 1449753 +33616
3 1433292 1435595 +2303
4 1453538 1466406 +12868
5 1402336 1447775 +45439
6 1429611 1438331 +8720
7 1438026 1435352 -2674
8 1457781 1471170 +13389
9 1454472 1437174 -17298
10 1451860 1433111 -18749
sf_base = 1435775 +/- 11666 (95%)
sf_test = 1443512 +/- 9650 (95%)
diff = 7737 +/- 12533 (95%)
speedup = 0.539% +/- 0.873% (95%)
real 4m22.872s
user 8m41.477s
sys 0m1.139s
average 1435775.2 1443512.9 7737.7
stdev.s 18822.04371 15570.57138 20221.45002
ratio 0.013109325 0.010786583 0.014046146
$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 27 10
run sf_base sf_test diff
1 961782 968902 +7120
2 969845 960509 -9336
3 921825 951732 +29907
4 948890 948322 -568
5 952444 956790 +4346
6 951687 939523 -12164
7 954338 970641 +16303
8 869486 879096 +9610
9 947384 940568 -6816
10 955128 954831 -297
sf_base = 943280 +/- 17796 (95%)
sf_test = 947091 +/- 16150 (95%)
diff = 3810 +/- 7891 (95%)
speedup = 0.404% +/- 0.837% (95%)
real 5m5.283s
user 10m8.288s
sys 0m1.521s
average 943280.9 947091.4 3810.5
stdev.s 28712.21306 26056.65103 12732.04205
ratio 0.030438667 0.027512288 0.013470407
- dual xeon e5-2680v3 - `bench` at depth 20 and `speedtest` at equivalent time
Click to view
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 20 10
run sf_base sf_test diff
1 768409 750526 -17883
2 753189 757913 +4724
3 759048 748640 -10408
4 738689 731056 -7633
5 738770 752722 +13952
6 751223 746582 -4641
7 748623 748772 +149
8 753908 756192 +2284
9 747733 752005 +4272
10 741253 735840 -5413
sf_base = 750084 +/- 5789 (95%)
sf_test = 748024 +/- 5259 (95%)
diff = -2059 +/- 5602 (95%)
speedup = -0.275% +/- 0.747% (95%)
real 8m28.918s
user 16m50.191s
sys 0m2.022s
average 750084.5 748024.8 -2059.7
stdev.s 9340.599222 8485.735165 9038.620667
ratio 0.012452729 0.01134419 0.012066704
$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 50 10
run sf_base sf_test diff
1 482089 479438 -2651
2 475337 463893 -11444
3 455283 466308 +11025
4 475370 463149 -12221
5 479778 470888 -8890
6 465646 473220 +7574
7 473976 474571 +595
8 468684 464385 -4299
9 465412 468351 +2939
10 461725 456763 -4962
sf_base = 470330 +/- 5229 (95%)
sf_test = 468096 +/- 4091 (95%)
diff = -2233 +/- 4834 (95%)
speedup = -0.475% +/- 1.028% (95%)
real 9m30.372s
user 18m57.413s
sys 0m2.675s
average 470330 468096.6 -2233.4
stdev.s 8436.678388 6601.562293 7799.628299
ratio 0.017937785 0.014102991 0.016622778
speedtest shows a slightly higher ratio between CPUs; the ratio is fairly steady for either bench or speedtest from depth 13 to depth 20.
| | i7 3770k | ryzen7 4880u | dual xeon e5-2680v3 | i7 3770k (ratio) | ryzen7 4880u (ratio) | dual xeon e5-2680v3 (ratio) |
|---|---|---|---|---|---|---|
| bench 13 | 638018 | 1140180 | 584345 | 1 | 1.787065569 | 0.915875414 |
| speedtest | 520496 | 976630 | 483306 | 1 | 1.876344871 | 0.928548923 |
| bench 20 | 816884 | 1435775 | 750084 | 1 | 1.757624094 | 0.918225844 |
| speedtest | 498221 | 943280 | 470330 | 1 | 1.893296348 | 0.944018819 |
> this will penalise SMP tests, where the actual nps will be higher than the one measured in this way. Could be solved by doing some SMP measurement for SMP tests.

Done.

> I have observed that on very large core workers the 1 second test might actually not be such a good measurement, as the system is spawning engines and only once everything is running the measurement becomes stable.

We can run bench a second time with a depth > 13, the code change is easy. We should compute the reference values with the new depth, though.

> This effectively changes the TC for the progression test, so will have some effect there. Maybe that's something to consider merging shortly after release (i.e. when we usually update the reference nps?).

To be discussed.
I think this looks good. By averaging the NPS over the parallel benches, the result is probably reliable also for big-core workers. The worker output (not sent to the server) could actually print min / max / average and std dev of the measured benches.
If at a later stage we see that this worker output is strange, we can always think about making it more robust (for example, the running worker process could run multiple benches in sequence until told to stop, stopping only once the speed measurement has become statistically meaningful).
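Printing those per-bench statistics could be as simple as the following sketch (hypothetical helper, not the actual worker code):

```python
import statistics

def bench_stats(nps_values):
    """Summarise a set of parallel bench measurements: min / max / mean /
    median, plus the standard deviation in absolute terms and as a
    percentage of the mean."""
    mean = statistics.mean(nps_values)
    std = statistics.stdev(nps_values) if len(nps_values) > 1 else 0.0
    return {
        "mean": mean,
        "median": statistics.median(nps_values),
        "min": min(nps_values),
        "max": max(nps_values),
        "std": std,
        "std_pct": 100.0 * std / mean,
    }
```

With one bench per core, a large `std_pct` would flag an unreliable measurement before it reaches the server.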
@vondele refactored to bench at different depths and print statistics; feel free to test with one of your workers by adding a test to DEV.
Dual Xeon e5-2680v3:
Click to view
- test at 8 threads
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_clang++_19.1.7_env_25b32ad706:
Concurrency : 6.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 5159.33
Mean nps : 500369.18
Median nps : 500238.70
Min nps : 496245.48
Max nps : 504488.76
Std nps : 3452.50
Std (%) : 0.69
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_clang++_19.1.7_env_25b32ad706:
Concurrency : 6.00
Threads : 8.00
Depth : 13.00
Mean nodes : 15418669.67
Mean time (ms) : 8460.00
Mean nps : 227595.71
Median nps : 226277.58
Min nps : 216669.57
Max nps : 238263.01
Std nps : 8448.50
Std (%) : 3.71
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_clang++_19.1.7_env_25b32ad706:
Concurrency : 6.00
Threads : 8.00
Depth : 15.00
Mean nodes : 30438644.67
Mean time (ms) : 17910.17
Mean nps : 212327.81
Median nps : 212579.27
Min nps : 202593.01
Max nps : 223973.59
Std nps : 8462.53
Std (%) : 3.99
- test at 1 thread
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_clang++_19.1.7_env_25b32ad706:
Concurrency : 48.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 21101.69
Mean nps : 122822.95
Median nps : 121753.71
Min nps : 109185.34
Max nps : 140994.54
Std nps : 7906.37
Std (%) : 6.44
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_clang++_19.1.7_env_25b32ad706:
Concurrency : 48.00
Threads : 1.00
Depth : 15.00
Mean nodes : 5618477.00
Mean time (ms) : 45923.19
Mean nps : 122798.45
Median nps : 120193.22
Min nps : 113150.28
Max nps : 145533.78
Std nps : 7847.12
Std (%) : 6.39
i7 3770k:
Click to view
- test at 8 threads
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_14.2.0_env_25b32ad706:
Concurrency : 1.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 4501.00
Mean nps : 573532.33
Median nps : 573532.33
Min nps : 573532.33
Max nps : 573532.33
Std nps : 0.00
Std (%) : 0.00
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_14.2.0_env_25b32ad706:
Concurrency : 1.00
Threads : 8.00
Depth : 13.00
Mean nodes : 16950655.00
Mean time (ms) : 5812.00
Mean nps : 364561.58
Median nps : 364561.58
Min nps : 364561.58
Max nps : 364561.58
Std nps : 0.00
Std (%) : 0.00
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_14.2.0_env_25b32ad706:
Concurrency : 1.00
Threads : 8.00
Depth : 15.00
Mean nodes : 34282925.00
Mean time (ms) : 11848.00
Mean nps : 361695.28
Median nps : 361695.28
Min nps : 361695.28
Max nps : 361695.28
Std nps : 0.00
Std (%) : 0.00
- test at 1 thread
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_14.2.0_env_25b32ad706:
Concurrency : 8.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 11712.25
Mean nps : 220411.17
Median nps : 220402.99
Min nps : 219102.78
Max nps : 222272.17
Std nps : 948.47
Std (%) : 0.43
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_14.2.0_env_25b32ad706:
Concurrency : 8.00
Threads : 1.00
Depth : 15.00
Mean nodes : 5618477.00
Mean time (ms) : 24276.38
Mean nps : 231442.92
Median nps : 231675.45
Min nps : 229344.31
Max nps : 232822.68
Std nps : 1132.35
Std (%) : 0.49
So, some experiment on a 4 socket system is attached below. One setup where we have 1 worker per socket having concurrency 70 ('the normal one'), and one where the 4 sockets is one worker with concurrency 280 (not used so far). I'm not so sure I fully understand the output.
The PR right now has code to get data and decide how to finalize the code.
Here your 70-core worker has played an SMP test with 8 threads
- first run (no SMP) to get the signature and to load the CPU for the bench
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 8.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 2899.50
Mean nps : 890339.43
Median nps : 889856.57
Min nps : 882553.50
Max nps : 897902.26
Std nps : 4962.26
Std (%) : 0.56
- optional second run to get the bench in SMP mode (skipped if threads = 1)
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 8.00
Threads : 8.00
Depth : 13.00
Mean nodes : 16462227.75
Mean time (ms) : 2180.88
Mean nps : 943494.35
Median nps : 943172.67
Min nps : 930369.25
Max nps : 968025.37
Std nps : 12214.14
Std (%) : 1.29
- third run to get the bench at depth 15 (to be compared with depth 13; depth 15 takes twice as long to complete)
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 8.00
Threads : 8.00
Depth : 15.00
Mean nodes : 35905318.88
Mean time (ms) : 4658.38
Mean nps : 956206.88
Median nps : 925305.11
Min nps : 897897.14
Max nps : 1133735.16
Std nps : 75539.71
Std (%) : 7.90
Here your 70-core worker has played a normal test with 1 thread
- first run (no SMP) to get the signature and to load the CPU for the bench
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 70.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 4644.51
Mean nps : 555904.28
Median nps : 556111.48
Min nps : 538479.14
Max nps : 567730.15
Std nps : 7257.03
Std (%) : 1.31
- third run to get the bench at depth 15 (to be compared with depth 13; depth 15 takes twice as long to complete)
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 70.00
Threads : 1.00
Depth : 15.00
Mean nodes : 5618477.00
Mean time (ms) : 10038.73
Mean nps : 559725.93
Median nps : 559525.67
Min nps : 547556.48
Max nps : 570751.42
Std nps : 5095.36
Std (%) : 0.91
Your worker at 70 cores completes the bench at depth 13 in 4.6 s with threads=1 and 2.2 s with threads=8; the relative std (to the average) is low with depth=13 and rises with depth=15 in SMP mode (as expected, the bench is not deterministic). The NPS at depth=13 and depth=15 have similar values.
Your worker at 70 cores right now, with the bench from fishtest master, is playing at 62% of the correct normalized TC in tests at threads=1.
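For context, the TC normalization in question scales the nominal time control by the ratio of a reference NPS to the worker's measured NPS, so an overestimated bench directly shortens the effective TC. A minimal sketch (not the exact fishtest code; names are illustrative):

```python
def scaled_tc(base_tc, measured_nps, reference_nps):
    """Adjust a base time control so a worker searches roughly the same
    number of nodes as the reference machine. A worker measuring half the
    reference NPS gets twice the time."""
    return base_tc * reference_nps / measured_nps
```

If the bench overestimates the real NPS (as master does on high-concurrency workers), `scaled_tc` comes out too short, which is the bias this PR addresses.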
The worker at 280 cores has a very bad Std (%)
- SMP test at threads=8
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 35.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 6386.57
Mean nps : 408897.36
Median nps : 392678.58
Min nps : 386910.82
Max nps : 696376.85
Std nps : 55421.42
Std (%) : 13.55
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 35.00
Threads : 8.00
Depth : 13.00
Mean nodes : 23262588.37
Mean time (ms) : 11246.20
Mean nps : 291680.08
Median nps : 192934.41
Min nps : 151146.84
Max nps : 891885.50
Std nps : 192665.80
Std (%) : 66.05
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 35.00
Threads : 8.00
Depth : 15.00
Mean nodes : 49991699.43
Mean time (ms) : 28419.77
Mean nps : 228077.28
Median nps : 193784.30
Min nps : 119811.77
Max nps : 486916.21
Std nps : 99587.98
Std (%) : 43.66
- normal test at threads=1
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 280.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 23777.35
Mean nps : 146096.75
Median nps : 78968.18
Min nps : 72978.51
Max nps : 316007.96
Std nps : 86195.98
Std (%) : 59.00
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 280.00
Threads : 1.00
Depth : 15.00
Mean nodes : 5618477.00
Mean time (ms) : 45364.72
Mean nps : 158492.64
Median nps : 116133.40
Min nps : 81762.55
Max nps : 326769.63
Std nps : 84617.85
Std (%) : 53.39
@Disservin @vondele in SMP tests depth=13 has a better Std (%) than depth=15, and in normal tests the Std (%) is low even with a worker at concurrency=70.
IMO the PR is ready; we have collected the information to assess the impact on the regression test, and it's up to you to decide when to merge (right now or after SF 17.1).
BTW before the introduction of fastchess, the laptop worker at concurrency=7 got a decent NPS bench estimation (statistically we benched 1 worker while other 9 were loading the CPU/memory). Now, at concurrency=70, the NPS bench is demonstrably wrong. From this point of view the switch to higher-concurrency workers disrupted the regression test and biased the normal tests; this PR should fix those issues, and we should merge it before the big workers join.
I think it still needs some testing to decide here.
What I'm assuming is that the OS has a bit of a problem loading 70 copies of a 100+ MB binary, and during that time we run a ±2 s bench. The situation during game play might be different, as the same binary is kept alive for a longer time. I'll play a bit locally to see if I can understand this better.
With concurrency = 70 and threads = 1 your worker takes 4.6 seconds for depth=13 and 10 seconds for depth=15; the NPS benches are very similar, both with low Std (%).
Right now, the PR runs 2 consecutive benches (the statistics are computed later), the first one to get the signature and to preload CPU/RAM/caches. For SMP, the first bench is run with concurrency * threads to preload properly all the silicon involved in the second bench.
Take your time to experiment with the PR, but please keep in mind that your workers at 70 cores have a very wrong bench (and adjusted TC) wrt your workers at 7 cores. Right now, your workers at 70 cores are biasing the framework.
First, afaict there is no real difference between the 7- and the 70-core workers; that's just a different binding of workers on the same socket. I did some tests with a modified SF to run multiple benches in a row, and there was no real effect, so the current method seems to measure things more or less as expected (at least on 1 socket). What we're seeing is, I assume, the slowdown from each concurrent process having its own copy of the net competing for the same cache. That's why the behavior is very different when loading with processes vs. with threads. This can be seen well in this test (which uses the python code in this test):
I actually think other workers will show a similar behavior. However, that's also how we currently test, so it does reflect reality.
I'm talking about fishtest with master. The difference is that with 10 workers@7, after a while, we are benching 1 worker while the other 9 are running tests competing for the caches, so the single worker is benched at the right side of your chart. Benching 1 worker@70, we are running only 2 SF processes, caches always free, so the bench is at the left side of your chart. From your chart and your previous runs, the TC is nearly 40% off.
OK, I see what you mean now, at startup the difference is not big, but once the other workers are running tests, it does make a difference.
OK, I think this is ready to go.
Triggered the workers update, thank you @Viren6 @jw1912 :)