Fix NPS measurement for TC scaling
This PR modifies the NPS measurement for TC scaling to more closely resemble actual testing conditions. In particular, it addresses the point raised in https://github.com/official-stockfish/fishtest/issues/2077
Currently we run one process with a bench and one process searching with n-1 threads. This doesn't account for the RAM bandwidth limitations discussed there, so the measured NPS is far higher than the real NPS.
Instead, this PR runs a bench process for each active core and takes the average NPS.
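A minimal sketch of that approach (hypothetical function names; the real worker code also handles signature checks, timeouts and exceptions): start one `bench` process per active core, parse the `Nodes/second` line each one prints, and average.

```python
import re
import statistics
import subprocess

def parse_bench_nps(output: str) -> int:
    """Extract the 'Nodes/second' figure printed at the end of a
    `stockfish bench` run (the summary goes to stderr)."""
    match = re.search(r"Nodes/second\s*:\s*(\d+)", output)
    if match is None:
        raise ValueError("no Nodes/second line in bench output")
    return int(match.group(1))

def average_parallel_nps(engine: str, cores: int, depth: int = 13) -> float:
    """Start one bench process per active core and average the NPS.
    Sketch only: the real worker also verifies the bench signature and
    manages per-process failures."""
    procs = [
        subprocess.Popen(
            [engine, "bench", "16", "1", str(depth)],  # hash=16 MB, 1 thread
            stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True,
        )
        for _ in range(cores)
    ]
    return statistics.mean(parse_bench_nps(p.communicate()[1]) for p in procs)
```

Because all benches run simultaneously, they compete for the same RAM bandwidth and caches, which is what makes the averaged figure closer to real testing conditions.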
Tested with https://tests.stockfishchess.org/tests/view/66729718602682471b065101?show_task=66; works as expected.
I think that's a reasonable direction. Things to consider (I don't know the right answers):
- this will penalise SMP tests, where the actual nps will be higher than the one measured in this way. Could be solved by doing some SMP measurement for SMP tests.
- I have observed that on very-large-core workers the 1 second test might actually not be such a good measurement, as the system is still spawning engines, and the measurement only becomes stable once everything is running.
- This effectively changes the TC for the progression test, so will have some effect there. Maybe that's something to consider merging shortly after release (i.e. when we usually update the reference nps?).
master vs PR
@Disservin the workers look good with the PR. As highlighted by @vondele, merging the PR will change the TC for systems with high concurrency, with a jump in the PT.
New commit with: simplification using concurrent.futures, proper exception management, and a bench for the SMP use case.
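The concurrent.futures pattern could look roughly like this (sketch, not the actual worker code; `bench_fn` stands in for whatever runs a single `stockfish bench` and parses its NPS):

```python
import concurrent.futures
import statistics

def run_benches(bench_fn, concurrency):
    """Run `concurrency` copies of `bench_fn` in parallel and aggregate
    the NPS values they return. Reading each future's result() re-raises
    any exception from a worker, so one failed bench aborts the whole
    measurement instead of silently skewing the average."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as ex:
        futures = [ex.submit(bench_fn) for _ in range(concurrency)]
        nps = [f.result() for f in futures]  # exception management
    mean = statistics.mean(nps)
    std = statistics.stdev(nps) if len(nps) > 1 else 0.0
    return mean, std
```

For the SMP use case, `bench_fn` would run the bench with the test's thread count instead of 1.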
MNps master vs PR for both a normal test (threads=1) and an SMP test (threads=8). As expected, with master there is no difference for threads > 1.
| code | test | Dual Xeon bmi2 | Zen 4 vnni-256 | core i7 popcnt |
|---|---|---|---|---|
| concurrency | virtual cores | 48 | 16 | 8 |
| master | 1 thread | 0.21 | 0.36 | 0.25 |
| master | SMP 8 threads | 0.21 | 0.37 | 0.26 |
| PR | 1 thread | 0.13 | 0.15 | 0.18 |
| PR | SMP 8 threads | 0.24 | 0.44 | 0.40 |
I experimented a bit with stockfish speedtest as a potential replacement for stockfish bench in the worker NPS benchmark, but I didn't find a real difference in precision (ratio stdev / average) when setting a comparable time for the two benchmarks.
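For clarity, the precision figure used here (stdev / average, reported as `ratio` in the tables) is just the coefficient of variation:

```python
import statistics

def precision(samples):
    """Relative precision of repeated NPS measurements: sample standard
    deviation divided by the mean (coefficient of variation)."""
    return statistics.stdev(samples) / statistics.mean(samples)
```

A smaller value means the repeated runs scatter less around their average, i.e. a more repeatable benchmark.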
I have these concerns about using speedtest (I didn't follow its development, so I may be missing something):
- only `bench` can verify the signature of the engine
- `bench` does the same deterministic computation for any worker (with 1 thread)
- `speedtest` should be started with a normalized time to do comparable computations on different workers. If the normalized time is computed with the first run of `bench` (signature validation), we still have a bias on `bench`
- on a powerful CPU `bench` at depth 13 takes less than 1 second. If I'm not wrong, `speedtest` seems to take only an integer parameter for the time, so normalizing the time between workers requires a longer computation
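To illustrate the normalization concern: a per-worker `speedtest` duration would have to be derived from the first `bench` run, e.g. something like the following hypothetical scheme (function name and constants are placeholders, not fishtest code):

```python
import math

def normalized_speedtest_seconds(measured_bench_nps, reference_nps,
                                 reference_seconds=3.0):
    """Hypothetical: scale the speedtest duration so every worker searches
    a comparable number of nodes. Since `speedtest` only accepts an integer
    number of seconds, round up, which biases slow workers toward slightly
    longer runs."""
    seconds = reference_seconds * reference_nps / measured_bench_nps
    return max(1, math.ceil(seconds))
```

Any bias in the first `bench` run propagates directly into this duration, which is the circularity noted above.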
Here are my tests:
- core i7 3770k - `bench` at depth 13 and `speedtest` at equivalent time
Click to view
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 13 20
run sf_base sf_test diff
1 642062 638495 -3567
2 637904 645367 +7463
3 645669 644462 -1207
4 635846 645669 +9823
5 635552 641465 +5913
6 627171 637020 +9849
7 642361 640868 -1493
8 636139 638791 +2652
9 639976 646577 +6601
10 636139 635260 -879
11 636432 640273 +3841
12 642361 643860 +1499
13 636139 645367 +9228
14 635260 637020 +1760
15 637315 648401 +11086
16 634091 641166 +7075
17 637904 645367 +7463
18 636726 645367 +8641
19 643560 643860 +300
20 641763 643860 +2097
sf_base = 638018 +/- 1820 (95%)
sf_test = 642425 +/- 1609 (95%)
diff = 4407 +/- 1951 (95%)
speedup = 0.691% +/- 0.306% (95%)
real 0m54.408s
user 1m45.462s
sys 0m1.335s
average 638018.5 642425.75 4407.25
stdev.s 4154.695562 3673.531713 4451.640122
ratio 0.006511873 0.00571822 0.006953274
$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 3 20
run sf_base sf_test diff
1 526415 528228 +1813
2 523249 529589 +6340
3 521034 525763 +4729
4 521139 524121 +2982
5 523099 522704 -395
6 517411 522574 +5163
7 518389 524723 +6334
8 521461 527732 +6271
9 527087 526548 -539
10 519574 528365 +8791
11 521661 529280 +7619
12 525403 525319 -84
13 520944 518950 -1994
14 517173 526080 +8907
15 516447 518368 +1921
16 519563 522172 +2609
17 519312 518803 -509
18 518095 523615 +5520
19 519267 524004 +4737
20 513206 522394 +9188
sf_base = 520496 +/- 1508 (95%)
sf_test = 524466 +/- 1477 (95%)
diff = 3970 +/- 1534 (95%)
speedup = 0.763% +/- 0.295% (95%)
real 1m12.916s
user 2m23.886s
sys 0m2.049s
average 520496.45 524466.6 3970.15
stdev.s 3441.513796 3370.420345 3500.21296
ratio 0.006611983 0.006426377 0.006699209
- ryzen 7 4880u - `bench` at depth 13 and `speedtest` at equivalent time
Click to view
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 13 20
run sf_base sf_test diff
1 1139793 1110426 -29367
2 1139793 1136974 -2819
3 1153135 1167781 +14646
4 1166793 1159924 -6869
5 1168771 1164822 -3949
6 1153135 1146425 -6710
7 1151210 1159924 +8714
8 1143573 1153135 +9562
9 1147379 1164822 +17443
10 1149291 1158949 +9658
11 1154100 1137912 -16188
12 1140736 1152172 +11436
13 1136037 1141680 +5643
14 1134169 1136974 +2805
15 1141680 1135102 -6578
16 1122172 1120349 -1823
17 1128600 1120349 -8251
18 1110426 1099800 -10626
19 1121260 1112217 -9043
20 1101557 1122172 +20615
sf_base = 1140180 +/- 7505 (95%)
sf_test = 1140095 +/- 8948 (95%)
diff = -85 +/- 5437 (95%)
speedup = -0.007% +/- 0.477% (95%)
real 0m27.261s
user 0m52.480s
sys 0m2.077s
average 1140180.5 1140095.45 -85.05
stdev.s 17125.97215 20418.04072 12406.09801
ratio 0.015020404 0.017909063 0.010881225
$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 1.4 20
run sf_base sf_test diff
1 1002991 977770 -25221
2 980706 988994 +8288
3 980046 976723 -3323
4 1004739 983160 -21579
5 982322 989259 +6937
6 986482 985358 -1124
7 984034 976997 -7037
8 1002855 988772 -14083
9 975933 962120 -13813
10 992360 965047 -27313
11 974111 983993 +9882
12 1004514 992087 -12427
13 970015 991746 +21731
14 956093 978705 +22612
15 965092 982835 +17743
16 954497 940059 -14438
17 968743 956396 -12347
18 956676 955077 -1599
19 947581 945381 -2200
20 942819 950398 +7579
sf_base = 976630 +/- 8384 (95%)
sf_test = 973543 +/- 7232 (95%)
diff = -3086 +/- 6517 (95%)
speedup = -0.316% +/- 0.667% (95%)
real 0m23.242s
user 0m40.157s
sys 0m2.577s
average 976630.45 973543.85 -3086.6
stdev.s 19129.792 16503.11905 14869.88728
ratio 0.019587544 0.016951593 0.015249803
- dual xeon e5-2680v3 - `bench` at depth 13 and `speedtest` at equivalent time
Click to view
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 13 20
run sf_base sf_test diff
1 582411 611054 +28643
2 600936 608627 +7691
3 556108 567551 +11443
4 573451 595487 +22036
5 586872 575605 -11267
6 603566 558586 -44980
7 587372 579718 -7654
8 596518 582411 -14107
9 612411 602775 -9636
10 583396 590642 +7246
11 598849 590389 -8460
12 582165 566851 -15314
13 589884 586373 -3511
14 553208 583890 +30682
15 592927 601986 +9059
16 578745 564300 -14445
17 578259 585377 +7118
18 551219 565921 +14702
19 578988 596002 +17014
20 599630 596002 -3628
sf_base = 584345 +/- 7273 (95%)
sf_test = 585477 +/- 6744 (95%)
diff = 1131 +/- 7880 (95%)
speedup = 0.194% +/- 1.349% (95%)
real 0m58.222s
user 1m51.579s
sys 0m3.298s
average 584345.75 585477.35 1131.6
stdev.s 16597.09802 15389.8147 17981.80839
ratio 0.028402873 0.026285927 0.030742782
$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 3 20
run sf_base sf_test diff
1 493811 487185 -6626
2 461326 498548 +37222
3 483797 483261 -536
4 477423 444543 -32880
5 492298 488488 -3810
6 491763 500013 +8250
7 480248 471136 -9112
8 491465 499068 +7603
9 489841 490662 +821
10 477191 496783 +19592
11 486842 496983 +10141
12 481331 482163 +832
13 476472 491636 +15164
14 469090 478852 +9762
15 483396 491132 +7736
16 493891 486311 -7580
17 497079 491734 -5345
18 493707 493684 -23
19 484103 493925 +9822
20 461060 477934 +16874
sf_base = 483306 +/- 4615 (95%)
sf_test = 487202 +/- 5555 (95%)
diff = 3895 +/- 6174 (95%)
speedup = 0.806% +/- 1.278% (95%)
real 1m12.835s
user 2m19.874s
sys 0m4.604s
average 483306.7 487202.05 3895.35
stdev.s 10531.58584 12675.80741 14088.20323
ratio 0.021790689 0.026017558 0.029032615
- core i7 3770k - `bench` at depth 20 and `speedtest` at equivalent time
Click to view
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 20 10
run sf_base sf_test diff
1 814720 820250 +5530
2 821560 824853 +3293
3 815993 820013 +4020
4 812556 820033 +7477
5 815719 825736 +10017
6 825214 834174 +8960
7 802381 805980 +3599
8 811954 819854 +7900
9 823653 825154 +1501
10 825094 829163 +4069
sf_base = 816884 +/- 4453 (95%)
sf_test = 822521 +/- 4624 (95%)
diff = 5636 +/- 1735 (95%)
speedup = 0.690% +/- 0.212% (95%)
real 7m59.444s
user 15m55.181s
sys 0m0.667s
average 816884.4 822521 5636.6
stdev.s 7185.516531 7460.7933 2800.248687
ratio 0.008796246 0.009070642 0.003416176
$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 48 10
run sf_base sf_test diff
1 500816 512525 +11709
2 505838 509329 +3491
3 498995 504984 +5989
4 502396 506823 +4427
5 496036 503498 +7462
6 496789 504979 +8190
7 491413 491360 -53
8 500142 508236 +8094
9 492447 494211 +1764
10 497343 496064 -1279
sf_base = 498221 +/- 2723 (95%)
sf_test = 503200 +/- 4340 (95%)
diff = 4979 +/- 2528 (95%)
speedup = 0.999% +/- 0.508% (95%)
real 9m25.012s
user 18m48.269s
sys 0m1.121s
average 498221.5 503200.9 4979.4
stdev.s 4393.888767 7002.810554 4080.178297
ratio 0.008819147 0.01391653 0.008148766
- ryzen 7 4880u - `bench` at depth 20 and `speedtest` at equivalent time
Click to view
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 20 10
run sf_base sf_test diff
1 1420699 1420462 -237
2 1416137 1449753 +33616
3 1433292 1435595 +2303
4 1453538 1466406 +12868
5 1402336 1447775 +45439
6 1429611 1438331 +8720
7 1438026 1435352 -2674
8 1457781 1471170 +13389
9 1454472 1437174 -17298
10 1451860 1433111 -18749
sf_base = 1435775 +/- 11666 (95%)
sf_test = 1443512 +/- 9650 (95%)
diff = 7737 +/- 12533 (95%)
speedup = 0.539% +/- 0.873% (95%)
real 4m22.872s
user 8m41.477s
sys 0m1.139s
average 1435775.2 1443512.9 7737.7
stdev.s 18822.04371 15570.57138 20221.45002
ratio 0.013109325 0.010786583 0.014046146
$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 27 10
run sf_base sf_test diff
1 961782 968902 +7120
2 969845 960509 -9336
3 921825 951732 +29907
4 948890 948322 -568
5 952444 956790 +4346
6 951687 939523 -12164
7 954338 970641 +16303
8 869486 879096 +9610
9 947384 940568 -6816
10 955128 954831 -297
sf_base = 943280 +/- 17796 (95%)
sf_test = 947091 +/- 16150 (95%)
diff = 3810 +/- 7891 (95%)
speedup = 0.404% +/- 0.837% (95%)
real 5m5.283s
user 10m8.288s
sys 0m1.521s
average 943280.9 947091.4 3810.5
stdev.s 28712.21306 26056.65103 12732.04205
ratio 0.030438667 0.027512288 0.013470407
- dual xeon e5-2680v3 - `bench` at depth 20 and `speedtest` at equivalent time
Click to view
$ time bash bench_parallel.sh ./stockfish_clang ./stockfish_clang_native 20 10
run sf_base sf_test diff
1 768409 750526 -17883
2 753189 757913 +4724
3 759048 748640 -10408
4 738689 731056 -7633
5 738770 752722 +13952
6 751223 746582 -4641
7 748623 748772 +149
8 753908 756192 +2284
9 747733 752005 +4272
10 741253 735840 -5413
sf_base = 750084 +/- 5789 (95%)
sf_test = 748024 +/- 5259 (95%)
diff = -2059 +/- 5602 (95%)
speedup = -0.275% +/- 0.747% (95%)
real 8m28.918s
user 16m50.191s
sys 0m2.022s
average 750084.5 748024.8 -2059.7
stdev.s 9340.599222 8485.735165 9038.620667
ratio 0.012452729 0.01134419 0.012066704
$ time bash speedtest_parallel.sh ./stockfish_clang ./stockfish_clang_native 50 10
run sf_base sf_test diff
1 482089 479438 -2651
2 475337 463893 -11444
3 455283 466308 +11025
4 475370 463149 -12221
5 479778 470888 -8890
6 465646 473220 +7574
7 473976 474571 +595
8 468684 464385 -4299
9 465412 468351 +2939
10 461725 456763 -4962
sf_base = 470330 +/- 5229 (95%)
sf_test = 468096 +/- 4091 (95%)
diff = -2233 +/- 4834 (95%)
speedup = -0.475% +/- 1.028% (95%)
real 9m30.372s
user 18m57.413s
sys 0m2.675s
average 470330 468096.6 -2233.4
stdev.s 8436.678388 6601.562293 7799.628299
ratio 0.017937785 0.014102991 0.016622778
speedtest shows a slightly higher ratio between CPUs; the ratio is fairly steady for either bench or speedtest from depth 13 to depth 20.
| | i7 3770k | ryzen7 4880u | dual xeon e5-2680v3 | i7 3770k (ratio) | ryzen7 4880u (ratio) | dual xeon e5-2680v3 (ratio) |
|---|---|---|---|---|---|---|
| bench 13 | 638018 | 1140180 | 584345 | 1 | 1.787065569 | 0.915875414 |
| speedtest | 520496 | 976630 | 483306 | 1 | 1.876344871 | 0.928548923 |
| bench 20 | 816884 | 1435775 | 750084 | 1 | 1.757624094 | 0.918225844 |
| speedtest | 498221 | 943280 | 470330 | 1 | 1.893296348 | 0.944018819 |
> this will penalise SMP tests, where the actual nps will be higher than the one measured in this way. Could be solved by doing some SMP measurement for SMP tests.

Done.

> I have observed that on very large core workers the 1 second test might actually not be such a good measurement, as the system is spawning engines and only once everything is running the measurement becomes stable.

We can run bench a second time with a depth > 13, the code change is easy. We should compute the reference values with the new depth, though.

> This effectively changes the TC for the progression test, so will have some effect there. Maybe that's something to consider merging shortly after release (i.e. when we usually update the reference nps?).

To be discussed.
I think this looks good. By averaging the NPS over the parallel benches, the result is probably reliable also for big-core workers. The worker output (not sent to the server) could actually print min / max / average and std dev of the measured benches.
If at a later stage we see that this worker output is strange, we can always think about making it more robust (for example, the running worker process could run multiple benches in sequence until told to stop, stopping only once the speed measurement has become statistically meaningful).
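Printing those per-bench statistics could be as simple as the following sketch (hypothetical helper, not the actual worker code):

```python
import statistics

def bench_stats(nps_values):
    """Summarise a set of parallel bench measurements: min / max / mean /
    median, plus the standard deviation in absolute terms and as a
    percentage of the mean."""
    mean = statistics.mean(nps_values)
    std = statistics.stdev(nps_values) if len(nps_values) > 1 else 0.0
    return {
        "mean": mean,
        "median": statistics.median(nps_values),
        "min": min(nps_values),
        "max": max(nps_values),
        "std": std,
        "std_pct": 100.0 * std / mean,
    }
```

With one bench per core, a large `std_pct` would flag an unreliable measurement before it reaches the server.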
@vondele refactored to bench at different depths and print statistics; feel free to test with one of your workers by adding a test to DEV.
Dual Xeon e5-2680v3:
Click to view
- test at 8 threads
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_clang++_19.1.7_env_25b32ad706:
Concurrency : 6.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 5159.33
Mean nps : 500369.18
Median nps : 500238.70
Min nps : 496245.48
Max nps : 504488.76
Std nps : 3452.50
Std (%) : 0.69
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_clang++_19.1.7_env_25b32ad706:
Concurrency : 6.00
Threads : 8.00
Depth : 13.00
Mean nodes : 15418669.67
Mean time (ms) : 8460.00
Mean nps : 227595.71
Median nps : 226277.58
Min nps : 216669.57
Max nps : 238263.01
Std nps : 8448.50
Std (%) : 3.71
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_clang++_19.1.7_env_25b32ad706:
Concurrency : 6.00
Threads : 8.00
Depth : 15.00
Mean nodes : 30438644.67
Mean time (ms) : 17910.17
Mean nps : 212327.81
Median nps : 212579.27
Min nps : 202593.01
Max nps : 223973.59
Std nps : 8462.53
Std (%) : 3.99
- test at 1 thread
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_clang++_19.1.7_env_25b32ad706:
Concurrency : 48.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 21101.69
Mean nps : 122822.95
Median nps : 121753.71
Min nps : 109185.34
Max nps : 140994.54
Std nps : 7906.37
Std (%) : 6.44
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_clang++_19.1.7_env_25b32ad706:
Concurrency : 48.00
Threads : 1.00
Depth : 15.00
Mean nodes : 5618477.00
Mean time (ms) : 45923.19
Mean nps : 122798.45
Median nps : 120193.22
Min nps : 113150.28
Max nps : 145533.78
Std nps : 7847.12
Std (%) : 6.39
i7 3770k:
Click to view
- test at 8 threads
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_14.2.0_env_25b32ad706:
Concurrency : 1.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 4501.00
Mean nps : 573532.33
Median nps : 573532.33
Min nps : 573532.33
Max nps : 573532.33
Std nps : 0.00
Std (%) : 0.00
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_14.2.0_env_25b32ad706:
Concurrency : 1.00
Threads : 8.00
Depth : 13.00
Mean nodes : 16950655.00
Mean time (ms) : 5812.00
Mean nps : 364561.58
Median nps : 364561.58
Min nps : 364561.58
Max nps : 364561.58
Std nps : 0.00
Std (%) : 0.00
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_14.2.0_env_25b32ad706:
Concurrency : 1.00
Threads : 8.00
Depth : 15.00
Mean nodes : 34282925.00
Mean time (ms) : 11848.00
Mean nps : 361695.28
Median nps : 361695.28
Min nps : 361695.28
Max nps : 361695.28
Std nps : 0.00
Std (%) : 0.00
- test at 1 thread
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_14.2.0_env_25b32ad706:
Concurrency : 8.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 11712.25
Mean nps : 220411.17
Median nps : 220402.99
Min nps : 219102.78
Max nps : 222272.17
Std nps : 948.47
Std (%) : 0.43
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_14.2.0_env_25b32ad706:
Concurrency : 8.00
Threads : 1.00
Depth : 15.00
Mean nodes : 5618477.00
Mean time (ms) : 24276.38
Mean nps : 231442.92
Median nps : 231675.45
Min nps : 229344.31
Max nps : 232822.68
Std nps : 1132.35
Std (%) : 0.49
So, some experiment on a 4 socket system is attached below. One setup where we have 1 worker per socket having concurrency 70 ('the normal one'), and one where the 4 sockets is one worker with concurrency 280 (not used so far). I'm not so sure I fully understand the output.
The PR right now has code to get data and decide how to finalize the code.
Here your 70-core worker has played an SMP test with 8 threads
- first run (no SMP) to get the signature and to load the CPU for the bench
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 8.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 2899.50
Mean nps : 890339.43
Median nps : 889856.57
Min nps : 882553.50
Max nps : 897902.26
Std nps : 4962.26
Std (%) : 0.56
- optional second run to get the bench in SMP mode (skipped if threads = 1)
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 8.00
Threads : 8.00
Depth : 13.00
Mean nodes : 16462227.75
Mean time (ms) : 2180.88
Mean nps : 943494.35
Median nps : 943172.67
Min nps : 930369.25
Max nps : 968025.37
Std nps : 12214.14
Std (%) : 1.29
- third run to get the bench at depth 15 (to be compared with depth 13; depth 15 takes twice as long to complete)
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 8.00
Threads : 8.00
Depth : 15.00
Mean nodes : 35905318.88
Mean time (ms) : 4658.38
Mean nps : 956206.88
Median nps : 925305.11
Min nps : 897897.14
Max nps : 1133735.16
Std nps : 75539.71
Std (%) : 7.90
Here your 70-core worker has played a normal test with 1 thread
- first run (no SMP) to get the signature and to load the CPU for the bench
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 70.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 4644.51
Mean nps : 555904.28
Median nps : 556111.48
Min nps : 538479.14
Max nps : 567730.15
Std nps : 7257.03
Std (%) : 1.31
- third run to get the bench at depth 15 (to be compared with depth 13; depth 15 takes twice as long to complete)
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 70.00
Threads : 1.00
Depth : 15.00
Mean nodes : 5618477.00
Mean time (ms) : 10038.73
Mean nps : 559725.93
Median nps : 559525.67
Min nps : 547556.48
Max nps : 570751.42
Std nps : 5095.36
Std (%) : 0.91
Your worker at 70 cores completes the bench at depth 13 in 4.6 s with threads=1 and 2.2 s with threads=8; the relative std (to the average) is low with depth=13 and rises with depth=15 in SMP mode (as expected, the bench is not deterministic). The NPS at depth=13 and depth=15 have similar values.
Your worker at 70 cores right now, with the bench from fishtest master, is playing at 62% of the correct normalized TC in tests at threads=1.
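For context, the TC normalization in question scales the nominal time control by the ratio of a reference NPS to the worker's measured NPS, so an overestimated bench directly shortens the effective TC. A minimal sketch (not the exact fishtest code; names are illustrative):

```python
def scaled_tc(base_tc, measured_nps, reference_nps):
    """Adjust a base time control so a worker searches roughly the same
    number of nodes as the reference machine. A worker measuring half the
    reference NPS gets twice the time."""
    return base_tc * reference_nps / measured_nps
```

If the bench overestimates the real NPS (as master does on high-concurrency workers), `scaled_tc` comes out too short, which is the bias this PR addresses.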
The worker at 280 cores has a very bad Std (%)
- SMP test at threads=8
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 35.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 6386.57
Mean nps : 408897.36
Median nps : 392678.58
Min nps : 386910.82
Max nps : 696376.85
Std nps : 55421.42
Std (%) : 13.55
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 35.00
Threads : 8.00
Depth : 13.00
Mean nodes : 23262588.37
Mean time (ms) : 11246.20
Mean nps : 291680.08
Median nps : 192934.41
Min nps : 151146.84
Max nps : 891885.50
Std nps : 192665.80
Std (%) : 66.05
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 35.00
Threads : 8.00
Depth : 15.00
Mean nodes : 49991699.43
Mean time (ms) : 28419.77
Mean nps : 228077.28
Median nps : 193784.30
Min nps : 119811.77
Max nps : 486916.21
Std nps : 99587.98
Std (%) : 43.66
- normal test at threads=1
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 280.00
Threads : 1.00
Depth : 13.00
Mean nodes : 2581469.00
Mean time (ms) : 23777.35
Mean nps : 146096.75
Median nps : 78968.18
Min nps : 72978.51
Max nps : 316007.96
Std nps : 86195.98
Std (%) : 59.00
Statistic for stockfish_fa6c30af814fe91e6a6c2d1bcaa8d951e3724ae7_g++_13.3.0_env_25b32ad706:
Concurrency : 280.00
Threads : 1.00
Depth : 15.00
Mean nodes : 5618477.00
Mean time (ms) : 45364.72
Mean nps : 158492.64
Median nps : 116133.40
Min nps : 81762.55
Max nps : 326769.63
Std nps : 84617.85
Std (%) : 53.39
@Disservin @vondele in SMP tests depth=13 has a better Std (%) than depth=15, and in normal tests the Std (%) is low even with a worker at concurrency=70.
IMO the PR is ready; we have collected the information to assess the impact on the regression test, and it's up to you to decide when to merge (right now or after SF 17.1).
BTW before the introduction of fastchess, the laptop worker at concurrency=7 got a decent NPS bench estimation (statistically we benched 1 worker while other 9 were loading the CPU/memory). Now, at concurrency=70, the NPS bench is demonstrably wrong. From this point of view the switch to higher-concurrency workers disrupted the regression test and biased the normal tests; this PR should fix those issues, and we should merge it before the big workers join.
I think it still needs some testing to decide here.
What I'm assuming is that the OS has a bit of a problem loading 70 copies of a 100+ MB binary, and during that time we run a ±2 s bench. The situation during game play might be different, as the same binary is kept alive for a longer time. I'll play a bit locally to see if I can understand this better.
With concurrency = 70 and threads = 1 your worker takes 4.6 seconds for depth=13 and 10 seconds for depth=15; the NPS benches are very similar, both with low Std (%).
Right now, the PR runs 2 consecutive benches (the statistics are computed later), the first one to get the signature and to preload CPU/RAM/caches. For SMP, the first bench is run with concurrency * threads to preload properly all the silicon involved in the second bench.
Take your time to experiment with the PR, but please keep in mind that your workers at 70 cores have a very wrong bench (and adjusted TC) wrt your workers at 7 cores. Right now, your workers at 70 cores are biasing the framework.
First, afaict there is no real difference between the 7- and the 70-core workers; that's just a different binding of workers on the same socket. I did some tests with a modified SF to run multiple benches in a row, and there was no real effect, so the current method seems to measure things more or less as expected (at least on 1 socket). What we're seeing is, I assume, the slowdown from each concurrent process having its own copy of the net competing for the same cache. That's why the behavior is very different when loading with processes vs. with threads. This can be seen well in this test (which uses the python code in this test):
I actually think other workers will show a similar behavior. However, that's also how we currently test, so it does reflect reality.
I'm talking about fishtest with master. The difference is that with 10 workers@7, after a while, we are benching 1 worker while the other 9 are running tests competing for the caches, so the single worker is benched at the right side of your chart. Benching 1 worker@70, we are running only 2 SF processes, caches always free, so the bench is at the left side of your chart. From your chart and your previous runs, the TC is nearly 40% off.
OK, I see what you mean now, at startup the difference is not big, but once the other workers are running tests, it does make a difference.
OK, I think this is ready to go.
Triggered the workers update, thank you @Viren6 @jw1912 :)