taito/taito_f3_v.cpp: regain performance after major rewrite
addresses my own concerns about the speed regression that #11811 introduced relative to the previous implementation.
- switch AoS z buffers and per-pixel blend info to SoA
- allow vectorization of the line blending operation
- regain the empty line optimization by tracking tilemap row usage
- consolidate sprite framebuffers (we still pull from them multiple times, once for each sprite priority group)
- other minor wins from safe logic reorderings
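
To illustrate the first three points, here is a minimal sketch and not the code in this PR. Every name, size, and the 0..8 alpha range below is an assumption made for the example. It shows per-pixel fields stored as separate contiguous arrays (SoA), a branch-free blend loop that a compiler is free to vectorize, and a per-row usage flag that lets empty lines be skipped.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical line width; not taken from taito_f3_v.cpp.
constexpr std::size_t WIDTH = 320;

// AoS layout, for contrast: fields are interleaved per pixel, so a loop
// over a single field strides through memory.
struct PixelAoS { std::uint8_t src, dst, src_alpha, dst_alpha; };

// SoA layout: each field is its own contiguous array for the line.
struct LineSoA {
    std::array<std::uint8_t, WIDTH> src{}, dst{}, src_alpha{}, dst_alpha{}, out{};
};

// Blend one line. Alpha values are assumed to be 0..8 (8 = full intensity).
// The body has no data-dependent branches, only a saturating select.
inline void blend_line(LineSoA &l)
{
    for (std::size_t x = 0; x < WIDTH; x++) {
        const unsigned sum = (unsigned(l.src[x]) * l.src_alpha[x]
                            + unsigned(l.dst[x]) * l.dst_alpha[x]) >> 3;
        l.out[x] = std::uint8_t(sum > 255u ? 255u : sum);
    }
}

// Blend only the lines whose rows were flagged as used.
// Returns the number of lines actually blended.
inline std::size_t blend_frame(std::vector<LineSoA> &frame,
                               const std::vector<std::uint8_t> &row_used)
{
    assert(frame.size() == row_used.size());
    std::size_t blended = 0;
    for (std::size_t y = 0; y < frame.size(); y++) {
        if (!row_used[y])
            continue; // empty line: skip all per-pixel work
        blend_line(frame[y]);
        blended++;
    }
    return blended;
}
```

Whether a given compiler actually vectorizes `blend_line` depends on flags and target, so it is worth checking the generated code rather than assuming it.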
`-window -nomaximize -bench 240 <set>` (1 run per set)
Windows 11 / CI (Windows) / AMD Ryzen 7 7840HS
set | e967a70 pre-rewrite | 563b63fabf7a06c6dc94b48a1db2f8dba7292c15 rewrite | 55c60e5 this pr |
---|---|---|---|
ringrage | 606.71% | 533.30% | 630.98% |
arabianm | 685.52% | 574.87% | 695.64% |
ridingf | 635.20% | 520.65% | 608.64% |
gseeker | 683.36% | 618.98% | 743.95% |
commandw | 630.25% | 560.72% | 634.03% |
hthero93 | 717.49% | 588.87% | 710.68% |
scfinals | 694.80% | 587.99% | 720.88% |
trstar | 690.49% | 578.01% | 706.09% |
gunlock | 609.94% | 553.95% | 639.41% |
lightbr | 668.88% | 545.62% | 651.80% |
kaiserkn | 637.06% | 558.43% | 649.98% |
dariusg | 724.94% | 580.85% | 733.24% |
bubsymphj | 686.23% | 533.20% | 646.04% |
spcinvdj | 721.39% | 577.61% | 729.34% |
hthero95 | 667.58% | 559.33% | 681.61% |
qtheater | 692.09% | 535.16% | 660.61% |
elvactr | 738.29% | 611.27% | 736.44% |
spcinv95 | 670.14% | 561.78% | 655.50% |
twinqix | 721.19% | 581.23% | 699.54% |
tcobra2 | 639.72% | 544.45% | 607.79% |
bubblem | 616.16% | 570.59% | 661.85% |
cleopatr | 593.05% | 497.74% | 606.97% |
arkretrn | 599.81% | 525.94% | 599.10% |
kirameki | 698.47% | 570.16% | 673.45% |
puchicar | 585.04% | 511.76% | 591.27% |
popnpop | 598.05% | 514.02% | 606.06% |
landmakr | 735.18% | 578.79% | 700.83% |
Windows 10 / CI (Windows) / Intel Core i5-7300U
set | e967a70 pre-rewrite | 563b63fabf7a06c6dc94b48a1db2f8dba7292c15 rewrite | 55c60e5 this pr |
---|---|---|---|
ringrage | 289.16% | 248.27% | 295.81% |
arabianm | 313.46% | 272.34% | 333.73% |
ridingf | 285.56% | 226.03% | 263.36% |
gseeker | 299.63% | 283.37% | 337.42% |
commandw | 277.50% | 249.16% | 282.81% |
dariusg | 322.64% | 277.24% | 332.32% |
bubblem | 294.36% | 261.99% | 302.12% |
kirameki | 311.88% | 274.02% | 312.47% |
puchicar | 251.74% | 206.12% | 267.41% |
per-commit benchmark
`-window -nomaximize -sound none -bench 60 commandw` (3 runs per commit)
WSL 2.0.9.0 / AMD Ryzen 7 7840HS
commit | description | mean | std.dev. |
---|---|---|---|
072367deb59bbd361902e7cb3ddf006cea01d7bf | pre-fredyeye cleanup | 502.53% | 2.90% |
59ae6c160227e2ae7834edf415072a39a911009e | pre-rewrite | 537.51% | 0.71% |
563b63fabf7a06c6dc94b48a1db2f8dba7292c15 | f3 video rewrite | 466.17% | 1.42% |
593664483642a5261e9035301602d5112174cfaf | vas cleanup | 455.61% | 3.51% |
f91b896cda8343fc41f069b32b7ef527364bdea1 | [rebase point] | 467.18% | 2.77% |
e5e3bd8875d8b6ea87be1d7837d68802632d6d9e | SoA/blend vectorization* | 520.14% | 3.58% |
7dcaecd91d2b4843627d7ac58e776eba97f17c53 | AoSoA mistake fix* | 519.07% | 8.68% |
cbc92f34b83514f127122040d6c565dc2360f612 | merge sprite framebuffers | 526.05% | 0.84% |
734879ea3808ec67f8b9ecee80c493336e630345 | tilemap line usage* | 529.54% | 2.75% |
49e45bdf87f1e0bbc60bb24e90553622f04c0328 | mix_line ref params | 534.88% | 4.01% |
7241a37e768d692651328b5739763b3ef9aa8a4e | text line usage* | 544.14% | 2.16% |
7fce16a9741ad53ca5b2da6ba1bf9b7a911cc2f0 | fix extend+alt case | 535.00% | 5.65% |
8412c5127291bfac4f78fe3dcf78fe7cd829a6f5 | savestate correctness | 532.07% | 1.80% |
53d541de9bf4e9a392e4aa740b2ec24c4e2836e9 | strategic uint or layout jostling? | 541.07% | 3.48% |
\* validated in an -O1 build by callgrind cycle counting
i found commandw to be a good test case because it does heavy playfield and sprite scaling work in most scenes of its attract sequence, although it does have a completely blank 6-second boot.

as shown, most sets recover more unthrottled speed than was lost, and the ones that do not still recover most of it.
this system runs slower in general than many other arcade systems in MAME (the test ryzen here gets ~1700% on `ibara`, 4000% on `mrdo`), but we found that this is not actually due to graphics bottlenecks.
skipping all `screen_update` work, starting from this pr, gained only ~18% (about 100 unthrottled percentage points), while disabling the ensoniq subdevices gained ~180% (about +1000 points, to 1635% unthrottled); it has to emulate something like 3 or 4 processors in there, with synchronization.
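
For readers checking the arithmetic, a small sketch of how a gain in unthrottled percentage points maps to a relative increase. The baseline is not stated in the text above. The usage note assumes roughly 541% (the last row of the per-commit table), and the function name is made up for the example.

```cpp
#include <cassert>

// Relative speed increase, in percent, for a gain of `delta_points`
// unthrottled percentage points over a baseline of `baseline_points`.
inline double relative_increase_pct(double baseline_points, double delta_points)
{
    assert(baseline_points > 0.0);
    return 100.0 * delta_points / baseline_points;
}
```

With the assumed 541.07% baseline, a gain of 100 points works out to roughly 18%, and a gain of 1000 points to roughly 185%, in line with the ~18% and ~180% figures quoted above.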
pivot layer regression found in vertical games, marking as draft again.