perl5
Newsvuvnviv taint api speedup
Inspired by looking at bug https://github.com/Perl/perl5/issues/22653, and by past remarks that these 3 functions are super important, especially for enterprise serialization/deserialization/wire format decoding.
They are also sort of related to a very bad failed optimization (the MSVC compiler went to "-O0" and added 100s of KBs of redundant code in perl541.dll and some KBs more in XS DLLs), done 2-3 years ago in perl core. But I'm still working on a fix/diag/analysis/solution for that. This branch of commits is more about serialization/deserialization performance, and about taking unique advantage of the fact that IV, NV, and UV SVs need no malloc and are bodyless.
Plus, in one spot, my additional "bodyless" optimization from some years ago disappeared through code churn at https://github.com/Perl/perl5/commit/915544426781d184e3b057e63a20c089a32d3eba. I put it back, since bodyless SVs are very lightweight, and newSV_type() is very heavy, with many, many branches inside.
newSV_type() is very heavy
The intention was always that calls to newSV_type() with a static type argument would be inlined. (Hence, your bodiless optimisation would not have disappeared.) Are you finding that this is not the case? Because of -O0?
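A minimal sketch of the inlining argument above, not taken from the perl source: when an inline function that branches on a type is called with a compile-time constant, an optimizing compiler can fold the switch away and keep only the one relevant arm. All names here (`demo_type`, `demo_pv_body`, `demo_body_size`, `demo_new_iv_body_size`) are hypothetical, and whether the folding actually happens depends on the compiler and optimization level, which is the open question in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tags; the real svtype enum in perl differs. */
enum demo_type { DEMO_IV, DEMO_NV, DEMO_PV };

/* Hypothetical body layout for the one type that needs an allocation. */
struct demo_pv_body { char *ptr; size_t cur; size_t len; };

/* Generic routine with one branch per type. Called with a variable it
 * needs the whole switch; called with a constant the compiler may
 * reduce it to a single return value. */
static inline size_t demo_body_size(enum demo_type t)
{
    switch (t) {
    case DEMO_IV: return 0;                        /* bodyless in this sketch */
    case DEMO_NV: return 0;                        /* bodyless in this sketch */
    case DEMO_PV: return sizeof(struct demo_pv_body);
    }
    assert(!"unknown demo_type");
    return 0;
}

/* Constant argument: after inlining, this can compile down to "return 0". */
static size_t demo_new_iv_body_size(void)
{
    return demo_body_size(DEMO_IV);
}
```

If the compiler does not inline the call (for example at a low optimization level), every caller pays for the full switch, which is the cost a hand-written bodyless fast path avoids.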
Ah, I've seen there's some discussion on #p5p. I'll try to catch up on that tonight.
repushed branch, fixed -DNO_TAINT_SUPPORT build failure
This p.r. has repeatedly failed to build on one of our CI setups. Please see: https://github.com/Perl/perl5/actions/runs/11336336622/job/31611461445?pr=22662
fixed asserts for 32b ptr builds
macOS (Monterey) 12 (-Uusethreads) passed.
I did no changes except for moving a static assert that failed i386.
macOS (Monterey) 12 (-Uusethreads)
now
# Failed test 'write: stat and lstat returned same values'
# at t/stat.t line 44.
# Structures begin differing at:
# $got->[8] = '1729155292.99179'
# $expected->[8] = '1729155292.76969'
# Looks like you failed 1 test of 43.
../dist/Time-HiRes/t/stat.t ..........................................
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/43 subtests
is this a flip flop timing test that fails regularly?
8 atime last access time in seconds since the epoch
IIRC Win NT kernel API refuses to update access time any faster than 1 full second.
https://github.com/khwilliamson/perl5/commit/68c1b38c44c593a83207de9c96e39fd2d4ee463e
https://github.com/Perl/perl5/issues/19321
related bug tickets I found
for
# Structures begin differing at:
# $got->[8] = '1729155292.99179'
# $expected->[8] = '1729155292.76969'
Thinking about how to move this PR on:
A) I'm unclear as to whether MSVC uses -O1 rather than -O2 because it's faster for compiling the interpreter, or the run-time performance of the interpreter once built. Please could you clarify?
B) These changes would likely get through review much faster if the PR was split up into 3 separate PRs:
- Restoring the bodyless code as-was for the benefit of MSVC (newSV_type is inlined by gcc/clang)
- Changes to Perl_vnewSVpvf
- The changes to taint
Thinking about how to move this PR on: A) I'm unclear as to whether MSVC uses -O1 rather than -O2 because it's faster for compiling the interpreter, or the run-time performance of the interpreter once built. Please could you clarify?
Ancient-history P5P posts (Sarathy era/early JDB) say MSVC i386 -O1 is faster; 2-3 devs benchmarked the interp on private code. In the 2010s/2020s, I would leave modern/currently supported MSVC on -O1, since MSVC will inline and expand the worst possible code blocks in -O2, like unrolling all Perl_croak()s to Perl_vcroak(), or inlining the Perl_sv_magicext() loop into every XSUB inside libperl.dll (LTO visibility). In terms of x86 machine code, MSVC -O2 also writes every "mov dest_reg, 1-byte (8-bit) constant", aka "imm8", as a 4-byte constant with 3 useless null bytes. MSVC in -O1 correctly writes 1-byte operands; -O2 expands all constants to 4-byte operands.
The optimization logic there is questionable IMO. Perl is memory starved or branch-miss starved; Perl isn't FP/algebra/math/video-codec starved, and the interp never sits in a 100K-iteration loop on top of a fixed RW 256-1024 byte chunk of RAM. Perl's not an MPEG decoder. It doesn't need x86 conditional jumps to be aligned to cache-line multiples, with 25%-40% of all bytes inside the function being NOP CPU instructions. Sounds to me like MSVC had max R&D done targeting the Pentium 4 era, which is when they last redesigned/forklifted the -O2 subsystem. Pentium 4s, with very long pipelines/high latency, are long obsolete, and from Intel Core 1/2 to today, smaller is better.
I do plan to add -Oi (memFOO()/strFOO() intrinsics) to the -O1 MSVC perl in the near future. That feature is amazing, since "unaligned" libc memcmp(), memcpy(), or memset() calls with const/CC-frozen inputs of around 2-16 bytes, sometimes 24 bytes, will all optimize down to 1 CPU op with -Oi. I have had -Oi turned on ever since I got back into perl. It makes string parsing/sorting super fast and super tiny in machine code.
I also plan in the near future to add -GW, a >= VC2013-only feature with which the linker will now (decades too late) remove unreferenced stuff after the CC link phase. Specifically, MSVC with -GW will remove const or RO static structs/arrays if they are unreferenced by any other ISO C function or C data structure. I'm NOT talking about p = "when is your friend coming"; MSVC has correctly deduped or removed nameless/symbol-less double-quoted strings from day 1. But it did NOT ever optimize away static const char warning[] = {"the house is burning"}; even with absolutely no references to the symbol, "the house is burning" was guaranteed to show up as bloat in the final binary. -GW from 2013 finally fixed this.
I also need to experiment with ripping out 2KB of "profile guided optimization" metadata in my libperl.dll and XS DLLs. MS offers no way of removing that 2KB data structure except "do PGO with simulated workloads and recompile your binary with the results from the DB". That's the only official way to remove the PGO data. There are some undocumented command line switches floating around that I need to experiment with more, but I'm going off topic.
Summary: I think it should stay on -O1 but with a couple of rationally picked add-ons, maybe even -Os or -Ot modes, just NOT Florida Spring Break -O2 mode.
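To make the -Oi point concrete, here is a small sketch of the kind of call it targets: a memcmp() whose length is a compile-time constant of a few bytes. The helper `starts_with_tag` and the 4-byte tag "ABCD" are made up for illustration and do not come from the perl source. Whether a given compiler turns this into a single compare instruction depends on the compiler, target, and flags; the code only shows the shape of the input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: does the buffer start with a fixed 4-byte tag?
 * The memcmp() length is a constant known at compile time, which is
 * the case an intrinsic-enabled build can expand inline instead of
 * calling into libc. */
static int starts_with_tag(const char *buf, size_t len)
{
    static const char tag[4] = { 'A', 'B', 'C', 'D' };

    assert(buf != NULL);
    if (len < sizeof tag)
        return 0;                       /* too short to hold the tag */
    return memcmp(buf, tag, sizeof tag) == 0;
}
```

A wire-format decoder would typically make many such fixed-size checks per record, which is why the per-call overhead matters in that workload.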
P5P already has the macros in the source code to do selective!!!!! (DON'T Y'ALL GET IDEAS!!!!) "emergency" inlining on the MSVC platform with the current -O1 mode. I DO NOT WANT to see croak() unrolled/inlined all over the code base. So I'm strongly against -O2.
Sitting down with an IDE and some of my other tools and single stepping (or any other P5P person doing so) to find individual places to unroll/expand means a Perl dev, with his brain and background knowledge of what is run loop code and what is "arctic" panic/assert/overflow/"Bizarre copy of" code, can make rational decisions function by function about what is hot, what can be changed, and whether to add an ultra-inline decl tag or not. It has to be done by humans; MSVC's algorithms are too generic, and were designed for video codecs, drivers, or gaming engines that are very tiny in disk file space, with 99% FP SIMD math workloads. Not Perl's mundane ETL usage, which is almost nothing but compare/jump/move branching all day long.
B) These changes would likely get through review much faster if the PR was split up into 3 separate PRs:
I'll rebase them and split them into PRs. I suspect I'll be facing a wall of rebase conflicts if they go in separately, but I'll take that path anyway (3 more PRs).
I'm thinking of replacing some of the C switch trees (binary search) with U32 constants/imm32s of "logic" that used to be in that global table, using code like this:
if (MallocerBodyNoArenaFlag & (1 << SVtPVLV)) { return foo(); }
That is O(1), not the generic typical CC way of 10 x cmp_()/jump_above()/jump_below() to do it as a C switch.
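The bitmask idea can be sketched in a self-contained form as follows. Everything here is hypothetical: the `demo_svtype` enum, its values, and the choice of which types are in the mask are invented for illustration and do not reflect the real svtype numbering or the real per-type body details in perl. The sketch only shows that a 32-bit constant plus one shift and one AND answers the same question as a switch over the type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type numbering, for illustration only. */
enum demo_svtype {
    DEMO_NULL, DEMO_IV, DEMO_NV, DEMO_PV, DEMO_PVLV, DEMO_PVAV
};

#define DEMO_BIT(t) ((uint32_t)1 << (t))

/* One bit per type that (in this sketch) gets its body from malloc
 * rather than from an arena. The whole "table" is a single constant. */
static const uint32_t demo_no_arena_mask =
    DEMO_BIT(DEMO_PVLV) | DEMO_BIT(DEMO_PVAV);

/* Mask version: one shift, one AND, one test. */
static int demo_is_no_arena(enum demo_svtype t)
{
    assert((unsigned)t < 32);           /* type must fit in the 32-bit mask */
    return (demo_no_arena_mask & DEMO_BIT(t)) != 0;
}

/* Equivalent switch version, which a compiler may lower to a chain of
 * compares and jumps. */
static int demo_is_no_arena_switch(enum demo_svtype t)
{
    switch (t) {
    case DEMO_PVLV:
    case DEMO_PVAV:
        return 1;
    default:
        return 0;
    }
}
```

The mask form only works while the number of types stays at or below the width of the integer, which is why the sketch asserts on the range.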