pp_reverse - chunk-at-a-time string reversal
The performance characteristics of string reversal in blead are highly variable, depending on the capabilities of the C compiler. Some compilers are able to vectorize some cases for better performance.
This commit introduces explicit reversal and swapping of whole registers at a time, which all builds seem to be able to benefit from.
The `_swab_xx_` macros for doing this already exist in perl.h; using them for this purpose was inspired by https://dev.to/wunk/fast-array-reversal-with-simd-j3p. The bit shifting done by these macros should be portable and reasonably performant even if not optimised further, but it is likely that they will be optimised to bswap, rev, or movbe instructions.
Some performance comparisons:
1. Large string reversal, with different source & destination buffers
my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse $x }
gcc blead:
2,388.30 msec task-clock # 0.993 CPUs utilized
10,574,195,388 cycles # 4.427 GHz
61,520,672,268 instructions # 5.82 insn per cycle
10,255,049,869 branches # 4.294 G/sec
clang blead:
688.37 msec task-clock # 0.946 CPUs utilized
3,161,754,439 cycles # 4.593 GHz
8,986,420,860 instructions # 2.84 insn per cycle
324,734,391 branches # 471.745 M/sec
gcc patched:
408.39 msec task-clock # 0.936 CPUs utilized
1,617,273,653 cycles # 3.960 GHz
6,422,991,675 instructions # 3.97 insn per cycle
644,856,283 branches # 1.579 G/sec
clang patched:
397.61 msec task-clock # 0.924 CPUs utilized
1,655,838,316 cycles # 4.165 GHz
5,782,487,237 instructions # 3.49 insn per cycle
324,586,437 branches # 816.350 M/sec
2. Large string reversal, but reversing the buffer in-place
my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse "foo",$x }
gcc blead:
6,038.06 msec task-clock # 0.996 CPUs utilized
27,109,273,840 cycles # 4.490 GHz
41,987,097,139 instructions # 1.55 insn per cycle
5,211,350,347 branches # 863.083 M/sec
clang blead:
5,815.86 msec task-clock # 0.995 CPUs utilized
26,962,768,616 cycles # 4.636 GHz
47,111,208,664 instructions # 1.75 insn per cycle
5,211,117,921 branches # 896.018 M/sec
gcc patched:
1,003.49 msec task-clock # 0.999 CPUs utilized
4,298,242,624 cycles # 4.283 GHz
7,387,822,303 instructions # 1.72 insn per cycle
725,892,855 branches # 723.367 M/sec
clang patched:
970.78 msec task-clock # 0.973 CPUs utilized
4,436,489,695 cycles # 4.570 GHz
8,028,374,567 instructions # 1.81 insn per cycle
725,867,979 branches # 747.713 M/sec
3. Short string reversal, different source & destination (checking performance on smaller string reversals - note: this one's very variable due to noise)
my $x = "1234567"; my $y; for (0..10_000_000) { $y = reverse $x }
gcc blead:
401.20 msec task-clock # 0.916 CPUs utilized
1,672,263,966 cycles # 4.168 GHz
5,564,078,603 instructions # 3.33 insn per cycle
1,250,983,219 branches # 3.118 G/sec
clang blead:
380.58 msec task-clock # 0.998 CPUs utilized
1,615,634,265 cycles # 4.245 GHz
5,583,854,366 instructions # 3.46 insn per cycle
1,300,935,443 branches # 3.418 G/sec
gcc patched:
381.62 msec task-clock # 0.999 CPUs utilized
1,566,807,988 cycles # 4.106 GHz
5,474,069,670 instructions # 3.49 insn per cycle
1,240,983,221 branches # 3.252 G/sec
clang patched:
346.21 msec task-clock # 0.999 CPUs utilized
1,600,780,787 cycles # 4.624 GHz
5,493,773,623 instructions # 3.43 insn per cycle
1,270,915,076 branches # 3.671 G/sec
- This set of changes requires a perldelta entry, and one will be added after the 5.42 release.
I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.
Thanks, I'll look into wrapping this in architecture-based preprocessor guards.
My PR at https://github.com/Perl/perl5/pull/23330 by coincidence is trying to ADD THE INTRINSICS to the Perl C API that THIS PR WANTS to use. But the ticket has stalled.
50% of the reason there is no progress is because input or a vote is needed on the best name/identifier to use for a new macro.
The other 50% of the reason is claims that Perl 5 has no valid production-code reason to need to swap ASCII/binary bytes in the C-level VM at runtime.
I don't actually think we need `ntohll` and `htonll`.
Originally posted by @Leont in https://github.com/Perl/perl5/pull/23330#discussion_r2107692323
I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.
Perl has had the ./Configure probe for hardware unaligned-memory support since 2001, but there are no P5P tuits to actually use the #define.
my long write-up on 25 years of apathy: https://github.com/Perl/perl5/issues/22886
some random related links https://github.com/Perl/perl5/issues/16680 https://github.com/Perl/perl5/commit/e8864dba80952684bf3afe83438d4eee0c3939a9
https://www.nntp.perl.org/group/perl.perl5.porters/2010/03/msg157755.html
I've been complaining about this since 2015.
https://www.nntp.perl.org/group/perl.perl5.porters/2015/11/msg232849.html
https://github.com/Perl/perl5/issues/12565 my PRs to add atomics and UA support got rejected, for what? To protect non-existent hardware that was melted down in Asia as e-waste 10-15 years ago. The needs of air-gapped Perl users are irrelevant. They are air-gapped and will never upgrade.
Who here can take a selfie of themselves in 2025 with a post-2010-manufactured server or desktop that DOES NOT have HW UA mem read support?
Alpha in 1999 already had hardware-supported UA mem reads, as long as you used the appropriate CC type decls/declspecs/attributes to pick the "special" unaligned instruction.
I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.
Which ones? I'm requesting .pdf links to the official Arch ISA specs that you are discussing.
SPARC and earlier ARMv5 or so? Anyway, just write it idiomatically with memcpy and let the compiler notice it (or use builtins).
It also introduces more aliasing issues (e.g. src accessed with both U64* and U32*) which is a shame, as it brings Perl further away from dropping -fno-strict-aliasing.
I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.
Thanks, I'll look into wrapping this in architecture-based preprocessor guards.
Even this isn't a great idea, because while the implementation is free to use unaligned access, that doesn't mean it's okay in C. If it vectorises it, it may assume aligned access and then trap. That's happened a few times recently (subversion, numpy).
Unaligned access is undefined behaviour in C.
This can be a problem in practice even on x86, when the compiler decides to use aligned SSE/AVX instructions, as the author of this blog post found out.
The usual workaround is to use memcpy. For some algorithms (probably not this one) it's also possible to skip the first few bytes of input until the pointer is aligned.
Although it should be noted that memcpy doesn't help with the "unaligned reads are implemented in software and super slow" problem. I don't know which architectures suffer from this.
SPARC and earlier ARMv5 or so? Anyway, just write it idiomatically with memcpy and let the compiler notice it (or use builtins).
Wrong. After 15 minutes of trying I couldn't find unaligned HW mem R/W in SPARC V7 Sept 1988 ISA.
But I did find this.
SPARC V9 ISA May 2002
If P5P voted to remove Windows 2000/XP/Server 2003 from Perl, that rule also applies to all same-vintage Unix boxes from the National Museum of Technology and Science of whatever country you live in.
ARMv5, or what I'd personally call an "ARMv4 CPU", is an obsolete e-waste/museum CPU. iPhone 1 and Android 1.0 introduced the ARMv7 ISA as the minimum platform. The last commercial use of the ARMv4 ISA was the ubiquitous ARM7TDMI CPU inside LG, Nokia, Motorola, and Samsung 2G and 3G flip phones, and that ARM7TDMI was what ran your J2ME flip-phone apps. WinCE Perl was compiled as ARMv4 by default; I once set the MSVC command-line arg to ARMv5 Thumb (2-byte instruction format) as an experiment, but perl5XX.dll grew some KB in disk size, and I decided never to play with ARMv5 Thumb for Perl again. Maybe >= 2010s ARM32 v5 Thumb compilers like Clang have the intelligence to compile each C function as 4-byte legacy ARMv4 or 2-byte ARMv5 Thumb and let the linker pick the smaller machine code at link time, but I've never heard of my wishlist C compiler feature actually existing in any production big-name C compiler. In any case, ARM32 Thumb is EOL/obsolete in 2025; the 2-byte instruction format was removed by ARM LLC in the ARM32-to-ARM64 switch.
I've researched and created screenshots for the ancient Alpha and ancient SPARC ISAs so far; my ARM info above is from memory, not fresh research. I don't remember if ARMv4 has dedicated 1-op HW unaligned memory reads/writes, or some magical and fast-enough multi-opcode unaligned mem read/write asm recipe that is better and faster than `U32 u32var = p[0] | p[1] << 8 | p[2] << 16 | p[3] << 24;`. The recipe is something like `U32 u32var = (U32)( *((U64*)(ptr & ~0x7)) << ROLL() << ((ptr & 0x7) * 8) );`, but I don't care and Perl doesn't care; that's a CC domain problem, not a Perl-in-C problem.
In any case, the 2 platforms @thesamesam mentioned have some non-ISO-C-abstract-machine, CC-vendor-specific, "fast enough" C grammar token to do unaligned mem R/Ws. Someone just has to find it in that C compiler's man page.
FSF's and GNU's proprietary, CC-vendor-specific, "fast enough" C grammar token for unaligned mem R/Ws has the ASCII string name of memcpy. But this is proprietary to the GPL 3 repo published by GNU and isn't blindly, universally portable to all other C compilers.
MSVC instead wants you to use `*(( __unaligned unsigned __int64 *)p)` on WinCE for ARM32 and ancient WinNT IA64/Alpha/SPARC/MIPS. `*(( __unaligned unsigned __int64 *)p)` is a syntax error in a .i file for MSVC x64 mode, MSVC WinNT ARM32 mode, and MSVC WinNT ARM64 mode. MinGW GCC's and MSVC's .h headers instantly #define __unaligned to nothing. MSVC i386 mode ignores __unaligned in a .i file but doesn't syntax-error like MSVC for x64 / NT ARM32 / NT ARM64. I researched all of this but never tried to prove it as true. It's irrelevant for me, my private C/C++ code, or Perl C code. I and everyone should just write the __unaligned if MSVC is detected, even if it does nothing nowadays and MS's std headers NOOP it away. Maybe one day MS will reintroduce __unaligned as best practice or mandatory. Who knows; it's harmless to write it in the 2000s-2025 time period.
If we are arguing about ISO C abstract-virtual-machine CPU compliance, the C compiler published by GNU/FSF is INVALID and non-conformant to ISO C.
Easy demonstration: add a `char * memcpy(char* d, const char * s, size_t n) {}` to perl's util.c that logs every single call to STDERR. If any executions of memcpy() in Perl's C code do not appear in my shell console, the C compiler is NON-CONFORMANT and INVALID and should be promptly blocked and blacklisted in perl's ./Configure.
Perl's only project scope is to run on modern enough hardware, not on the https://en.wikipedia.org/wiki/AN/FSQ-7_Combat_Direction_Central OS or ISO C Abstract Virtual OS or https://en.wikipedia.org/wiki/Apple_II
Wrong. After 15 minutes of trying I couldn't find unaligned HW mem R/W in SPARC V7 Sept 1988 ISA.
I was saying that SPARC aggressively traps on unaligned access, not that it has instructions for that.
Wrong. After 15 minutes of trying I couldn't find unaligned HW mem R/W in SPARC V7 Sept 1988 ISA.
I was saying that SPARC aggressively traps on unaligned access, not that it has instructions for that.
Yeah, RISC CPUs are expected to, and it's normal for them to, throw an exception/signal each time you do a vanilla C machine-type >1-byte mem read/write on an unaligned addr. My argument is that all CCs have some magic special token/grammar way to safely and quickly do unaligned U16/U32/U64 mem reads and writes. So the end-user dev just has to write those special magical tokens/grammar things as needed in C src code and the issue goes away. `= ( [] << | [] << | [] << | [] )` is the worst possible but still acceptable solution to use in C to deal with UA memory reads/writes of >1-byte data values.
I vaguely remember from ARMv4/WinCE Perl: if someone does an unaligned U16 or U32 mem read on ARMv4, you get a free shift and a free mask as a gift from ARMv4. You don't get a SEGV on ARMv4 from UA mem reads. I think the reason (not going to google it) was that this was how ARMv4 wanted you to do UA mem reads.
```c
U32 u32_var = *(U32*) ua_32p
            | *((U32*) (((char*) ua_32p)+3) );
```
Not safely and correctly SEGVing like I assumed WinCE Perl on ARM would, but instead finding a junk integer in my MSVC watch window while step-debugging in the VS IDE, was a learning experience for me.
Although it should be noted that memcpy doesn't help with the "unaligned reads are implemented in software and super slow" problem. I don't know which architectures suffer from this.
@bulk88 makes a note to self to finally turn on the intrinsic memcpy(d, s, n <= 16) optimization for all MSVC compilers. @bulk88 doesn't know why he didn't make that PR on day 1 of his WinPerl-in-C hobby, since all his day-job employer's binaries made with MSVC have it turned on.
Although it should be noted that memcpy doesn't help with the "unaligned reads are implemented in software and super slow" problem. I don't know which architectures suffer from this.
A bunch of links on the topic. Since C-language UA mem reads/writes are a day-1 Win32 platform requirement, all RISC WinNT kernels ever compiled will trap, emulate, and resume on the CPU's SIGILL/SIGBUS. I can't find the article on Google, but it was a 100x or 1000x wall-time penalty for the NT kernel to emulate UA mem reads, vs the C compiler emitting 3-6 more CPU opcodes in the binary when the dev declares a UA C machine type.
A bunch of links, the short summary is, go read your C compiler's docs on how to write your C code correctly using VENDOR SPECIFIC C syntax.
https://devblogs.microsoft.com/oldnewthing/20170814-00/?p=96806 https://devblogs.microsoft.com/oldnewthing/20220810-00/?p=106958 https://devblogs.microsoft.com/oldnewthing/20180409-00/?p=98465 https://devblogs.microsoft.com/oldnewthing/20210611-00/?p=105299 https://devblogs.microsoft.com/oldnewthing/20190821-00/?p=102794 https://devblogs.microsoft.com/oldnewthing/20200103-00/?p=103290 https://devblogs.microsoft.com/oldnewthing/20250605-00/?p=111250 https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/avoiding-misalignment-of-fixed-precision-data-types https://learn.microsoft.com/en-us/windows/win32/winprog64/fault-alignments https://wiki.debian.org/ArmEabiFixes#word_accesses_must_be_aligned_to_a_multiple_of_their_size https://web.archive.org/web/20090204204507/http://lecs.cs.ucla.edu/wiki/index.php/XScale_alignment https://devblogs.microsoft.com/oldnewthing/20040830-00/?p=38013
FWIW, here is a list of instructions my Win 7 x64 kernel is capable of emulating. IDK why this emulation code or this switch ladder exists in my Win 7 x64 kernel; I don't have a reason to need to know this, and remember this is closed-source SW. But there must be bizarre, very rare cases where, for whatever reason, the Win 7 x64 kernel needs to silently fix up the CPU hardware fault/signal/interrupt/exception/event handler, whatever you call it.
Update: this instruction list may or may not have something to do with an x64 CPU executing temporarily in 16-bit real mode, probably for WinNT's Win16 emulator ntvdm.exe. Good enough an answer for my curiosity to be satisfied.
I will also add that Perl/P5P devs might be able, in C, on certain permutations of CC, OS, and CPU, to outsmart a C compiler's native UA mem read/write "best practices" algorithm. The engineering question is: is it legal to use 2 aligned U32 mem reads to read an unaligned U32?
What if a unknown unnamed CPU has a hardware MMU virtual memory page size of 1 byte?
You can get one of these CPUs with a 1 byte page size MMU right now for free at https://github.com/google/sanitizers/wiki/addresssanitizer or https://en.wikipedia.org/wiki/Valgrind
In real life, as long as you aren't over-reading at the very end of a malloc block at address 4096-1 or 4096-2, 2 aligned U32 reads are harmless. malloc() won't hand out a 4- or 5-byte memory block pressed against a 4096 boundary when you execute malloc(5).
IDK of a real-world malloc() impl that hands out raw 4096-aligned mem pages for requests >= 4096, with an unmapped mem page at `malloc(0x1000)-1`.
But someone should give the real reverse-engineered meaning of this OSX API doc; read the section "Allocating Large Memory Blocks using Malloc" at https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/MemoryAlloc.html#//apple_ref/doc/uid/20001881-99765
But regarding the perl C API, remember that a fictional malloc(1) pressed against a 4096 boundary is illegal and impossible in the Perl C API. A fictional malloc(1) would mean that SvPVX(sv)[0], for an sv with SvCUR(sv) == 0 || SvCUR(sv) == 1, is not '\0'-terminated, and strlen(SvPVX(sv)) is 100% guaranteed to SEGV. Perl could never execute on such an architecture or such a libc impl.
Here is a reason why memcpy() and memcmp() can't blindly be assumed to be a magical UA-memory compliance tool, and a C dev STILL needs to read their CC's vendor-specific docs no matter what.
Here is a list of all call sites to the memcmp() symbol in the UCRT DLL from my -O1 MSVC blead perl.
A couple of calls in 1 func are fine, but anything over 5 calls points to big problems with P5P macros and C code. Example of 1 func:
S_handle_possible_posix+1013
S_handle_possible_posix+1049
S_handle_possible_posix+107B
S_handle_possible_posix+10A8
S_handle_possible_posix+10D5
S_handle_possible_posix+1102
S_handle_possible_posix+112F
S_handle_possible_posix+1155
S_handle_possible_posix+1176
S_handle_possible_posix+1199
S_handle_possible_posix+F3C
S_handle_possible_posix+FB5
S_handle_possible_posix+FD0
S_handle_possible_posix+FEB
I know someone assumed GCC's inline memcmp() optimization is part of the ISO C TC's reference unit tests for C89. It is not.
full list of all call sites to memcmp() is hidden below
It also introduces more aliasing issues (e.g. `src` accessed with both `U64*` and `U32*`) which is a shame, as it brings Perl further away from dropping `-fno-strict-aliasing`.
I don't believe it's possible to write production-grade C (not C++) SW without -fno-strict-aliasing. You would be removing the (type_t) cast operator from C89/C99's grammar to write non-crashing strict-aliasing-clean code. Perl isn't a .o/TU file holding the hottest guts of an AES/MPEG/JPG codec, where strict aliasing could make a benchmarkable improvement.
Either don't use GCC's -O3 / -O4 flags (-O2? -O1?) or understand you have to live with adding the -fno-strict-aliasing flag. JAPHs, golfs, and educational demos and samples aren't production-grade C SW.
If you remove the (type_t) and & operators from C, you now have the JavaScript/WASM/ECMAScript/V8 virtual machine, and those aren't C, even if an AES decryption lib written in JS/Node.js happens to benchmark identically to a GNU C AES decryption lib on 1 particular build number of the Chrome/Node engine.
I'll also add that byte-order swapping, more specifically swapping the byte order of a 64-bit variable aka htonll()/ntohll(), was a patented technology until 2022, and perl has been doing some low-level SW piracy for decades at
https://github.com/Perl/perl5/blob/blead/perl.h#L4601
see this patent by Intel and compare it to Perl's math operators above
https://patents.google.com/patent/US20040010676A1/en
cough cough pro- or anti-patent troll cough cough
MSVC fix branch pushed at https://github.com/Perl/perl5/pull/23374
The reason for all the MSVC-specific code: here is the very poor code gen from round 1, after getting the syntax errors fixed.
48 8B D1 mov rdx, rcx
4C 8B C7 mov r8, rdi ; Size
48 8D 4D F7 lea rcx, [rbp+57h+var_60] ; Dst
FF 15 8E 85 09 00 call cs:__imp_memcpy
48 8B 55 DF mov rdx, [rbp+57h+Src] ; Src
48 8D 4D EF lea rcx, [rbp+57h+Dst] ; Dst
4C 8B C7 mov r8, rdi ; Size
FF 15 7D 85 09 00 call cs:__imp_memcpy
48 8B 55 F7 mov rdx, [rbp+57h+var_60]
41 B9 00 FF 00 00 mov r9d, 0FF00h
4C 8B C2 mov r8, rdx
48 8B C2 mov rax, rdx
48 C1 E8 10 shr rax, 10h
4D 23 C7 and r8, r15
4C 0B C0 or r8, rax
48 8B CA mov rcx, rdx
49 C1 E8 10 shr r8, 10h
48 8B C2 mov rax, rdx
48 23 C3 and rax, rbx
48 C1 E1 10 shl rcx, 10h
4C 0B C0 or r8, rax
41 BA 00 00 FF 00 mov r10d, 0FF0000h
49 C1 E8 10 shr r8, 10h
48 8B C2 mov rax, rdx
49 23 C5 and rax, r13
4C 0B C0 or r8, rax
48 8B C2 mov rax, rdx
49 23 C1 and rax, r9
49 C1 E8 08 shr r8, 8
48 0B C8 or rcx, rax
48 8B C2 mov rax, rdx
48 C1 E1 10 shl rcx, 10h
49 23 C2 and rax, r10
48 0B C8 or rcx, rax
49 23 D6 and rdx, r14
48 C1 E1 10 shl rcx, 10h
48 0B CA or rcx, rdx
48 8B 55 EF mov rdx, [rbp+57h+Dst]
48 C1 E1 08 shl rcx, 8
48 8B C2 mov rax, rdx
4C 0B C1 or r8, rcx
48 C1 E8 10 shr rax, 10h
4C 89 45 F7 mov [rbp+57h+var_60], r8
48 8B CA mov rcx, rdx
48 C1 E1 10 shl rcx, 10h
4C 8B C2 mov r8, rdx
4D 23 C7 and r8, r15
4C 0B C0 or r8, rax
48 8B C2 mov rax, rdx
48 23 C3 and rax, rbx
49 C1 E8 10 shr r8, 10h
4C 0B C0 or r8, rax
48 8B C2 mov rax, rdx
49 23 C5 and rax, r13
49 C1 E8 10 shr r8, 10h
4C 0B C0 or r8, rax
48 8B C2 mov rax, rdx
49 23 C1 and rax, r9
49 C1 E8 08 shr r8, 8
48 0B C8 or rcx, rax
48 8B C2 mov rax, rdx
48 C1 E1 10 shl rcx, 10h
49 23 D6 and rdx, r14
49 23 C2 and rax, r10
48 0B C8 or rcx, rax
48 C1 E1 10 shl rcx, 10h
48 0B CA or rcx, rdx
48 8D 55 EF lea rdx, [rbp+57h+Dst] ; Src
48 C1 E1 08 shl rcx, 8
4C 0B C1 or r8, rcx
48 8B 4D CF mov rcx, [rbp+57h+e] ; Dst
4C 89 45 EF mov [rbp+57h+Dst], r8
4C 8B C7 mov r8, rdi ; Size
FF 15 98 84 09 00 call cs:__imp_memcpy
48 8B 4D DF mov rcx, [rbp+57h+Src] ; Dst
48 8D 55 F7 lea rdx, [rbp+57h+var_60] ; Src
4C 8B C7 mov r8, rdi ; Size
FF 15 87 84 09 00 call cs:__imp_memcpy
48 01 7D DF add [rbp+57h+Src], rdi
48 03 F7 add rsi, rdi
4C 2B E7 sub r12, rdi
48 8B 4D CF mov rcx, [rbp+57h+e]
49 8B C4 mov rax, r12
48 2B CF sub rcx, rdi
48 2B C6 sub rax, rsi
48 89 4D CF mov [rbp+57h+e], rcx
48 83 F8 10 cmp rax, 10h
0F 83 C4 FE FF FF jnb loc_140090DA2
Here is round 2, after turning on inline memcpy; still bad code gen:
4D 8B 0C 13 mov r9, [r11+rdx]
49 83 EA 08 sub r10, 8
49 8B 14 0B mov rdx, [r11+rcx]
48 83 C7 08 add rdi, 8
4C 8B C2 mov r8, rdx
48 8B C2 mov rax, rdx
48 C1 E8 10 shr rax, 10h
48 8B CA mov rcx, rdx
48 C1 E1 10 shl rcx, 10h
4C 23 C3 and r8, rbx
4C 0B C0 or r8, rax
48 8B C2 mov rax, rdx
49 23 C6 and rax, r14
49 C1 E8 10 shr r8, 10h
4C 0B C0 or r8, rax
48 8B C2 mov rax, rdx
48 23 C6 and rax, rsi
49 C1 E8 10 shr r8, 10h
4C 0B C0 or r8, rax
48 8B C2 mov rax, rdx
49 23 C4 and rax, r12
49 C1 E8 08 shr r8, 8
48 0B C8 or rcx, rax
48 8B C2 mov rax, rdx
49 23 C5 and rax, r13
48 C1 E1 10 shl rcx, 10h
48 0B C8 or rcx, rax
49 23 D7 and rdx, r15
48 8B 45 58 mov rax, [rbp+dsv]
48 C1 E1 10 shl rcx, 10h
48 0B CA or rcx, rdx
49 8B D1 mov rdx, r9
48 C1 E1 08 shl rcx, 8
48 23 D3 and rdx, rbx
4C 0B C1 or r8, rcx
49 8B C9 mov rcx, r9
4C 89 00 mov [rax], r8
49 8B C1 mov rax, r9
48 C1 E8 10 shr rax, 10h
48 0B D0 or rdx, rax
48 C1 E1 10 shl rcx, 10h
48 C1 EA 10 shr rdx, 10h
49 8B C1 mov rax, r9
49 23 C6 and rax, r14
48 0B D0 or rdx, rax
49 8B C1 mov rax, r9
48 23 C6 and rax, rsi
48 C1 EA 10 shr rdx, 10h
48 0B D0 or rdx, rax
49 8B C1 mov rax, r9
49 23 C4 and rax, r12
48 C1 EA 08 shr rdx, 8
48 0B C8 or rcx, rax
49 8B C1 mov rax, r9
48 C1 E1 10 shl rcx, 10h
49 23 C5 and rax, r13
48 0B C8 or rcx, rax
4D 23 CF and r9, r15
48 C1 E1 10 shl rcx, 10h
49 8B C2 mov rax, r10
49 0B C9 or rcx, r9
48 2B C7 sub rax, rdi
48 C1 E1 08 shl rcx, 8
48 0B D1 or rdx, rcx
48 8B 4D 60 mov rcx, [rbp+arg_18]
48 89 11 mov [rcx], rdx
48 83 C1 08 add rcx, 8
48 8B 55 58 mov rdx, [rbp+dsv]
48 83 EA 08 sub rdx, 8
48 89 4D 60 mov [rbp+arg_18], rcx
48 89 55 58 mov [rbp+dsv], rdx
48 83 F8 10 cmp rax, 10h
0F 83 06 FF FF FF jnb loc_14009093C
After all of my tweaks it looks perfect, identical to whatever GCC and Clang would emit:
4B 8B 04 08 mov rax, [r8+r9]
48 83 EA 08 sub rdx, 8
4B 8B 0C 10 mov rcx, [r8+r10]
48 83 C7 08 add rdi, 8
48 0F C8 bswap rax
49 89 02 mov [r10], rax
4D 8D 52 F8 lea r10, [r10-8]
48 8B C2 mov rax, rdx
48 2B C7 sub rax, rdi
48 0F C9 bswap rcx
49 89 09 mov [r9], rcx
4D 8D 49 08 lea r9, [r9+8]
48 83 F8 10 cmp rax, 10h
BEFORE 5.41.7
C:\sources\perl5>timeit C:\pb64\bin\perl.exe -e "my $x = \"X\"x(1024*1000*10);
my $y; for (0..1_000) { $y = reverse \"foo\",$x }"
Exit code : 0
Elapsed time : 10.20
Kernel time : 2.76 (27.1%)
User time : 7.33 (71.9%)
page fault # : 2511948
Working set : 35628 KB
Paged pool : 92 KB
Non-paged pool : 7 KB
Page file size : 31732 KB
AFTER
C:\sources\perl5>timeit perl -Ilib -e "my $x = \"X\"x(1024*1000*10); my $y; for
(0..1_000) { $y = reverse \"foo\",$x }"
Exit code : 0
Elapsed time : 5.64
Kernel time : 2.96 (52.6%)
User time : 2.67 (47.3%)
page fault # : 2511998
Working set : 35808 KB
Paged pool : 94 KB
Non-paged pool : 8 KB
Page file size : 31752 KB
My PR at #23330 by coincidence is trying to ADD THE INTRINSICS to the Perl C API that THIS PR WANTS to use. But the ticket has stalled.
50% of the reason there is no progress is because input or a vote is needed on the best name/identifier to use for a new macro.
The other 50% of the reason is claims that Perl 5 has no valid production-code reason to need to swap ASCII/binary bytes in the C-level VM at runtime.
I don't actually think we need `ntohll` and `htonll`.
Originally posted by @Leont in #23330 (comment)
I'll reiterate: this PR as written, plus my MSVC patch on top of the Richard branch that makes this PR pass all CI smoke tests, is still an unclean design. Perl_pp_reverse() shouldn't be keeping platform/CC/OS secrets inside it on how to do a fast, efficient, CPU-native htonll(). There needs to be a centralized, formal Perl C platform API for this, as described in https://github.com/Perl/perl5/pull/23330.
After I reproduced on Win64 the huge performance boost Richard showed in the OP on Linux, this optimization is absolutely needed. I very often use PP reverse to write PP code that de-dupes groups of '\0'-terminated C strings, so shorter C strings that are perfect suffixes of longer C strings are de-duped into the longer C strings. Example: the short C string "QUERY_METADATA_HANDLE" becomes "EXTRA_QUERY_METADATA_HANDLE"+5 in the src code.
Just trying to understand where this PR is at now:
- I think the two wide-reverse sections can (probably) be merged with a bit of refactoring.
- I should use the `my_swapxx` functions instead of `_swab_xx_`?
- Windows support would have to wait for #23330?
- Is the current state of this PR non-portable anywhere else that we are aware of?
I can't speak for Windows but the state of the code with https://github.com/Perl/perl5/pull/23374 on top looks right to me.
Just trying to understand where this PR is at now:
- I think the two wide-reverse sections can (probably) be merged with a bit of refactoring.
No opinion. The 2 x builtin_bswapU32() #ifdef branch probably needs to stay. Remember 64-bit IVs on i386 and FooCPU32 are CC-emulated. There is a 95% chance a FooCPU32 arch won't have a native HW 64-bit byteswap CPU opcode. RIP AMD x32 and Alpha 32 archs probably have builtin_bswapU64(). IDK/IDC what GCC's and Clang's policy is for SW-emulating builtin_bswapU64() on 32-bit-pointer CPU archs.
Plus IDK how good or bad GCC's and Clang's SW emulation of builtin_bswapU64() is on 32-bit-pointer CPU archs. I know modern blead perl requires a U64 type on all build permutations, but that P5P requirement says absolutely nothing about the fast/slow-ness of FooCC's impl of the U64 type on 32-bit-pointer CPU archs.
- I should use the `my_swapxx` functions instead of `_swab_xx_`?
Yes, use whatever P5P abstraction ./Configure/perl.h/handy.h/util.c/perl.c/inline.h wants you to use. Don't DIY something new just for pp_reverse that nothing else in the Perl ecosystem will ever use or know about.
- Windows support would have to wait for Win32: htonl/htons/ntohl/ntohs change slow winsock exports -> 1 CPU op/ins #23330 ?
Meh.
It is a cleaner git history, less repo noise, and less git-blame noise if the above PR with a centralized byte-swapping API is pushed to blead before this PR is. I don't personally care in what order this PR vs #23330 ("Win32: htonl/htons/ntohl/ntohs change slow winsock exports -> 1 CPU op/ins") is approved and applied to blead. If #23330 is stuck in bike-shedding/yak-shaving but this Richard PR is dead quiet, this Richard PR should go first.
It is not bulk88's problem to manage logistics; I am not PSC/the pumpking.
- Is the current state of this PR non-portable anywhere else that we are aware of?
GCC project authors do not currently have a monopoly on the ISO C TC, and will never obtain that monopoly post ~2015/~2020; they can't undo their GPL 3 change. But in various statements (I'm not hunting for links ATM), GCC project authors have said memcpy() is the universal serializing operator for ISO C.
Some senior GCC dev said in a WWW post/article that GCC's policy is that memcpy() is the direct equivalent of .toString() or .toJSON(), and GCC's position is that any attempt to use ISO C's union grammar token, or to use a typedef with a #pragma packed(1) and a union, on GCC or any CC, is against GCC's official policy/position statement.
IDK, and I am not going to research, what the other C compiler projects' technical policy/position statements are. I'm not a lawyer collecting discovery, depositions, and exhibits for trial.
Read my "attempting to use a typedef with a #pragma packed(1) and a union" sentence very carefully. Some ISO C/ANSI/IETF/IEEE/FAANG-related humans say that is correct and the GCC project is blatantly wrong.
Assuming memcpy() is ISO C's synonym for fast unaligned mem reads/writes in asm is highly illegal!!!
The reason is that inlining libc.so's external-linkage memcpy() symbol violates all of the /bin/ld rules and all of the LD_PRELOAD/ELF symbol interposition rules.
Real-world dangers: what will these compilers do when blead Perl does a memcpy(d, s, 4);?
- TinyCC
- HP-UX's aCC
- Solaris, whatever CC it uses
- VMS, whatever CC it uses
- IBM's z/OS and AIX xlC https://en.wikipedia.org/wiki/IBM_XL_C/C%2B%2B_Compilers
- MSVC [contact @bulk88 to quickly fix it]
- Arm LLC's ArmCC
- Intel C for the GCC/POSIX environment
- Intel C for MSVC [egh... @bulk88 hasn't tried a build for 7 years; @bulk88's ICC is from 2013; worst case ICC for VC will do exactly what mainline MSVC will do]
What other Unix C compilers did I miss?
Some senior GCC dev said in a WWW post/article that GCC's policy is that `memcpy()` is the direct equivalent of `.toString()` or `.toJSON()`, and GCC's position is that any attempt to use ISO C's `union` grammar token, or to use a typedef with a `#pragma packed(1)` and a `union`, on GCC or any CC, is against GCC's official policy/position statement. IDK and I am not going to research what the other C compiler projects' technical policy/position statements are. I'm not a lawyer collecting discovery, depositions, and exhibits for trial.
Read my "attempting to use a typedef with a `#pragma packed(1)` and a `union`" sentence very carefully. Some ISO C/ANSI/IETF/IEEE/FAANG-related humans say that is correct and the GCC project is blatantly wrong.
Assuming `memcpy()` is ISO C's synonym for fast unaligned mem reads/writes in asm is highly illegal!!! The reason is that inlining libc.so's external-linkage `memcpy()` symbol violates all of the `/bin/ld` rules and all of the `LD_PRELOAD`/ELF symbol interposition rules.
According to https://developer.arm.com/documentation/ka003038/latest/ ARM LLC's official policy is that typedef/packed/union/pointer-cast is the correct C-language tool to use for safe unaligned memory access on their platform. They omitted any reference to GCC's policy/opinion of using the linker function symbol memcpy(). I'll note this makes sense, since ARM corporate probably isn't in a position to tell end users in their API docs to call some unknown author's unspecified libc implementation of memcpy() on an unknown OS that ARM and the ARM C/C++ Compiler devs have little to no control over.
Assuming `memcpy()` is ISO C's synonym for fast unaligned mem reads/writes in asm is highly illegal!!! The reason is that inlining libc.so's external-linkage `memcpy()` symbol violates all of the `/bin/ld` rules and all of the `LD_PRELOAD`/ELF symbol interposition rules.
Real-world dangers: what will these compilers do when blead Perl does a `memcpy(d, s, 4);`?
https://github.com/torvalds/linux/blob/v3.13/arch/x86/vdso/vclock_gettime.c#L272
Linux doesn't trust GCC's memcpy() identifier.
The kernel is in a different position from Perl: the kernel will get build failures without inlined memcpy.
https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/
Unless there are any showstopping mistakes in this PR, I'd like to merge it soon.
(Any further finessing can come in follow-up PRs.)