
pp_reverse - chunk-at-a-time string reversal

Open richardleach opened this issue 6 months ago • 21 comments

The performance characteristics of string reversal in blead are very variable, depending on the capabilities of the C compiler. Some compilers are able to vectorize some cases for better performance.

This commit introduces explicit reversal and swapping of whole registers at a time, which all builds seem to be able to benefit from.

Inspired by https://dev.to/wunk/fast-array-reversal-with-simd-j3p, this commit reuses the _swab_xx_ macros that already exist in perl.h. The bit shifting done by these macros should be portable and reasonably performant even if not optimised further, but it is likely that they will be optimised to bswap, rev, or movbe instructions.
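To make the idea concrete, here is a minimal sketch of chunk-at-a-time reversal into a separate destination buffer. It is not the code from this PR: `swab64_sketch` and `reverse_copy_sketch` are hypothetical names, the shift-and-mask swap only mirrors the spirit of the perl.h macros, and the loads and stores go through `memcpy` purely to keep the sketch well-defined C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reverse the byte order of a 64-bit value using
 * only shifts and masks. Compilers often turn this into one instruction. */
static uint64_t swab64_sketch(uint64_t v)
{
    /* swap adjacent bytes, then adjacent 16-bit pairs, then the halves */
    v = ((v & 0x00FF00FF00FF00FFULL) << 8)  | ((v >> 8)  & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
}

/* Hypothetical helper: write the len bytes of src into dst in reverse
 * order. The two buffers must not overlap. */
static void reverse_copy_sketch(char *dst, const char *src, size_t len)
{
    size_t done = 0;

    /* Take 8 bytes from the front of src, reverse them in a register,
     * and store them at the mirrored position at the back of dst. */
    while (len - done >= 8) {
        uint64_t chunk;
        memcpy(&chunk, src + done, 8);
        chunk = swab64_sketch(chunk);
        memcpy(dst + len - done - 8, &chunk, 8);
        done += 8;
    }

    /* Fewer than 8 bytes remain: finish one byte at a time. */
    while (done < len) {
        dst[len - done - 1] = src[done];
        done++;
    }
}
```

Swapping the bytes of a value reverses its in-memory representation on both little-endian and big-endian hosts, so the sketch does not depend on byte order.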

Some performance comparisons:

1. Large string reversal, with different source & destination buffers:

    my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse $x }

gcc blead:

          2,388.30 msec task-clock                       #    0.993 CPUs utilized
    10,574,195,388      cycles                           #    4.427 GHz
    61,520,672,268      instructions                     #    5.82  insn per cycle
    10,255,049,869      branches                         #    4.294 G/sec

clang blead:

            688.37 msec task-clock                       #    0.946 CPUs utilized
     3,161,754,439      cycles                           #    4.593 GHz
     8,986,420,860      instructions                     #    2.84  insn per cycle
       324,734,391      branches                         #  471.745 M/sec

gcc patched:

            408.39 msec task-clock                       #    0.936 CPUs utilized
     1,617,273,653      cycles                           #    3.960 GHz
     6,422,991,675      instructions                     #    3.97  insn per cycle
       644,856,283      branches                         #    1.579 G/sec

clang patched:

            397.61 msec task-clock                       #    0.924 CPUs utilized
     1,655,838,316      cycles                           #    4.165 GHz
     5,782,487,237      instructions                     #    3.49  insn per cycle
       324,586,437      branches                         #  816.350 M/sec

2. Large string reversal, but reversing the buffer in-place:

    my $x = "X"x(1024*1000*10); my $y; for (0..1_000) { $y = reverse "foo",$x }

gcc blead:

          6,038.06 msec task-clock                       #    0.996 CPUs utilized
    27,109,273,840      cycles                           #    4.490 GHz
    41,987,097,139      instructions                     #    1.55  insn per cycle
     5,211,350,347      branches                         #  863.083 M/sec

clang blead:

          5,815.86 msec task-clock                       #    0.995 CPUs utilized
    26,962,768,616      cycles                           #    4.636 GHz
    47,111,208,664      instructions                     #    1.75  insn per cycle
     5,211,117,921      branches                         #  896.018 M/sec

gcc patched:

          1,003.49 msec task-clock                       #    0.999 CPUs utilized
     4,298,242,624      cycles                           #    4.283 GHz
     7,387,822,303      instructions                     #    1.72  insn per cycle
       725,892,855      branches                         #  723.367 M/sec

clang patched:

            970.78 msec task-clock                       #    0.973 CPUs utilized
     4,436,489,695      cycles                           #    4.570 GHz
     8,028,374,567      instructions                     #    1.81  insn per cycle
       725,867,979      branches                         #  747.713 M/sec

3. Short string reversal, different source & destination (checking performance on smaller string reversals; note: this one's very variable due to noise):

    my $x = "1234567"; my $y; for (0..10_000_000) { $y = reverse $x }

gcc blead:

            401.20 msec task-clock                       #    0.916 CPUs utilized
     1,672,263,966      cycles                           #    4.168 GHz
     5,564,078,603      instructions                     #    3.33  insn per cycle
     1,250,983,219      branches                         #    3.118 G/sec

clang blead:

            380.58 msec task-clock                       #    0.998 CPUs utilized
     1,615,634,265      cycles                           #    4.245 GHz
     5,583,854,366      instructions                     #    3.46  insn per cycle
     1,300,935,443      branches                         #    3.418 G/sec

gcc patched:

            381.62 msec task-clock                       #    0.999 CPUs utilized
     1,566,807,988      cycles                           #    4.106 GHz
     5,474,069,670      instructions                     #    3.49  insn per cycle
     1,240,983,221      branches                         #    3.252 G/sec

clang patched:

            346.21 msec task-clock                       #    0.999 CPUs utilized
     1,600,780,787      cycles                           #    4.624 GHz
     5,493,773,623      instructions                     #    3.43  insn per cycle
     1,270,915,076      branches                         #    3.671 G/sec

  • This set of changes requires a perldelta entry, and one will be added after the 5.42 release.

richardleach avatar Jun 13 '25 23:06 richardleach

I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.

t-a-k avatar Jun 14 '25 05:06 t-a-k

I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.

Thanks, I'll look into wrapping this in architecture-based preprocessor guards.

richardleach avatar Jun 15 '25 00:06 richardleach

My PR at https://github.com/Perl/perl5/pull/23330 by coincidence is trying to ADD THE INTRINSICS to the Perl C API, that THIS PR WANTS to use. But the ticket has stalled.

50% of the reason there is no progress is b/c input or a vote is needed on the best name/identifier to use for a new macro.

50% of the reason is claims that Perl 5 has no valid production code reason to need to swap ASCII/binary bytes in the C level VM at runtime.

I don't actually think we need ntohll and htonll

Originally posted by @Leont in https://github.com/Perl/perl5/pull/23330#discussion_r2107692323

bulk88 avatar Jun 15 '25 12:06 bulk88

I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.

Perl has had the ./Configure probe for HW UA alphaldst memory since 2001, but there are no P5P tuits to actually use the #define.

My long write-up of 25 years of apathy: https://github.com/Perl/perl5/issues/22886

some random related links https://github.com/Perl/perl5/issues/16680 https://github.com/Perl/perl5/commit/e8864dba80952684bf3afe83438d4eee0c3939a9

https://www.nntp.perl.org/group/perl.perl5.porters/2010/03/msg157755.html

I've been complaining about this since 2015.

https://www.nntp.perl.org/group/perl.perl5.porters/2015/11/msg232849.html

https://github.com/Perl/perl5/issues/12565 My PRs to add atomics and UA got rejected, for what? To protect non-existent hardware that was melted down in Asia as eWaste 10-15 years ago. The needs of air-gapped Perl users are irrelevant. They are air-gapped and will never upgrade.

Who Here can take a selfie of themselves in 2025 with a post 2010 mfg-ed server or desktop that DOES NOT have HW UA MEM read support?

Alpha in 1999 already had hardware supported UA mem reads, as long as you use the appropriate CC typedecls/declspecs/attributes to pick the "special" unaligned instruction.

bulk88 avatar Jun 15 '25 13:06 bulk88

I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.

Which ones? I'm requesting .pdf links to the official Arch ISA specs that you are discussing.

bulk88 avatar Jun 15 '25 14:06 bulk88

SPARC and earlier ARMv5 or so? Anyway, just write it idiomatically with memcpy and let the compiler notice it (or use builtins).

It also introduces more aliasing issues (e.g. src accessed with both U64* and U32*) which is a shame, as it brings Perl further away from dropping -fno-strict-aliasing.

thesamesam avatar Jun 16 '25 00:06 thesamesam

I'm afraid that the patched code may cause portability issues, as it unconditionally uses unaligned memory accesses which may trigger severe performance loss or OS exceptions (SIGBUS) on some non-x86 architectures.

Thanks, I'll look into wrapping this in architecture-based preprocessor guards.

Even this isn't a great idea, because while the implementation is free to use unaligned access, that doesn't mean it's okay in C. If it vectorises it, it may assume aligned access and then trap. That's happened a few times recently (subversion, numpy).

thesamesam avatar Jun 16 '25 00:06 thesamesam

Unaligned access is undefined behaviour in C.

This can be a problem in practice even on x86, when the compiler decides to use aligned SSE/AVX instructions, as the author of this blogpost has found out.

The usual workaround is to use memcpy. For some algorithms (probably not this one) it's also possible to skip the first few bytes of input until the pointer is aligned.
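A minimal sketch of that workaround follows. The helper names `load_u64_unaligned` and `store_u64_unaligned` are hypothetical and are not taken from perl.h or from this PR. The point is that the pointer is never dereferenced as a `uint64_t *`, so the compiler may not assume it is aligned.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 8 bytes from an address of any alignment.
 * Many compilers fold this fixed-size memcpy into a single load where
 * the target permits it, but that is an optimisation, not a guarantee. */
static uint64_t load_u64_unaligned(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: write 8 bytes to an address of any alignment. */
static void store_u64_unaligned(void *p, uint64_t v)
{
    memcpy(p, &v, sizeof v);
}
```

By contrast, `*(const uint64_t *)p` on a misaligned `p` is the undefined-behaviour case described above, even on hardware that tolerates unaligned loads.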

xenu avatar Jun 16 '25 00:06 xenu

Although it should be noted that memcpy doesn't help with the "unaligned reads are implemented in software and super slow" problem. I don't know which architectures suffer from this.

xenu avatar Jun 16 '25 01:06 xenu

SPARC and earlier ARMv5 or so? Anyway, just write it idiomatically with memcpy and let the compiler notice it (or use builtins).

Wrong. After 15 minutes of trying I couldn't find unaligned HW mem R/W in SPARC V7 Sept 1988 ISA.

But I did find this.

SPARC V9 ISA May 2002

[Screenshots: sparcv9_1, sparcv9_2]

If P5P voted to Remove Windows 2000/XP/Server 2003 from Perl, that rule also applies to all same vintage Unix boxes from National Museum of Technology And Science of whatever country you live in.

ARM V5, or what I personally call an "ARM V4 CPU", is an obsolete eWaste/museum CPU. iPhone 1 and Android 1.0 introduced ARM V7 ISA as the minimum platform. The last commercial use of the ARM v4 ISA was the ubiquitous ARM7TDMI CPU inside LG, Nokia, Motorola and Samsung 2G and 3G flip phones; that ARM7TDMI is what ran your J2ME flip phone apps. WinCE Perl was compiled as ARMv4 by default. I only once set the MSVC cmd line arg to ARMv5 Thumb (2 byte instruction format) as an experiment, but perl5XX.dll grew some KB in disk size, and I decided to never play around with ARMv5 Thumb for Perl again. Maybe >= 2010s ARM32 V5 Thumb compilers like Clang have the intelligence to compile each C function as both 4 byte legacy ARMv4 and 2 byte ARMv5, with the linker picking the smaller machine code version at link time, but I've never heard of my wishlist C compiler feature actually existing in any production big name C compiler. In any case, ARM32 Thumb is EOL/obsolete in 2025. The 2 byte instruction format was removed by ARM LLC in the ARM32 to ARM64 switch.

I've researched and created screenshots for ancient Alpha and ancient SPARC ISA so far. My ARM info above is from memory, not fresh research. I don't remember if ARM v4 has dedicated 1 op HW unaligned memory reads/writes, or some magical and fast enough multi opcode unaligned mem read/write asm code recipe that is better and faster than U32 u32var = p[0] | p[1] << 8 | p[2] << 16 | p[3] << 24;. The recipe is something like * U32 u32var = (U32) ( *((U64*)(ptr & ~0x7) <<ROLL()<< ((ptr & 0x7) * 8) ); but I don't care and Perl doesn't care; that's a CC domain problem, not a Perl in C problem.

In any case, the 2 platforms @thesamesam mentioned have some non-ISO-C-Abstract-Machine CC vendor specific "fast enough" C grammar token to do unaligned mem R/Ws. Someone just has to find it in that C Compiler's man page.

FSF LLC and GNU LLC's proprietary CC vendor specific "fast enough" C grammar token to do unaligned mem R/Ws has the ASCII string name of memcpy. But this is proprietary to the GPL 3 repo published by GNU and isn't blindly universally portable to all other C compilers.

MSVC instead wants you to use *(( __unaligned unsigned __int64 *)p) on WinCE for ARM32 and ancient WinNT IA64/Alpha/SPARC/MIPS. *(( __unaligned unsigned __int64 *)p) is a syntax error in a .i file for MSVC x64 mode, MSVC WinNT ARM32 mode and MSVC WinNT ARM64 mode. Mingw GCC's and MSVC's .h headers instantly #define __unaligned \n to nothing. MSVC i386 mode ignores __unaligned in a .i file but doesn't syntax error like MSVC for x64 / NT ARM32 / NT ARM64. I researched all of this, but never tried to prove it true. It's irrelevant for me, my private C/C++ code, or Perl C code. I and everyone else should just write the __unaligned if MSVC is detected, even if it does nothing nowadays and MS's std headers NOOP it away. Maybe one day MS will reintroduce __unaligned as best practice or mandatory. Who knows, it's harmless to write it in the 2000s-2025 time period.

If we are arguing about ISO C Abstract Virtual Machine CPU compliance, the C compiler published by GNU/FSF is INVALID and non-conformant to ISO C.

Easy demonstration: Add a char * memcpy(char* d, const char * s, size_t n) {} to perl's util.c file that logs every single call to STDERR. If any executions of memcpy() in Perl's C code, do not appear in my shell console, the C Compiler is NON-CONFORMANT and INVALID and should be promptly blocked and blacklisted in perl's ./Configure.

Perl's only project scope is to run on modern enough hardware, not on the https://en.wikipedia.org/wiki/AN/FSQ-7_Combat_Direction_Central OS or ISO C Abstract Virtual OS or https://en.wikipedia.org/wiki/Apple_II

bulk88 avatar Jun 16 '25 06:06 bulk88

Wrong. After 15 minutes of trying I couldn't find unaligned HW mem R/W in SPARC V7 Sept 1988 ISA.

I was saying that SPARC aggressively traps on unaligned access, not that it has instructions for that.

thesamesam avatar Jun 16 '25 06:06 thesamesam

Wrong. After 15 minutes of trying I couldn't find unaligned HW mem R/W in SPARC V7 Sept 1988 ISA.

I was saying that SPARC aggressively traps on unaligned access, not that it has instructions for that.

Yeah, RISC CPUs are expected to, and it's normal for them to, throw an exception/sig handler each time you do a vanilla C machine type >1 byte mem read/write in C lang on an unaligned addr. My argument is that all CCs have some magic special token/grammar way to safely and quickly do unaligned U16/U32/U64 mem reads and writes. So the end user dev just has to write those special magical tokens/grammar things as needed in C src code and the issue goes away. = ( [] << | [] << | [] << | [] ) is the worst possible but still acceptable solution to use in C to deal with UA memory reads/writes of > 1 byte data values.
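Spelled out, the byte-at-a-time fallback looks roughly like the sketch below. The function name `load_u32_le_bytewise` is hypothetical, and the little-endian byte order is an assumption chosen for illustration. Each access is a single-byte read, so no alignment requirement applies on any architecture.

```c
#include <stdint.h>

/* Hypothetical helper: assemble a 32-bit little-endian value from four
 * single-byte reads. Each byte is widened to uint32_t before shifting,
 * so the << 24 cannot overflow a signed int after integer promotion. */
static uint32_t load_u32_le_bytewise(const unsigned char *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

The result is the same on little-endian and big-endian hosts, because the byte order is fixed by the shifts and not by the machine.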

bulk88 avatar Jun 16 '25 06:06 bulk88

I vaguely remember from ARMv4/WinCE Perl: if someone does an unaligned U16 or U32 mem read on ARMv4, you get a free shift and a free mask as a gift from ARMv4. You don't get a SEGV on ARMv4 from UA mem reads. I think the reason, not going to google it, was b/c this was how ARMv4 wanted you to do UA mem reads.

U32 u32_var =    *(U32*) ua_32p
                            | *((U32*)  (((char*) ua_32p)+3)  );

Not safely and correctly SEGVing like I assumed WinCE Perl on ARM would, but instead finding a junk integer in my MSVC watch window, while step debugging in VS IDE was a learning experience for me.

bulk88 avatar Jun 16 '25 06:06 bulk88

Although it should be noted that memcpy doesn't help with the "unaligned reads are implemented in software and super slow" problem. I don't know which architectures suffer from this.

@bulk88 makes note to self to finally turn on intrinsic memcpy(d,s, n <= 16); optimization for all MSVC compilers. @bulk88 doesn't know why he didn't make that PR on day 1 of his WinPerl in C hobby, since all his day job employer's binaries made with MSVC have it turned on.

bulk88 avatar Jun 16 '25 06:06 bulk88

Although it should be noted that memcpy doesn't help with the "unaligned reads are implemented in software and super slow" problem. I don't know which architectures suffer from this.

A bunch of links to the topic. Since C language UA mem reads/writes are a Day 1 Win32 platform requirement, all RISC WinNT kernels ever compiled will trap, emulate and resume the CPU's SIGILL/SIGBUS. I can't find the article on Google, but it was a 100x or 1000x wall time delay for the NT kernel to emulate UA mem reads, vs the C compiler emitting 3-6 more CPU opcodes in the binary when the dev declares an UA C machine type.

A bunch of links, the short summary is, go read your C compiler's docs on how to write your C code correctly using VENDOR SPECIFIC C syntax.

https://devblogs.microsoft.com/oldnewthing/20170814-00/?p=96806
https://devblogs.microsoft.com/oldnewthing/20220810-00/?p=106958
https://devblogs.microsoft.com/oldnewthing/20180409-00/?p=98465
https://devblogs.microsoft.com/oldnewthing/20210611-00/?p=105299
https://devblogs.microsoft.com/oldnewthing/20190821-00/?p=102794
https://devblogs.microsoft.com/oldnewthing/20200103-00/?p=103290
https://devblogs.microsoft.com/oldnewthing/20250605-00/?p=111250
https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/avoiding-misalignment-of-fixed-precision-data-types
https://learn.microsoft.com/en-us/windows/win32/winprog64/fault-alignments
https://wiki.debian.org/ArmEabiFixes#word_accesses_must_be_aligned_to_a_multiple_of_their_size
https://web.archive.org/web/20090204204507/http://lecs.cs.ucla.edu/wiki/index.php/XScale_alignment
https://devblogs.microsoft.com/oldnewthing/20040830-00/?p=38013

FWIW, here is a list of instructions my Win 7 x64 kernel is capable of emulating. IDK why this emulation code or this switch ladder exists in my Win 7 x64 kernel; I don't have a reason to need to know this. Remember this is closed source SW. But there must be bizarre, very rare cases where, for whatever reason, the Win 7 x64 kernel needs to silently fix up the CPU hardware fault/signal/interrupt/exception/event handler, whatever you call it.

Update, this instruction list may or may not have something to do with an x64 CPU executing temporarily in 16 bit Real Mode, probably for WinNT's Win16 emulator ntvdm.exe. Good enough answer for my curiosity to be satisfied.

``` XmAaaOp XmAadOp XmAamOp XmAasOp XmAccumImmediate XmAccumRegister XmAdcOp XmAddOp XmAddOperands XmBitScanGeneral XmBoundOp XmBsfOp XmBsrOp XmBswapOp XmBtOp XmBtcOp XmBtrOp XmBtsOp XmByteImmediate XmCallOp XmCbwOp XmClcOp XmCldOp XmCliOp XmCmcOp XmCmpsOp XmCmpxchgOp XmCompareOperands XmCwdOp XmDaaOp XmDasOp XmDecOp XmDivOp XmEffectiveOffset XmEmulateStream XmEnterOp XmEvaluateAddressSpecifier XmEvaluateIndexSpecifier XmExecuteInt1a XmFlagsRegister XmGeneralBitOffset XmGeneralRegister XmGetCodeByte XmGetImmediateSourceValue XmGetLongImmediate XmGetOffsetAddress XmGetStringAddress XmGetWordImmediate XmGroup1General XmGroup1Immediate XmGroup2By1 XmGroup2ByByte XmGroup2ByCL XmGroup3General XmGroup45General XmGroup7General XmGroup8BitOffset XmHltOp XmIdivOp XmIllOp XmImmediateEnter XmImmediateJump XmImulImmediate XmImulOp XmImulxOp XmInOp XmIncOp XmInsOp XmInt1aFindPciClassCode XmInt1aFindPciDevice XmInt1aReadConfigRegister XmInt1aWriteConfigRegister XmIntOp XmIretOp XmJcxzOp XmJmpOp XmJxxOp XmLahfOp XmLeaveOp XmLoadSegment XmLodsOp XmLongJump XmLoopOp XmMovOp XmMoveGeneral XmMoveImmediate XmMoveRegImmediate XmMoveSegment XmMoveXxGeneral XmMovsOp XmMulOp XmNegOp XmNotOp XmOpcodeEscape XmOpcodeRegister XmOrOp XmOutOp XmOutsOp XmPopGeneral XmPopOp XmPopStack XmPopaOp XmPortDX XmPortImmediate XmPrefixOpcode XmPushImmediate XmPushOp XmPushPopSegment XmPushStack XmPushaOp XmRclOp XmRcrOp XmRdtscOp XmRetOp XmRolOp XmRorOp XmSahfOp XmSarOp XmSbbOp XmScasOp XmSegmentOffset XmSetDataType XmSetLogicalResult XmSetccByte XmShiftDouble XmShlOp XmShldOp XmShortJump XmShrOp XmShrdOp XmSmswOp XmStcOp XmStdOp XmStiOp XmStosOp XmStringOperands XmSubOp XmSubOperands XmSxxOp XmTestOp XmXaddOp XmXchgOp XmXlatOpcode XmXorOp ```

bulk88 avatar Jun 16 '25 07:06 bulk88

I will also add, Perl/P5P devs might be able to, in C lang, in certain permutations of CC, OS and CPU, outsmart a C compiler's native UA mem read/write "best practices" algorithm. The engineering question is: is it legal to use 2 aligned U32 mem reads to read an unaligned U32?

What if an unknown unnamed CPU has a hardware MMU virtual memory page size of 1 byte?

You can get one of these CPUs with a 1 byte page size MMU right now for free at https://github.com/google/sanitizers/wiki/addresssanitizer or https://en.wikipedia.org/wiki/Valgrind

In real life, as long as you aren't over-reading at the very end of a malloc block at address 4096-1 or 4096-2, 2 x U32 aligned reads are harmless. Malloc() won't hand out a 4 or 5 byte aligned memory block pressed against a 4096 boundary when you execute malloc(5).

IDK of a real world malloc() impl that hands out raw 4096 aligned mempages for requests >= 4096, with an unmapped mem page at `malloc(0x1000)-1`.

But someone should give the real reverse-engineered meaning of this OSX API doc; read section "Allocating Large Memory Blocks using Malloc" at https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/MemoryAlloc.html#//apple_ref/doc/uid/20001881-99765

But regarding the perl C API, remember that a fictional malloc(1) pressed against a 4096 boundary is illegal and impossible in the Perl C API. A fictional malloc(1) means SvPVX(sv)[0] with that sv being SvCUR(sv) == 0 || SvCUR(sv) == 1 is not '\0' null terminated, and strlen(SvPVX(sv)) is 100% guaranteed to SEGV. Perl could never execute on such an architecture or such a libc impl.

bulk88 avatar Jun 16 '25 07:06 bulk88

Here is a reason why memcpy() and memcmp() can't be blindly assumed to be a magical UA memory compliance tool and a C dev STILL needs to read their CC's vendor specific docs no matter what.

Here is a list of all call sites to memcmp() symbol in the UCRT dll from my -O1 MSVC blead perl.

A couple calls in 1 func are fine, but anything over 5 calls points to big problems with P5P macros and C code, example of 1 func

S_handle_possible_posix+1013
S_handle_possible_posix+1049
S_handle_possible_posix+107B
S_handle_possible_posix+10A8
S_handle_possible_posix+10D5
S_handle_possible_posix+1102
S_handle_possible_posix+112F
S_handle_possible_posix+1155
S_handle_possible_posix+1176
S_handle_possible_posix+1199
S_handle_possible_posix+F3C
S_handle_possible_posix+FB5
S_handle_possible_posix+FD0
S_handle_possible_posix+FEB

I know someone assumed GCC's inline memcmp() optimization is part of the ISO C TC's reference unit tests for C89. It is not.

full list of all call sites to memcmp() is hidden below

``` PerlIO_find_layer+6C Perl_Gv_AMupdate+2AF Perl_Gv_AMupdate+2D4 Perl__invlistEQ+BB Perl__to_utf8_fold_flags+1EF Perl__to_utf8_fold_flags+242 Perl_allocmy+25C Perl_ck_method+12B Perl_cv_ckproto_len_flags+183 Perl_do_uniprop_match+B5 Perl_do_uniprop_match+DB Perl_fbm_instr+15D Perl_fbm_instr+184 Perl_fbm_instr+245 Perl_fbm_instr+DE Perl_foldEQ_utf8_flags+30B Perl_get_and_check_backslash_N_name+26E Perl_grok_number_flags+437 Perl_grok_numeric_radix+1D0 Perl_gv_add_by_type+8B Perl_gv_autoload_pvn+53 Perl_gv_fetchmeth_pvn_autoload+6C Perl_gv_fetchmethod_pvn_flags+25F Perl_gv_fetchmethod_pvn_flags+D7 Perl_gv_fetchpvn_flags+25E Perl_gv_fullname4+BB Perl_gv_init_pvn+2E6 Perl_gv_setref+51C Perl_hv_common+A80 Perl_hv_ename_add+1A7 Perl_hv_ename_add+D1 Perl_hv_ename_delete+16B Perl_hv_ename_delete+1BF Perl_hv_ename_delete+AA Perl_is_utf8_FF_helper_+A8 Perl_leave_scope+875 Perl_magic_sethook+7B Perl_magic_sethook+AA Perl_magic_setsig+2BA Perl_magic_setsig+2EC Perl_magic_setsig+93 Perl_magic_setsig+DE Perl_mro_isa_changed_in+1E9 Perl_mro_method_changed_in+170 Perl_mro_package_moved+23C Perl_mro_package_moved+FF Perl_ninstr+78 Perl_pad_findmy_pvn+C2 Perl_pp_gelem+122 Perl_pp_gelem+1A1 Perl_pp_gelem+1CB Perl_pp_gelem+1FD Perl_pp_gelem+22F Perl_pp_gelem+257 Perl_pp_gelem+27D Perl_pp_gelem+2A4 Perl_pp_gelem+2FB Perl_pp_gelem+D1 Perl_pp_lc+465 Perl_pp_prototype+64 Perl_pp_ucfirst+30D Perl_pp_ucfirst+6EB Perl_re_intuit_start+22D Perl_re_op_compile+50A Perl_refcounted_he_chain_2hv+C1 Perl_refcounted_he_fetch_pvn+E9 Perl_regexec_flags+A93 Perl_regexec_flags+AB9 Perl_rninstr+90 Perl_save_gp+AE Perl_scan_str+629 Perl_scan_str+660 Perl_scan_str+696 Perl_scan_str+6E4 Perl_sv_cmp_flags+159 Perl_sv_cmp_locale_flags+B6 Perl_sv_eq_flags+1A7 Perl_sv_gets+605 Perl_sv_gets+7F7 Perl_upg_version+1E1 Perl_whichsig_pvn+47 Perl_xs_handshake+119 S_apply_builtin_cv_attribute+45 S_apply_builtin_cv_attribute+69 S_apply_builtin_cv_attribute+92 S_do_chomp+314 S_dofindlabel+2AE S_doopen_pm+D6 
S_dopoptolabel+14D S_find_in_my_stash+37 S_force_word+1D4 S_glob_assign_glob+23F S_gv_fetchmeth_internal+269 S_gv_fetchmeth_internal+387 S_gv_fetchmeth_internal+3EC S_gv_magicalize+130 S_gv_magicalize+1A0 S_gv_magicalize+1E7 S_gv_magicalize+203 S_gv_magicalize+224 S_gv_magicalize+244 S_gv_magicalize+2DD S_gv_magicalize+30A S_gv_magicalize+32B S_gv_magicalize+37B S_gv_magicalize+3E3 S_gv_magicalize+40D S_gv_magicalize+435 S_gv_magicalize+456 S_gv_magicalize+4DE S_gv_magicalize+521 S_gv_magicalize+5AF S_gv_magicalize+5EB S_gv_magicalize+60C S_gv_magicalize+65E S_gv_magicalize+698 S_gv_magicalize+6C8 S_gv_magicalize+79F S_gv_magicalize+8D4 S_handle_possible_posix+1013 S_handle_possible_posix+1049 S_handle_possible_posix+107B S_handle_possible_posix+10A8 S_handle_possible_posix+10D5 S_handle_possible_posix+1102 S_handle_possible_posix+112F S_handle_possible_posix+1155 S_handle_possible_posix+1176 S_handle_possible_posix+1199 S_handle_possible_posix+F3C S_handle_possible_posix+FB5 S_handle_possible_posix+FD0 S_handle_possible_posix+FEB S_hv_delete_common+36A S_hv_delete_common+4D8 S_hv_delete_common+53E S_incline+F6 S_is_codeset_name_UTF8+3A S_is_codeset_name_UTF8+78 S_is_codeset_name_UTF8+8F S_is_utf8_overlong+67 S_magic_sethint_feature+11A S_magic_sethint_feature+14E S_magic_sethint_feature+180 S_magic_sethint_feature+1AC S_magic_sethint_feature+1DB S_magic_sethint_feature+20D S_magic_sethint_feature+23F S_magic_sethint_feature+26E S_magic_sethint_feature+29F S_magic_sethint_feature+311 S_magic_sethint_feature+341 S_magic_sethint_feature+375 S_magic_sethint_feature+3A9 S_magic_sethint_feature+3D7 S_magic_sethint_feature+402 S_magic_sethint_feature+42E S_magic_sethint_feature+45E S_magic_sethint_feature+492 S_magic_sethint_feature+4C6 S_magic_sethint_feature+4FA S_magic_sethint_feature+526 S_magic_sethint_feature+552 S_magic_sethint_feature+582 S_magic_sethint_feature+5B2 S_magic_sethint_feature+5DC S_magic_sethint_feature+77 S_magic_sethint_feature+E9 
S_maybe_add_coresub+5F6 S_mayberelocate+107 S_mayberelocate+50 S_move_proto_attr+21C S_move_proto_attr+CA S_pad_check_dup+1F1 S_pad_check_dup+B7 S_pad_findlex+DE S_parse_LC_ALL_string+14F S_parse_uniprop_string+1091 S_parse_uniprop_string+10B5 S_parse_uniprop_string+178 S_parse_uniprop_string+1942 S_parse_uniprop_string+31A S_parse_uniprop_string+41B S_parse_uniprop_string+440 S_parse_uniprop_string+548 S_parse_uniprop_string+566 S_parse_uniprop_string+B6D S_parse_uniprop_string+BAF S_parse_uniprop_string+BCD S_parse_uniprop_string+BF5 S_parse_uniprop_string+C0E S_parse_uniprop_string+C29 S_parse_uniprop_string+C87 S_parse_uniprop_string+CE2 S_parse_uniprop_string+D2F S_parse_uniprop_string+D8A S_parse_uniprop_string+E07 S_parse_uniprop_string+E77 S_reg+100D S_reg+1038 S_reg+1063 S_reg+1093 S_reg+10C0 S_reg+10E7 S_reg+110E S_reg+11C0 S_reg+4F1 S_reg+523 S_reg+550 S_reg+576 S_reg+5A7 S_reg+5CE S_reg+64D S_reg+66E S_reg+699 S_reg+6B6 S_reg+6E1 S_reg+706 S_reg+72C S_reg+749 S_reg+77E S_reg+79F S_reg+7CA S_reg+7EB S_reg+94C S_reg+A4F S_reg+FE2 S_regatom+8AF S_regclass+16D7 S_regmatch+15A2 S_regmatch+2AE7 S_regmatch+3568 S_regrepeat+159E S_regrepeat+1674 S_regrepeat+79C S_require_file+13C7 S_require_file+1C0B S_require_file+1C41 S_require_file+6BE S_scan_ident+2FA S_setup_EXACTISH_ST+1006 S_setup_EXACTISH_ST+108F S_setup_EXACTISH_ST+108F S_setup_EXACTISH_ST+10B0 S_setup_EXACTISH_ST+10B0 S_setup_EXACTISH_ST+2123 S_setup_EXACTISH_ST+FC0 S_share_hek_flags+88 S_swallow_bom+D0 S_turkic_fc+43 S_turkic_lc+50 S_turkic_uc+43 S_unpack_rec+1C5B S_unshare_hek_or_pvn+10C XS_NamedCapture_TIEHASH+DD XS_PerlIO_get_layers+14C XS_PerlIO_get_layers+1A8 XS_PerlIO_get_layers+F2 hek_eq_pvn_flags+7B set_w32_module_name+B3 yyl_do+B4 yyl_fake_eof+159 yyl_foreach+1AE yyl_foreach+1EF yyl_foreach+320 yyl_foreach+38F yyl_hyphen+CB yyl_interpcasemod+72 yyl_interpcasemod+A3 yyl_keylookup+109 yyl_leftcurly+7E3 yyl_my+2DC yyl_my+317 yyl_slash+114 yyl_try+860 yyl_try+9EB yyl_try+AB3 yyl_try+B19 ```

bulk88 avatar Jun 16 '25 10:06 bulk88

It also introduces more aliasing issues (e.g. src accessed with both U64* and U32*) which is a shame, as it brings Perl further away from dropping -fno-strict-aliasing.

I don't believe it's possible to write production grade C (not C++) SW without -fno-strict-aliasing. You would be removing the (type_t) operator from C89/C99's grammar to write a non-crashing -fyes-strict-aliasing build. Perl isn't a .o/TU file holding the hottest guts of an AES/MPEG/JPG codec where -fyes-strict-aliasing could make a benchmarkable improvement.

Either don't use GCC's -O3 / -O4 flag (-O2 ??? -O1 ???) or understand you have to live with adding the -fno-strict-aliasing flag. JAPHs, GOLFs, and educational demos and samples aren't production grade C SW.

If you remove (type_t) and & operators from C, you now have the JavaScript ISA/WASM/ECMAScript/V8 virtual machine and those aren't C, even if an AES decryption lib written in JS/nodeJS happens to benchmark identical to a GNU C AES decryption lib on 1 particular build number of Chrome/Node engine.

bulk88 avatar Jun 16 '25 11:06 bulk88

I'll also add byte order swapping, more specifically swapping the byte order of a 64 bit variable aka htonll()/ntohll(), was a patented technology until 2022 and perl has been doing some low level SW piracy for decades at

https://github.com/Perl/perl5/blob/blead/perl.h#L4601

See this patent by Intel and compare it to Perl's math operators above:

https://patents.google.com/patent/US20040010676A1/en

cough cough pro- or anti-patent troll cough cough

bulk88 avatar Jun 16 '25 16:06 bulk88

MSVC fix branch pushed at https://github.com/Perl/perl5/pull/23374

The reason for all the MSVC-specific code: here is the very poor code gen from round 1, after getting the syntax errors fixed.

48 8B D1                   mov     rdx, rcx
4C 8B C7                   mov     r8, rdi         ; Size
48 8D 4D F7                lea     rcx, [rbp+57h+var_60] ; Dst
FF 15 8E 85 09 00          call    cs:__imp_memcpy
48 8B 55 DF                mov     rdx, [rbp+57h+Src] ; Src
48 8D 4D EF                lea     rcx, [rbp+57h+Dst] ; Dst
4C 8B C7                   mov     r8, rdi         ; Size
FF 15 7D 85 09 00          call    cs:__imp_memcpy
48 8B 55 F7                mov     rdx, [rbp+57h+var_60]
41 B9 00 FF 00 00          mov     r9d, 0FF00h
4C 8B C2                   mov     r8, rdx
48 8B C2                   mov     rax, rdx
48 C1 E8 10                shr     rax, 10h
4D 23 C7                   and     r8, r15
4C 0B C0                   or      r8, rax
48 8B CA                   mov     rcx, rdx
49 C1 E8 10                shr     r8, 10h
48 8B C2                   mov     rax, rdx
48 23 C3                   and     rax, rbx
48 C1 E1 10                shl     rcx, 10h
4C 0B C0                   or      r8, rax
41 BA 00 00 FF 00          mov     r10d, 0FF0000h
49 C1 E8 10                shr     r8, 10h
48 8B C2                   mov     rax, rdx
49 23 C5                   and     rax, r13
4C 0B C0                   or      r8, rax
48 8B C2                   mov     rax, rdx
49 23 C1                   and     rax, r9
49 C1 E8 08                shr     r8, 8
48 0B C8                   or      rcx, rax
48 8B C2                   mov     rax, rdx
48 C1 E1 10                shl     rcx, 10h
49 23 C2                   and     rax, r10
48 0B C8                   or      rcx, rax
49 23 D6                   and     rdx, r14
48 C1 E1 10                shl     rcx, 10h
48 0B CA                   or      rcx, rdx
48 8B 55 EF                mov     rdx, [rbp+57h+Dst]
48 C1 E1 08                shl     rcx, 8
48 8B C2                   mov     rax, rdx
4C 0B C1                   or      r8, rcx
48 C1 E8 10                shr     rax, 10h
4C 89 45 F7                mov     [rbp+57h+var_60], r8
48 8B CA                   mov     rcx, rdx
48 C1 E1 10                shl     rcx, 10h
4C 8B C2                   mov     r8, rdx
4D 23 C7                   and     r8, r15
4C 0B C0                   or      r8, rax
48 8B C2                   mov     rax, rdx
48 23 C3                   and     rax, rbx
49 C1 E8 10                shr     r8, 10h
4C 0B C0                   or      r8, rax
48 8B C2                   mov     rax, rdx
49 23 C5                   and     rax, r13
49 C1 E8 10                shr     r8, 10h
4C 0B C0                   or      r8, rax
48 8B C2                   mov     rax, rdx
49 23 C1                   and     rax, r9
49 C1 E8 08                shr     r8, 8
48 0B C8                   or      rcx, rax
48 8B C2                   mov     rax, rdx
48 C1 E1 10                shl     rcx, 10h
49 23 D6                   and     rdx, r14
49 23 C2                   and     rax, r10
48 0B C8                   or      rcx, rax
48 C1 E1 10                shl     rcx, 10h
48 0B CA                   or      rcx, rdx
48 8D 55 EF                lea     rdx, [rbp+57h+Dst] ; Src
48 C1 E1 08                shl     rcx, 8
4C 0B C1                   or      r8, rcx
48 8B 4D CF                mov     rcx, [rbp+57h+e] ; Dst
4C 89 45 EF                mov     [rbp+57h+Dst], r8
4C 8B C7                   mov     r8, rdi         ; Size
FF 15 98 84 09 00          call    cs:__imp_memcpy
48 8B 4D DF                mov     rcx, [rbp+57h+Src] ; Dst
48 8D 55 F7                lea     rdx, [rbp+57h+var_60] ; Src
4C 8B C7                   mov     r8, rdi         ; Size
FF 15 87 84 09 00          call    cs:__imp_memcpy
48 01 7D DF                add     [rbp+57h+Src], rdi
48 03 F7                   add     rsi, rdi
4C 2B E7                   sub     r12, rdi
48 8B 4D CF                mov     rcx, [rbp+57h+e]
49 8B C4                   mov     rax, r12
48 2B CF                   sub     rcx, rdi
48 2B C6                   sub     rax, rsi
48 89 4D CF                mov     [rbp+57h+e], rcx
48 83 F8 10                cmp     rax, 10h
0F 83 C4 FE FF FF          jnb     loc_140090DA2
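
For orientation, the long runs of shr/and/or/shl in a listing like this are what a compiler tends to emit when it does not recognise a shift-and-mask byte swap as a single instruction. A minimal sketch of such a swap in C follows; `swap64_portable` is a hypothetical name used only for illustration and is not the macro from perl.h.

```c
#include <stdint.h>

/* Hypothetical helper (not Perl's actual macro): reverse the byte order
 * of a 64-bit value using only masks, shifts and ORs. A compiler that
 * recognises the idiom can collapse this to one bswap/rev instruction;
 * one that does not will emit the individual shifts and masks. */
static uint64_t swap64_portable(uint64_t v)
{
    return ((v & UINT64_C(0x00000000000000FF)) << 56)
         | ((v & UINT64_C(0x000000000000FF00)) << 40)
         | ((v & UINT64_C(0x0000000000FF0000)) << 24)
         | ((v & UINT64_C(0x00000000FF000000)) <<  8)
         | ((v & UINT64_C(0x000000FF00000000)) >>  8)
         | ((v & UINT64_C(0x0000FF0000000000)) >> 24)
         | ((v & UINT64_C(0x00FF000000000000)) >> 40)
         | ((v & UINT64_C(0xFF00000000000000)) >> 56);
}
```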

here is round 2 after turning on inline memcpy, still bad code gen


4D 8B 0C 13             mov     r9, [r11+rdx]
49 83 EA 08             sub     r10, 8
49 8B 14 0B             mov     rdx, [r11+rcx]
48 83 C7 08             add     rdi, 8
4C 8B C2                mov     r8, rdx
48 8B C2                mov     rax, rdx
48 C1 E8 10             shr     rax, 10h
48 8B CA                mov     rcx, rdx
48 C1 E1 10             shl     rcx, 10h
4C 23 C3                and     r8, rbx
4C 0B C0                or      r8, rax
48 8B C2                mov     rax, rdx
49 23 C6                and     rax, r14
49 C1 E8 10             shr     r8, 10h
4C 0B C0                or      r8, rax
48 8B C2                mov     rax, rdx
48 23 C6                and     rax, rsi
49 C1 E8 10             shr     r8, 10h
4C 0B C0                or      r8, rax
48 8B C2                mov     rax, rdx
49 23 C4                and     rax, r12
49 C1 E8 08             shr     r8, 8
48 0B C8                or      rcx, rax
48 8B C2                mov     rax, rdx
49 23 C5                and     rax, r13
48 C1 E1 10             shl     rcx, 10h
48 0B C8                or      rcx, rax
49 23 D7                and     rdx, r15
48 8B 45 58             mov     rax, [rbp+dsv]
48 C1 E1 10             shl     rcx, 10h
48 0B CA                or      rcx, rdx
49 8B D1                mov     rdx, r9
48 C1 E1 08             shl     rcx, 8
48 23 D3                and     rdx, rbx
4C 0B C1                or      r8, rcx
49 8B C9                mov     rcx, r9
4C 89 00                mov     [rax], r8
49 8B C1                mov     rax, r9
48 C1 E8 10             shr     rax, 10h
48 0B D0                or      rdx, rax
48 C1 E1 10             shl     rcx, 10h
48 C1 EA 10             shr     rdx, 10h
49 8B C1                mov     rax, r9
49 23 C6                and     rax, r14
48 0B D0                or      rdx, rax
49 8B C1                mov     rax, r9
48 23 C6                and     rax, rsi
48 C1 EA 10             shr     rdx, 10h
48 0B D0                or      rdx, rax
49 8B C1                mov     rax, r9
49 23 C4                and     rax, r12
48 C1 EA 08             shr     rdx, 8
48 0B C8                or      rcx, rax
49 8B C1                mov     rax, r9
48 C1 E1 10             shl     rcx, 10h
49 23 C5                and     rax, r13
48 0B C8                or      rcx, rax
4D 23 CF                and     r9, r15
48 C1 E1 10             shl     rcx, 10h
49 8B C2                mov     rax, r10
49 0B C9                or      rcx, r9
48 2B C7                sub     rax, rdi
48 C1 E1 08             shl     rcx, 8
48 0B D1                or      rdx, rcx
48 8B 4D 60             mov     rcx, [rbp+arg_18]
48 89 11                mov     [rcx], rdx
48 83 C1 08             add     rcx, 8
48 8B 55 58             mov     rdx, [rbp+dsv]
48 83 EA 08             sub     rdx, 8
48 89 4D 60             mov     [rbp+arg_18], rcx
48 89 55 58             mov     [rbp+dsv], rdx
48 83 F8 10             cmp     rax, 10h
0F 83 06 FF FF FF       jnb     loc_14009093C

after all of my tweaks it looks perfect or identical to whatever GCC and Clang would emit

4B 8B 04 08             mov     rax, [r8+r9]
48 83 EA 08             sub     rdx, 8
4B 8B 0C 10             mov     rcx, [r8+r10]
48 83 C7 08             add     rdi, 8
48 0F C8                bswap   rax
49 89 02                mov     [r10], rax
4D 8D 52 F8             lea     r10, [r10-8]
48 8B C2                mov     rax, rdx
48 2B C7                sub     rax, rdi
48 0F C9                bswap   rcx
49 89 09                mov     [r9], rcx
4D 8D 49 08             lea     r9, [r9+8]
48 83 F8 10             cmp     rax, 10h
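
A rough C sketch of the kind of loop that compiles down to a load / bswap / store pair like the one above. This is an in-place illustration under my own assumptions, not the code in the PR: `reverse_bytes_inplace`, `SWAP64` and `swap64_fallback` are hypothetical names, and the intrinsic selection is simplified.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pick a 64-bit byte swap: compiler intrinsic where known, else shifts. */
#if defined(_MSC_VER)
#  include <stdlib.h>
#  define SWAP64(v) _byteswap_uint64(v)
#elif defined(__GNUC__) || defined(__clang__)
#  define SWAP64(v) __builtin_bswap64(v)
#else
static uint64_t swap64_fallback(uint64_t v)
{
    v = ((v & UINT64_C(0x00FF00FF00FF00FF)) << 8)
      | ((v >> 8)  & UINT64_C(0x00FF00FF00FF00FF));
    v = ((v & UINT64_C(0x0000FFFF0000FFFF)) << 16)
      | ((v >> 16) & UINT64_C(0x0000FFFF0000FFFF));
    return (v << 32) | (v >> 32);
}
#  define SWAP64(v) swap64_fallback(v)
#endif

/* Hypothetical helper: reverse `len` bytes in place, 8 bytes from each
 * end per iteration, then finish the middle one byte at a time. */
static void reverse_bytes_inplace(unsigned char *buf, size_t len)
{
    unsigned char *lo = buf;        /* first byte not yet reversed      */
    unsigned char *hi = buf + len;  /* one past the last such byte      */

    while ((size_t)(hi - lo) >= 16) {
        uint64_t a, b;
        memcpy(&a, lo, 8);          /* unaligned-safe loads             */
        memcpy(&b, hi - 8, 8);
        a = SWAP64(a);
        b = SWAP64(b);
        memcpy(lo, &b, 8);          /* swapped tail chunk goes to front */
        memcpy(hi - 8, &a, 8);      /* swapped head chunk goes to back  */
        lo += 8;
        hi -= 8;
    }
    while (hi - lo >= 2) {          /* fewer than 16 bytes remain       */
        unsigned char t = *lo;
        hi--;
        *lo++ = *hi;
        *hi = t;
    }
}
```

Loading a chunk as an integer, byte-swapping it and storing it back reverses the bytes in memory regardless of host endianness, which is why the same loop works on little- and big-endian builds.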

bulk88 commented Jun 17 '25 10:06

BEFORE 5.41.7

C:\sources\perl5>timeit C:\pb64\bin\perl.exe  -e "my $x = \"X\"x(1024*1000*10);
my $y; for (0..1_000) { $y = reverse \"foo\",$x }"
Exit code      : 0
Elapsed time   : 10.20
Kernel time    : 2.76 (27.1%)
User time      : 7.33 (71.9%)
page fault #   : 2511948
Working set    : 35628 KB
Paged pool     : 92 KB
Non-paged pool : 7 KB
Page file size : 31732 KB

AFTER

C:\sources\perl5>timeit perl -Ilib -e "my $x = \"X\"x(1024*1000*10); my $y; for
(0..1_000) { $y = reverse \"foo\",$x }"
Exit code      : 0
Elapsed time   : 5.64
Kernel time    : 2.96 (52.6%)
User time      : 2.67 (47.3%)
page fault #   : 2511998
Working set    : 35808 KB
Paged pool     : 94 KB
Non-paged pool : 8 KB
Page file size : 31752 KB

bulk88 commented Jun 17 '25 10:06

My PR at #23330 by coincidence is trying to ADD THE INTRINSICS to the Perl C API, that THIS PR WANTS to use. But the ticket has stalled.

50% of the reason there is no progress is b/c input or a vote is needed on the best name/identifier to use for a new macro.

50% of the reason is claims that Perl 5 has no valid production code reason to need to swap ASCII/binary bytes in the C level VM at runtime.

I don't actually think we need ntohll and htonll (originally posted by @Leont in #23330 (comment))

I'll reiterate: this PR as written, plus my MSVC patch on top of Richard's branch that makes this PR pass all CI smoke tests, is still an unclean design. Perl_pp_reverse() shouldn't be keeping platform/CC/OS secrets inside it on how to do a fast, efficient, CPU-native htonll(). There needs to be a centralized, formal Perl C platform API for how to do this, as described in https://github.com/Perl/perl5/pull/23330 .

After I reproduced on Win64 the huge performance boost Richard showed in the OP on Linux, I consider this optimization absolutely needed. I very often use PP reverse to write PP code that de-dupes groups of '\0' terminated C strings, so shorter C strings that are perfect suffixes of longer C strings are de-duped into the longer C strings. Example: short C string "QUERY_METADATA_HANDLE" becomes "EXTRA_QUERY_METADATA_HANDLE"+6 in the src code (the 6 skips the "EXTRA_" prefix).
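
To make the suffix de-dupe concrete, here is a small C sketch of the end result only: given two strings, find the offset at which the shorter one can be served out of the longer one's storage. `suffix_offset` is a hypothetical helper, not anything in Perl core. The reverse step in the Perl-side tooling is useful because reversing every string turns "is a suffix of" into "is a prefix of", so a plain sort puts candidates next to each other.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if `shorter` is a suffix of `longer`, return the
 * byte offset into `longer` where it starts; otherwise return -1. */
static long suffix_offset(const char *longer, const char *shorter)
{
    size_t ll = strlen(longer);
    size_t sl = strlen(shorter);

    if (sl > ll)
        return -1;
    if (memcmp(longer + (ll - sl), shorter, sl) != 0)
        return -1;
    return (long)(ll - sl);
}
```

With the strings from the example, the offset is the length of the skipped "EXTRA_" prefix, so the short string can be replaced by a pointer into the long one at that offset.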

bulk88 commented Jun 18 '25 18:06

Just trying to understand where this PR is at now:

  • I think the two wide-reverse sections can (probably) be merged with a bit of refactoring.
  • I should use the my_swapxx functions instead of _swab_xx_ ?
  • Windows support would have to wait for #23330 ?
  • Is the current state of this PR non-portable anywhere else that we are aware of?

richardleach commented Jun 18 '25 22:06

I can't speak for Windows but the state of the code with https://github.com/Perl/perl5/pull/23374 on top looks right to me.

thesamesam commented Jun 18 '25 22:06

Just trying to understand where this PR is at now:

  • I think the two wide-reverse sections can (probably) be merged with a bit of refactoring.

No opinion. 2 x builtin_bwapU32() #ifdef branch probably needs to stay. Remember 64b IVs on i386 and FooCPU32 are CC emulated. 95% chance FooCPU32 arch won't have a native HW 64 bit asm lang byteswap CPU opcode. RIP AMD x32 and Alpha 32 archs probably have builtin_bwapU64(). IDK/IDC what GCC's and Clang's policy is for SW emulating builtin_bwapU64() on 32 bit ptr CPU archs.

Plus IDK how good or bad GCC's and Clang's SW emulation of builtin_bwapU64() is on 32 bit ptr CPU archs. I know modern blead perl requires a U64 type on all build permutations. But that P5P requirement says absolutely nothing about the fast/slow-ness of FooCC's impl of the U64 type on 32 bit ptr CPU archs.
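
For readers following the 32-bit concern: a 64-bit byte swap can always be built from two 32-bit swaps plus exchanging the halves, which is roughly what a compiler has to do on a CPU with only a 32-bit swap opcode. A minimal sketch, with hypothetical names `swap32` and `swap64_via_32`:

```c
#include <stdint.h>

/* Hypothetical helper: reverse the four bytes of a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return (v << 24)
         | ((v & 0x0000FF00u) << 8)
         | ((v >> 8) & 0x0000FF00u)
         | (v >> 24);
}

/* Hypothetical helper: 64-bit swap built from two 32-bit swaps.
 * The swapped low half becomes the new high half and vice versa. */
static uint64_t swap64_via_32(uint64_t v)
{
    uint32_t lo = (uint32_t)v;
    uint32_t hi = (uint32_t)(v >> 32);

    return ((uint64_t)swap32(lo) << 32) | (uint64_t)swap32(hi);
}
```

How well a particular compiler schedules this on a given 32-bit target is exactly the unknown raised above; the sketch only shows that the decomposition is well defined.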

  • I should use the my_swapxx functions instead of _swab_xx_ ?

Yes, use whatever P5P abstraction ./Configure/perl.h/handy.h/util.c/perl.c/inline.h want you to use. Don't DIY something new just for pp_reverse that nothing else in Perl ecosystem will ever use or know about.

Meh.

It is a cleaner git history, less repo noise, less git blame noise, if the above PR with a centralized byte swapping API is pushed to blead before this PR is pushed to blead. I don't personally care in what order this PR and #23330 ("Win32: htonl/htons/ntohl/ntohs change slow winsock exports -> 1 CPU op/ins") are approved and applied to blead. If #23330 is stuck in bike/yak shaving while this Richard PR is dead quiet, this Richard PR should go first.

Not bulk88's problem to manage logistics, I am not PSC/the pumpking.

  • Is the current state of this PR non-portable anywhere else that we are aware of?

GCC project authors do not currently have a monopoly on the ISO C TC, and will never obtain that monopoly post ~2015 / ~2020. They can't undo their GPL 3 change. But in various statements, I'm not hunting for links ATM, GCC project authors said memcpy() is the universal serializing operator for ISO C.

Some senior GCC dev said in a WWW post/article that GCC's policy is that memcpy() is the direct equivalent of .toString() or .toJSON(), and GCC's policy/position statement is that any attempt to use ISO C's union grammar token, or attempting to use a typedef with a #pragma packed(1) and a union, either on GCC or any CC, is against GCC's official policy/position statement.

IDK and I am not going to research, what the other "C compiler project devs" technical policy/position statements are. I'm not a lawyer collecting discovery, depositions, and exhibits for trial.

Read my "attempting to use a typedef with a #pragma packed(1) and a union" sentence very carefully. Some ISO C/ANSI/IETF/IEEE/FAANG related humans say that is correct and GCC project is blatantly wrong.

Assuming memcpy() is ISO C's synonym for fast unaligned mem reads/writes in ASM lang, is highly illegal !!!

Reason is because inlining libc.so's external linkage memcpy() symbol violates all of the /bin/ld rules and all of the LD_PRELOAD/ELF Sym Interposition rules.

Real world dangers, what will these compilers do when blead Perl does a memcpy(d, s, 4); ?

  • TinyCC
  • HPUX's aCC
  • Solaris, whatever CC it uses
  • VMS, whatever CC it uses
  • IBM's Z/OS's and AIX's xLC https://en.wikipedia.org/wiki/IBM_XL_C/C%2B%2B_Compilers
  • MSVC [contact @bulk88 to quickly fix it]
  • Arm LLC's ArmCC
  • Intel C for GCC/Posix environment
  • Intel C for MSVC [egh .... @bulk88 hasn't tried a build for 7 years, @bulk88's icc is from 2013, worst case ICC 4 VC will do exactly what mainline MSVC will do]

What other Unix C compilers did I miss?
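
For reference, this is the memcpy() idiom under discussion, sketched with hypothetical helper names `load_u32` / `store_u32`. The code is valid ISO C on any of the compilers listed; what varies, and what the question above is about, is whether a given compiler turns the 4-byte memcpy() into a single load/store or emits a real function call.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from a possibly unaligned
 * address. The copy is written as memcpy(); whether it is inlined to
 * one load is up to the compiler and its flags. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: write a 32-bit value to a possibly unaligned
 * address, with the same caveat about inlining. */
static void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```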

bulk88 commented Jun 18 '25 23:06

Some senior GCC dev said in a WWW post/article that GCC's policy is that memcpy() is the direct equivalent of .toString() or .toJSON(), and GCC's policy/position statement is that any attempt to use ISO C's union grammar token, or attempting to use a typedef with a #pragma packed(1) and a union, either on GCC or any CC, is against GCC's official policy/position statement.

IDK and I am not going to research, what the other "C compiler project devs" technical policy/position statements are. I'm not a lawyer collecting discovery, depositions, and exhibits for trial.

Read my "attempting to use a typedef with a #pragma packed(1) and a union" sentence very carefully. Some ISO C/ANSI/IETF/IEEE/FAANG related humans say that is correct and GCC project is blatantly wrong.

Assuming memcpy() is ISO C's synonym for fast unaligned mem reads/writes in ASM lang, is highly illegal !!!

Reason is because inlining libc.so's external linkage memcpy() symbol violates all of the /bin/ld rules and all of the LD_PRELOAD/ELF Sym Interposition rules.

According to https://developer.arm.com/documentation/ka003038/latest/ Arm LLC's official policy is that typedef/packed/union/pointer cast is the correct C lang tool to use for safe unaligned memory access on their platform. They omitted any reference to GCC's policy/opinion of using the linker function symbol memcpy(). I'll note this makes sense, since Arm corporate probably isn't in a position to tell end users in their API docs to call some unknown author's unspecified libc implementation of memcpy() on an unknown OS that Arm and the Arm C/C++ Compiler's devs have little to no control over.
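
A minimal sketch of the packed-struct style of unaligned access, for comparison with the memcpy() style. It is not copied from the Arm document: it uses GCC/Clang attribute syntax and MSVC's `#pragma pack` (both compiler extensions, not ISO C) rather than Arm Compiler's own keyword, and `struct unaligned_u32` and `load_u32_packed` are hypothetical names.

```c
#include <stdint.h>

/* A struct with alignment 1 tells the compiler the member may sit at
 * any address, so it must emit an unaligned-safe load for it. */
#if defined(__GNUC__) || defined(__clang__)
/* may_alias additionally exempts the type from strict-aliasing rules. */
struct unaligned_u32 { uint32_t v; } __attribute__((packed, may_alias));
#elif defined(_MSC_VER)
#  pragma pack(push, 1)
struct unaligned_u32 { uint32_t v; };
#  pragma pack(pop)
#else
#  error "no packed-struct syntax known for this compiler in this sketch"
#endif

/* Hypothetical helper: read a 32-bit value from a possibly unaligned
 * address through the packed struct instead of through memcpy(). */
static uint32_t load_u32_packed(const void *p)
{
    return ((const struct unaligned_u32 *)p)->v;
}
```

The trade-off is portability: every compiler needs its own spelling of "packed", which is the kind of per-platform knowledge a centralized API would hide.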

bulk88 commented Jun 22 '25 20:06

Assuming memcpy() is ISO C's synonym for fast unaligned mem reads/writes in ASM lang, is highly illegal !!!

Reason is because inlining libc.so's external linkage memcpy() symbol violates all of the /bin/ld rules and all of the LD_PRELOAD/ELF Sym Interposition rules.

Real world dangers, what will these compilers do when blead Perl does a memcpy(d, s, 4); ?

https://github.com/torvalds/linux/blob/v3.13/arch/x86/vdso/vclock_gettime.c#L272

Linux doesn't trust GCC's memcpy() identifier.

bulk88 commented Jun 29 '25 21:06

The kernel is in a different position to Perl: the kernel will get build failures w/o an inlined memcpy.

thesamesam commented Jun 29 '25 23:06

https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

bulk88 commented Jul 17 '25 14:07

Unless there are any showstopping mistakes in this PR, I'd like to merge it soon.

(Any further finessing can come in follow-up PRs.)

richardleach commented Oct 26 '25 19:10